Figure 3 - uploaded by Weirong Jiang
The parse tree for regular expression "b*c(a|b)*[ac]#".


Source publication
Conference Paper
Full-text available
In this paper we present a novel architecture for high-speed and high-capacity regular expression matching (REM) on FPGA. The proposed REM architecture, based on nondeterministic finite automaton (RE-NFA), efficiently constructs regular expression matching engines (REME) of arbitrary regular patterns and character classes in a uniform structure, uti...

Similar publications

Conference Paper
The modern Web requires new ways for creating applications. We present our approach combining a web framework with a modern object-oriented database. It makes it easier to develop web applications by raising the level of abstraction. In contrast to many existing solutions, where the business logic is developed in an object-oriented programming langu...

Citations

... There are also hardware-based implementations that make use of the inherent parallelism that is available in hardware, where circuit elements compute independently and in parallel. There are several proposals that use field-programmable gate arrays (FPGAs) to support the matching of regexes [138,139,140,73,74,141,142,143]. A number of digital application-specific integrated circuit (ASIC) accelerators [52] have been designed to achieve high throughput for regex matching. ...
Thesis
Recent technological advances are causing an enormous proliferation of streaming data, i.e., data that is generated in real-time. Such data is produced at an overwhelming rate that cannot be processed in traditional manners. This thesis aims to provide programming language support for real-time data processing through three approaches: (1) creating a language for specifying complex computations over real-time data streams, (2) developing software-hardware co-design to efficiently match regular patterns in a streaming setting, and (3) designing a system for parallel stream processing with the preservation of sequential semantics. The first part of this thesis introduces StreamQL, a high-level language for specifying complex streaming computations through a combination of stream transformations. StreamQL integrates relational, dataflow, and temporal constructs, offering an expressive and modular approach for programming streaming computations. Performance comparisons against popular streaming engines show that the StreamQL library consistently achieves higher throughput, making it a useful tool for prototyping complex real-world streaming algorithms. The second part of this thesis focuses on hardware acceleration for regular pattern matching, specifically targeting the matching of regular expressions with bounded repetitions. A hardware architecture inspired by nondeterministic counter automata is presented, which uses counter and bit vector modules to efficiently handle bounded repetitions. A regex-to-hardware compiler is developed in this work, which provides static analysis over regular expressions and translates them into hardware-recognizable programs. Experimental results show that our solution provides significant improvements in energy efficiency and area reduction compared to existing solutions. 
Finally, this thesis presents a novel programming system for parallelizing the processing of streaming data on multicore CPUs with the preservation of sequential semantics. This system addresses challenges in preserving the sequential semantics when dealing with identical timestamps, dynamic item rates, and non-linear task parallelism. A Rust library called ParaStream is developed to support semantics-preserving parallelism in stream processing, outperforming state-of-the-art tools in terms of single-threaded throughput and scalability. Real-world benchmarks show substantial performance gains with increasing degrees of parallelism, highlighting the practicality and efficiency of ParaStream.
... HARE achieves a 32 Gbps throughput but has limited support for Kleene operators (which only allow single ...). Many prior works [5,48] focus on FPGA and GPU hardware architectures to take advantage of their configurability and parallelism. [67] and [51] provide support for regexes with counting on FPGA hardware. [63] extends the DFA ambiguity expressed in [49] to NFA with counters by defining the character class ambiguity, a problem that arises when the intersection between two adjacent character classes with constrained repetitions (CCR) is non-empty. ...
Preprint
Regular pattern matching is used in numerous application domains, including text processing, bioinformatics, and network security. Patterns are typically expressed with an extended syntax of regular expressions that includes the computationally challenging construct of bounded iteration or counting, which describes the repetition of a pattern a fixed number of times. We develop a design for a specialized in-memory hardware architecture for NFA execution that integrates counter and bit vector elements. The design is inspired by the theoretical model of nondeterministic counter automata (NCA). A key feature of our approach is that we statically analyze regular expressions to determine bounds on the amount of memory needed for the occurrences of counting. The results of this analysis are used by a regex-to-hardware compiler in order to make an appropriate selection of counter or bit vector elements. We evaluate the performance of our hardware implementation on a simulator based on circuit parameters collected by SPICE simulation using a TSMC 28nm process. We find that the use of counter and bit vector elements quickly outperforms unfolding solutions by orders of magnitude for small counting quantifiers. Experiments concerning realistic workloads show up to 76% energy reduction and 58% area reduction in comparison to traditional in-memory NFA processors.
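The counter-based handling of bounded repetition described in this abstract can be illustrated with a small sketch. Everything below is our own illustration, not the paper's architecture: a single counter tracks the repetition count for the pattern `a{3,5}b`, instead of unfolding the quantifier into five copies of the state as a plain NFA would.

```python
# Hedged sketch (illustrative, not the cited design): matching a{3,5}b
# with one counter instead of five unfolded NFA states.

def match_counted(text: str) -> bool:
    """Accepts strings of the form a{3,5}b using a single counter."""
    LO, HI = 3, 5
    count = 0
    i = 0
    # Consume the counted run of 'a's, incrementing the counter.
    while i < len(text) and text[i] == 'a':
        count += 1
        i += 1
        if count > HI:            # counter overflow: too many repetitions
            return False
    if not (LO <= count <= HI):   # counter must land inside [LO, HI]
        return False
    # The remainder must be exactly the trailing 'b'.
    return text[i:] == 'b'

assert match_counted("aaab")
assert match_counted("aaaaab")
assert not match_counted("aab")       # too few repetitions
assert not match_counted("aaaaaab")   # too many repetitions
```

The point of the counter (or, for larger bounds, a bit vector) is that hardware cost grows with the width of the counter, not with the bound itself, which is what makes unfolding lose by orders of magnitude for large quantifiers.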
... SQLi detection attempts to identify possible web-based attacks by examining packet payloads for known attack data. The SQLi implementation uses a regular expression matching engine (REME) to find keywords in the GET and POST request lines of an HTTP packet [25]. In the design, a REME can take at most 64 input characters. ...
... In our system, TCPreplay 6 is used to send packets ranging in size from 54 to 1514 bytes through SQLi detectors via 10 Gbps ports at varying speeds. Our regular expression matching engine has similarities to a previous implementation [25] except that a CAM is used to preprocess each character. ...
Article
Network function virtualization (NFV) is a powerful networking approach that leverages computing resources to perform a time-varying set of network processing functions. Although microprocessors can be used for this purpose, their performance limitations and lack of specialization present implementation challenges. In this article, we describe a new heterogeneous hardware-software NFV platform called CoNFV that provides scalability and programmability while supporting significant hardware-level parallelism and reconfiguration. Our computing platform takes advantage of both field-programmable gate arrays (FPGAs) and microprocessors to implement numerous virtual network functions (VNF) that can be dynamically customized to specific network flow needs. The most distinctive feature of our system is the use of global network state to coordinate NFV operations. Traffic management and hardware reconfiguration functions are performed by a global coordinator that allows for the rapid sharing of network function states and continuous evaluation of network function needs. With the help of the state-sharing mechanism offered by the coordinator, customer-defined VNF instances can be easily migrated between heterogeneous middleboxes as the network environment changes. A resource allocation and scheduling algorithm dynamically assesses resource deployments as network flows and conditions are updated. We show that our deployment algorithm can successfully reallocate FPGA and microprocessor resources in a fraction of a second in response to changes in network flow capacity and network security threats including intrusion.
... Prior work has investigated implementing finite automata processing on FPGAs [28,119,196,204,241,256]. Because automata can be thought of as circuits, where each state transition element is a specialized logic gate, they can be naturally implemented in an FPGA fabric. ...
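The "automata as circuits" view in the excerpt above can be made concrete: each NFA state becomes one flip-flop, and every clock tick updates all active states in parallel. The toy NFA and its bitmask (one-hot) encoding below are our own illustration, with Python standing in for hardware; none of it comes from a cited design.

```python
# Illustrative one-hot NFA simulation: each state is one bit of `active`,
# and a step ORs together successor masks, mimicking parallel logic gates.
# The 3-state NFA accepts strings over {a,b} that end in "ab".

# DELTA[state][symbol] -> bitmask of successor states
DELTA = {
    0: {'a': 0b011, 'b': 0b001},  # state 0 self-loops; 'a' also enters state 1
    1: {'a': 0b000, 'b': 0b100},  # 'b' after an 'a' reaches accepting state 2
    2: {'a': 0b000, 'b': 0b000},
}
ACCEPT = 0b100                    # state 2 is accepting

def nfa_step(active: int, sym: str) -> int:
    """One clock tick: every active state fires its transitions 'in parallel'."""
    nxt = 0
    for s in range(3):
        if active & (1 << s):
            nxt |= DELTA[s].get(sym, 0)
    return nxt

def matches(text: str) -> bool:
    active = 0b001                # start in state 0 only
    for sym in text:
        active = nfa_step(active, sym)
    return bool(active & ACCEPT)

assert matches("ab")
assert matches("bbab")
assert not matches("aba")
```

In an FPGA fabric the per-state OR/AND logic above becomes actual gates and the `active` vector becomes flip-flops, so the whole step costs one clock cycle regardless of how many states are active.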
Thesis
The adoption of hardware accelerators, such as Field-Programmable Gate Arrays, into general-purpose computation pipelines continues to rise, driven by recent trends in data collection and analysis as well as pressure from challenging physical design constraints in hardware. The architectural designs of many of these accelerators stand in stark contrast to the traditional von Neumann model of CPUs. Consequently, existing programming languages, maintenance tools, and techniques are not directly applicable to these devices, meaning that additional architectural knowledge is required for effective programming and configuration. Current programming models and techniques are akin to assembly-level programming on a CPU, thus placing significant burden on developers tasked with using these architectures. Because programming is currently performed at such low levels of abstraction, the software development process is tedious and challenging and hinders the adoption of hardware accelerators. This dissertation explores the thesis that theoretical finite automata provide a suitable abstraction for bridging the gap between high-level programming models and maintenance tools familiar to developers and the low-level hardware representations that enable high-performance execution on hardware accelerators. We adopt a principled hardware/software co-design methodology to develop a programming model providing the key properties that we observe are necessary for success, namely performance and scalability, ease of use, expressive power, and legacy support. First, we develop a framework that allows developers to port existing, legacy code to run on hardware accelerators by leveraging automata learning algorithms in a novel composition with software verification, string solvers, and high-performance automata architectures. 
Next, we design a domain-specific programming language to aid programmers writing pattern-searching algorithms and develop compilation algorithms to produce finite automata, which supports efficient execution on a wide variety of processing architectures. Then, we develop an interactive debugger for our new language, which allows developers to accurately identify the locations of bugs in software while maintaining support for high-throughput data processing. Finally, we develop two new automata-derived accelerator architectures to support additional applications, including the detection of security attacks and the parsing of recursive and tree-structured data. Using empirical studies, logical reasoning, and statistical analyses, we demonstrate that our prototype artifacts scale to real-world applications, maintain manageable overheads, and support developers' use of hardware accelerators. Collectively, the research efforts detailed in this dissertation help ease the adoption and use of hardware accelerators for data analysis applications, while supporting high-performance computation.
... Operators that are most resource-consuming can receive optimized implementations. Constrained repetition, e.g., can create long NFAs (as seen in states 2 through 5 in Figure 6d) and is thus typically converted to counters [54]-[56] or shift registers [57]. Subsequent work was mostly devoted to addressing specific scenarios and constraints, mainly area or performance. ...
... Multi-character input is the typical approach to increase throughput for regular expression matching [57]- [60], as observed for exact string matching as well. Although it substantially increases area, throughput above 10 Gbps is reported in [57]- [59]. In [60] the authors propose a pre-processing stage to improve area efficiency, while still handling up to 7.27 Gbps. ...
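Multi-character input (multi-striding), the throughput technique these excerpts describe, can be sketched by precomposing a DFA's transition function so that one table lookup consumes k symbols: the table grows roughly with |alphabet|^k, but the number of steps (clock cycles) drops by a factor of k. The DFA, function names, and k=2 choice below are our own illustrative assumptions.

```python
# Sketch of 2-stride DFA construction: compose the single-symbol transition
# function with itself so each lookup consumes two input characters.
from itertools import product

def stride(delta, states, alphabet, k=2):
    """Build a k-stride transition table: one lookup per k input symbols."""
    wide = {}
    for s in states:
        for syms in product(alphabet, repeat=k):
            t = s
            for c in syms:            # walk k single-symbol transitions
                t = delta[(t, c)]
            wide[(s, ''.join(syms))] = t
    return wide

# DFA over {a,b} accepting strings that contain "ab" (state 2 is sticky/accepting)
delta = {(0, 'a'): 1, (0, 'b'): 0,
         (1, 'a'): 1, (1, 'b'): 2,
         (2, 'a'): 2, (2, 'b'): 2}
wide = stride(delta, [0, 1, 2], 'ab', k=2)

def run2(text):
    s = 0
    for i in range(0, len(text), 2):  # assumes even-length input for simplicity
        s = wide[(s, text[i:i + 2])]
    return s == 2

assert run2("ab")
assert run2("baab")
assert not run2("bbaa")
```

The area/throughput trade-off the excerpts mention shows up directly here: the strided table has |states| * |alphabet|^k entries, which is what "substantially increases area" refers to.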
Article
Network Functions Virtualization (NFV) has received considerable attention in the past few years, both from industry and academia, due to its potential for reducing capital and operational expenditures, thus enabling faster innovation in networks. NFV proposes decoupling network functions from fixed hardware platforms and implementing them as virtual machines on off-the-shelf servers. Although potentially able to provide the mentioned benefits, software-based implementations of compute-intensive network functions still struggle to perform at the desired speed, especially when requiring line-rate processing for ever-faster communication links. To address this, hardware acceleration can be used to improve the throughput and latency of virtualized network functions (VNFs). In order to provide the advantages foreseen for NFV, however, such accelerators cannot be fixed and application-specific hardware, since they need to cope with new VNFs as well as fluctuations in demand. In this context, Field-Programmable Gate Arrays (FPGAs) are particularly suitable, since they are able to provide high-performance implementations of network functions and they are completely reprogrammable, thereby being able to implement different VNFs even after deployment. There have been many recent efforts to enable the use of FPGAs in the NFV context, including efficient implementations of network functions on FPGAs, platforms to manage the integration and coexistence of multiple VNFs on an FPGA, and high-level synthesis tools especially tailored to ease the programming of VNFs for FPGAs. In this work we survey previous work covering these aspects, and discuss the main open research challenges that must be addressed before FPGA adoption in NFV infrastructures becomes effectively seamless and efficient.
... Note that the previously described design of REMU is one of the many methods of implementing regex search using FPGA. It can be replaced by other designs, such as NFA [47,61], DFA [14,27], B-FSM [33,56], bit-split automata [17,51], and so forth. Each method has its advantage and best applicable field. ...
... Extensive research has been reported in accelerating regex search over the past decade. Some researchers take advantage of GPU [30,62], SIMD [8,31,44], and multicore architectures [38], whereas others focus on FPGA/ASIC-based solutions [7,14,17,23,27,43,47,48,52,61]. ...
... Early work discusses regex search in FPGA/ASIC by mapping non-deterministic finite automata (NFA) [47,61] and deterministic finite automata (DFA) [7,14,27] to programmable logic. A recent study by Gogte et al. [17] proposed HARE, extended from Tandon et al. [52], which compiles regexes into subexpressions and runs bit-split automata [51] on each subexpression in parallel to achieve high throughput. ...
Article
This article presents REGISTOR, a platform for regular expression grabbing inside storage. The main idea of Registor is accelerating regular expression (regex) search inside storage, where large datasets are stored, eliminating the I/O bottleneck problem. A special hardware engine for regex search is designed and augmented inside a flash SSD that processes data on-the-fly during data transmission from NAND flash to host. To match the speed of regex search to the internal bus speed of a modern SSD, a deep pipeline structure is designed in Registor hardware consisting of a file semantics extractor, matching candidates finder, regex matching units (REMUs), and results organizer. Furthermore, each stage of the pipeline exploits maximal parallelism. To make Registor readily usable by high-level applications, we have developed a set of APIs and libraries in Linux allowing Registor to process files in the SSD by recombining separate data blocks into files efficiently. A working prototype of Registor has been built in our newly designed NVMe-SSD. Extensive experiments and analyses have been carried out to show that Registor achieves high throughput, reduces the I/O bandwidth requirement by up to 97%, and reduces CPU utilization by as much as 82% for regex search in large datasets.
... Helios [5] is another accelerator that processes regexps for network packet inspection at line rate. In addition, several works [189,215,156,36,214] propose mechanisms to match regexps on FPGAs. They focus on building a finite automaton and encode it in the logic of the FPGA. ...
Thesis
Emerging persistent memory (PM) technologies promise the performance of DRAM with the durability of disk. However, several challenges remain in existing hardware, programming, and software systems that inhibit wide-scale PM adoption. This thesis focuses on building efficient mechanisms that span hardware and operating systems, and programming languages for integrating PMs in future systems. First, this thesis proposes a mechanism to solve the low-endurance problem in PMs. PMs suffer from limited write endurance---PM cells can be written only 10^7-10^9 times before they wear out. Without any wear management, PM lifetime might be as low as 1.1 months. This thesis presents Kevlar, an OS-based wear-management technique for PM that requires no new hardware. Kevlar uses existing virtual memory mechanisms to remap pages, enabling it to perform both wear leveling---shuffling pages in PM to even wear; and wear reduction---transparently migrating heavily written pages to DRAM. Crucially, Kevlar avoids the need for hardware support to track wear at fine grain. It relies on a novel wear-estimation technique that builds upon Intel's Precise Event Based Sampling to approximately track processor cache contents via a software-maintained Bloom filter and estimate write-back rates at fine grain. Second, this thesis proposes a persistency model for high-level languages to enable integration of PMs into future programming systems. Prior works extend language memory models with a persistency model prescribing semantics for updates to PM. These approaches require high-overhead mechanisms, are restricted to certain synchronization constructs, provide incomplete semantics, and/or may recover to state that cannot arise in fault-free program execution. This thesis argues for persistency semantics that guarantee failure atomicity of synchronization-free regions (SFRs) --- program regions delimited by synchronization operations. 
The proposed approach provides clear semantics for the PM state that recovery code may observe and extends C++11's "sequential consistency for data-race-free" guarantee to post-failure recovery code. To this end, this thesis investigates two designs for failure-atomic SFRs that vary in performance and the degree to which commit of persistent state may lag execution. Finally, this thesis proposes StrandWeaver, a hardware persistency model that minimally constrains ordering on PM operations. Several language-level persistency models have emerged recently to aid programming recoverable data structures in PM. The language-level persistency models are built upon hardware primitives that impose stricter ordering constraints on PM operations than the persistency models require. StrandWeaver manages PM order within a strand, a logically independent sequence of PM operations within a thread. PM operations that lie on separate strands are unordered and may drain concurrently to PM. StrandWeaver implements primitives under strand persistency to allow programmers to improve concurrency and relax ordering constraints on updates as they drain to PM. Furthermore, StrandWeaver proposes mechanisms that map persistency semantics in high-level language persistency models to the primitives implemented by StrandWeaver.
... It is important to note that patterns matched by regexps can easily [7] be matched by an automaton, such as Deterministic Finite Automata (DFA) or Nondeterministic Finite Automata (NFA). A variation of the former and the latter automata is called the hybrid DFA-NFA (hybrid-FA) [6]. ...
... The design by [7] is an automatic architectural optimisation approach which spatially stacks regexp matching circuits (REMs). It then forms multiple character matching (MCM) circuits. ...
... The structure is aimed at improving the overall design clock speed [31]. However, the challenge with the architecture designed in [7] is that the process of distributing and buffering the character matching signals was initially error-prone and difficult to implement manually. To address the problem, the approach proposed in [25] used a heuristic that automatically marshaled k-REMs with total N-states into p-pipelines. ...
Article
There are various kinds of network attacks, often identifiable by the patterns of data they contain. More complex regular expressions that express these patterns need to be matched at very high speed. Most hardware-based approaches build the equivalent automata using minimal hardware resources to detect pattern variations. This paper explains the design, structure, and suitability of an optimized hardware-based automata implementation called Equivalence Class Direct Table Synthesis Nondeterministic Finite Automata (ECD-RTS-NFA). The optimized approach described in this paper builds upon the earlier published version called Equivalence Class Descriptor Nondeterministic Finite Automata (ECD-NFA). The ECD-RTS-NFA also uses an Equivalence Classification (EC) technique. However, the ECD-RTS-NFA approach utilizes a newer form of table compression for its compressed ECs, called Equivalence Class Descriptors (ECDs). The ECDs are then used to match against multi-character strings rather than the initial single-character approach implemented in the ECD-NFA design. The optimized technique implemented in the ECD-RTS-NFA further improves the matching speed of the design while at the same time significantly reducing the overall resources required. This is achieved by taking full advantage of the Field Programmable Gate Array (FPGA) technology used for the hardware implementation. The design further provides higher throughput and support for quick updates, and clocks at 385.78 MHz, with a maximum throughput of 12.34 Gigabits per second (Gbps), representing a 3.35% improvement over the next best rival design in this paper.
... A lot of researchers tried to optimize hardware architectures in order to reduce FPGA logic utilization and support more REs, because RE matching has been primarily motivated by intrusion detection systems with a large set of REs. Throughput of the NFA architectures has been increased by multi-striding [1,3] and spatial-stacking [11,12]. Both these techniques increase the throughput by processing multiple input symbols at a single step. ...
... The circuit frequency drops because of the increasing complexity of transition logic, which limits the throughput to the order of tens of Gbps [1,3,7]. Spatial stacking suffers from the same frequency drop and throughput limitation in the order of tens of Gbps [11,12]. Both techniques are able to scale the throughput only to tens of Gbps. ...
Conference Paper
Regular expression matching (RE matching) is a widely used operation in network security monitoring applications. With the speed of network links increasing to 100 Gbps and 400 Gbps, it is necessary to speed up packet processing and provide RE matching at such high speeds. Although many RE matching algorithms and architectures have been designed, none of them supports 100 Gbps throughput together with fast updates of an RE set. Therefore, this paper focuses on the design of a new hardware architecture that addresses both these requirements. The proposed architecture uses multiple highly memory-efficient Delayed Input DFAs (D²FAs), which are organized into a processing pipeline. As all D²FAs in the pipeline have only local communication, the proposed architecture is able to operate at high frequency even for a large number of parallel engines, which allows scaling throughput to hundreds of gigabits per second. The paper also analyses how to scale the number of engines and the capacity of buffers to achieve desired throughput. Using the parameters obtained while matching a sample RE set represented by a D²FA in real network traffic, the architecture can be tuned for wire-speed throughput of 400 Gbps.
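The memory efficiency of the Delayed Input DFA (D²FA) mentioned in this abstract comes from default transitions: states that mostly agree share their transitions through a "default" edge, which is followed without consuming input until the symbol resolves. The tiny two-state automaton below is our own example of that compression, not the paper's design.

```python
# Hedged sketch of D2FA compression: state 1 stores no transitions of its
# own and defers everything to its default state 0, which it agrees with.

# Full DFA over {a,b,c}: states 0 and 1 have identical outgoing transitions.
full = {
    0: {'a': 1, 'b': 0, 'c': 0},
    1: {'a': 1, 'b': 0, 'c': 0},
}

# D2FA: only differences from the default state are stored explicitly.
d2fa = {
    0: {'trans': {'a': 1, 'b': 0, 'c': 0}, 'default': None},
    1: {'trans': {}, 'default': 0},   # all of state 1's moves deferred to 0
}

def d2fa_next(state, sym):
    """Follow default edges (consuming no input) until sym resolves."""
    while sym not in d2fa[state]['trans']:
        state = d2fa[state]['default']
    return d2fa[state]['trans'][sym]

def run(step, text):
    s = 0
    for c in text:
        s = step(s, c)
    return s

# The compressed table answers every query exactly like the full DFA.
assert all(d2fa_next(s, c) == full[s][c] for s in full for c in 'abc')
assert run(d2fa_next, "abca") == 1
```

The cost of the saved memory is the extra default-edge hops per input symbol, which is exactly the "delayed input" behavior the pipeline in the architecture above has to absorb.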