Figure 3 - uploaded by Weirong Jiang
The parse tree for regular expression "b*c(a|b)*[ac]#".


Source publication
Conference Paper
Full-text available
In this paper we present a novel architecture for high-speed and high-capacity regular expression matching (REM) on FPGA. The proposed REM architecture, based on nondeterministic finite automaton (RE-NFA), efficiently constructs regular expression matching engines (REME) of arbitrary regular patterns and character classes in a uniform structure, uti...

Similar publications

Conference Paper
The modern Web requires new ways for creating applications. We present our approach combining a web framework with a modern object-oriented database. It makes it easier to develop web applications by raising the level of abstraction. In contrast to many existing solutions, where the business logic is developed in an object-oriented programming langu...

Citations

... There are also hardware-based implementations that make use of the inherent parallelism that is available in hardware, where circuit elements compute independently and in parallel. There are several proposals that use field-programmable gate arrays (FPGAs) to support the matching of regexes [138,139,140,73,74,141,142,143]. A number of digital application-specific integrated circuit (ASIC) accelerators [52] have been designed to achieve high throughput for regex matching. ...
Thesis
Recent technological advances are causing an enormous proliferation of streaming data, i.e., data that is generated in real-time. Such data is produced at an overwhelming rate that cannot be processed in traditional manners. This thesis aims to provide programming language support for real-time data processing through three approaches: (1) creating a language for specifying complex computations over real-time data streams, (2) developing software-hardware co-design to efficiently match regular patterns in a streaming setting, and (3) designing a system for parallel stream processing with the preservation of sequential semantics. The first part of this thesis introduces StreamQL, a high-level language for specifying complex streaming computations through a combination of stream transformations. StreamQL integrates relational, dataflow, and temporal constructs, offering an expressive and modular approach for programming streaming computations. Performance comparisons against popular streaming engines show that the StreamQL library consistently achieves higher throughput, making it a useful tool for prototyping complex real-world streaming algorithms. The second part of this thesis focuses on hardware acceleration for regular pattern matching, specifically targeting the matching of regular expressions with bounded repetitions. A hardware architecture inspired by nondeterministic counter automata is presented, which uses counter and bit vector modules to efficiently handle bounded repetitions. A regex-to-hardware compiler is developed in this work, which provides static analysis over regular expressions and translates them into hardware-recognizable programs. Experimental results show that our solution provides significant improvements in energy efficiency and area reduction compared to existing solutions. 
Finally, this thesis presents a novel programming system for parallelizing the processing of streaming data on multicore CPUs with the preservation of sequential semantics. This system addresses challenges in preserving the sequential semantics when dealing with identical timestamps, dynamic item rates, and non-linear task parallelism. A Rust library called ParaStream is developed to support semantics-preserving parallelism in stream processing, outperforming state-of-the-art tools in terms of single-threaded throughput and scalability. Real-world benchmarks show substantial performance gains with increasing degrees of parallelism, highlighting the practicality and efficiency of ParaStream.
... HARE achieves a 32 Gbps throughput but has limited support for Kleene operators (which only allow single ...). Many prior works [5,48] focus on FPGA and GPU hardware architectures to take advantage of their configurability and parallelism. [67] and [51] provide support for regexes with counting on FPGA hardware. [63] extends the DFA ambiguity expressed in [49] to NFA with counters by defining the character class ambiguity, a problem that arises when the intersection between two adjacent character classes with constrained repetitions (CCR) is non-empty. ...
Preprint
Regular pattern matching is used in numerous application domains, including text processing, bioinformatics, and network security. Patterns are typically expressed with an extended syntax of regular expressions that includes the computationally challenging construct of bounded iteration or counting, which describes the repetition of a pattern a fixed number of times. We develop a design for a specialized in-memory hardware architecture for NFA execution that integrates counter and bit vector elements. The design is inspired by the theoretical model of nondeterministic counter automata (NCA). A key feature of our approach is that we statically analyze regular expressions to determine bounds on the amount of memory needed for the occurrences of counting. The results of this analysis are used by a regex-to-hardware compiler in order to make an appropriate selection of counter or bit vector elements. We evaluate the performance of our hardware implementation on a simulator based on circuit parameters collected by SPICE simulation using a TSMC 28nm process. We find that the use of counter and bit vector elements quickly outperforms unfolding solutions by orders of magnitude for small counting quantifiers. Experiments concerning realistic workloads show up to 76% energy reduction and 58% area reduction in comparison to traditional in-memory NFA processors.
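The counter-based handling of bounded repetition described in this abstract can be illustrated with a small sketch. Everything below is our own illustration, not the paper's architecture: a single counter tracks the repetition count for the pattern `a{3,5}b`, instead of unfolding the quantifier into five copies of the state as a plain NFA would.

```python
# Hedged sketch (illustrative, not the cited design): matching a{3,5}b
# with one counter instead of five unfolded NFA states.

def match_counted(text: str) -> bool:
    """Accepts strings of the form a{3,5}b using a single counter."""
    LO, HI = 3, 5
    count = 0
    i = 0
    # Consume the counted run of 'a's, incrementing the counter.
    while i < len(text) and text[i] == 'a':
        count += 1
        i += 1
        if count > HI:            # counter overflow: too many repetitions
            return False
    if not (LO <= count <= HI):   # counter must land inside [LO, HI]
        return False
    # The remainder must be exactly the trailing 'b'.
    return text[i:] == 'b'

assert match_counted("aaab")
assert match_counted("aaaaab")
assert not match_counted("aab")       # too few repetitions
assert not match_counted("aaaaaab")   # too many repetitions
```

The point of the counter (or, for larger bounds, a bit vector) is that hardware cost grows with the width of the counter, not with the bound itself, which is what makes unfolding lose by orders of magnitude for large quantifiers.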
... SQLi detection attempts to identify possible web-based attacks by examining packet payloads for known attack data. The SQLi implementation uses a regular expression matching engine (REME) to find keywords in the GET and POST request lines of an HTTP packet [25]. In the design, a REME can take at most 64 input characters. ...
... In our system, TCPreplay 6 is used to send packets ranging in size from 54 to 1514 bytes through SQLi detectors via 10 Gbps ports at varying speeds. Our regular expression matching engine has similarities to a previous implementation [25] except that a CAM is used to preprocess each character. ...
Article
Network function virtualization (NFV) is a powerful networking approach that leverages computing resources to perform a time-varying set of network processing functions. Although microprocessors can be used for this purpose, their performance limitations and lack of specialization present implementation challenges. In this article, we describe a new heterogeneous hardware-software NFV platform called CoNFV that provides scalability and programmability while supporting significant hardware-level parallelism and reconfiguration. Our computing platform takes advantage of both field-programmable gate arrays (FPGAs) and microprocessors to implement numerous virtual network functions (VNF) that can be dynamically customized to specific network flow needs. The most distinctive feature of our system is the use of global network state to coordinate NFV operations. Traffic management and hardware reconfiguration functions are performed by a global coordinator that allows for the rapid sharing of network function states and continuous evaluation of network function needs. With the help of the state-sharing mechanism offered by the coordinator, customer-defined VNF instances can be easily migrated between heterogeneous middleboxes as the network environment changes. A resource allocation and scheduling algorithm dynamically assesses resource deployments as network flows and conditions are updated. We show that our deployment algorithm can successfully reallocate FPGA and microprocessor resources in a fraction of a second in response to changes in network flow capacity and network security threats including intrusion.
... Prior work has investigated implementing finite automata processing on FPGAs [28,119,196,204,241,256]. Because automata can be thought of as circuits, where each state transition element is a specialized logic gate, they can be naturally implemented in an FPGA fabric. ...
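The "automata as circuits" view in the excerpt above can be made concrete: each NFA state becomes one flip-flop, and every clock tick updates all active states in parallel. The toy NFA and its bitmask (one-hot) encoding below are our own illustration, with Python standing in for hardware; none of it comes from a cited design.

```python
# Illustrative one-hot NFA simulation: each state is one bit of `active`,
# and a step ORs together successor masks, mimicking parallel logic gates.
# The 3-state NFA accepts strings over {a,b} that end in "ab".

# DELTA[state][symbol] -> bitmask of successor states
DELTA = {
    0: {'a': 0b011, 'b': 0b001},  # state 0 self-loops; 'a' also enters state 1
    1: {'a': 0b000, 'b': 0b100},  # 'b' after an 'a' reaches accepting state 2
    2: {'a': 0b000, 'b': 0b000},
}
ACCEPT = 0b100                    # state 2 is accepting

def nfa_step(active: int, sym: str) -> int:
    """One clock tick: every active state fires its transitions 'in parallel'."""
    nxt = 0
    for s in range(3):
        if active & (1 << s):
            nxt |= DELTA[s].get(sym, 0)
    return nxt

def matches(text: str) -> bool:
    active = 0b001                # start in state 0 only
    for sym in text:
        active = nfa_step(active, sym)
    return bool(active & ACCEPT)

assert matches("ab")
assert matches("bbab")
assert not matches("aba")
```

In an FPGA fabric the per-state OR/AND logic above becomes actual gates and the `active` vector becomes flip-flops, so the whole step costs one clock cycle regardless of how many states are active.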
Thesis
The adoption of hardware accelerators, such as Field-Programmable Gate Arrays, into general-purpose computation pipelines continues to rise, driven by recent trends in data collection and analysis as well as pressure from challenging physical design constraints in hardware. The architectural designs of many of these accelerators stand in stark contrast to the traditional von Neumann model of CPUs. Consequently, existing programming languages, maintenance tools, and techniques are not directly applicable to these devices, meaning that additional architectural knowledge is required for effective programming and configuration. Current programming models and techniques are akin to assembly-level programming on a CPU, thus placing significant burden on developers tasked with using these architectures. Because programming is currently performed at such low levels of abstraction, the software development process is tedious and challenging and hinders the adoption of hardware accelerators. This dissertation explores the thesis that theoretical finite automata provide a suitable abstraction for bridging the gap between high-level programming models and maintenance tools familiar to developers and the low-level hardware representations that enable high-performance execution on hardware accelerators. We adopt a principled hardware/software co-design methodology to develop a programming model providing the key properties that we observe are necessary for success, namely performance and scalability, ease of use, expressive power, and legacy support. First, we develop a framework that allows developers to port existing, legacy code to run on hardware accelerators by leveraging automata learning algorithms in a novel composition with software verification, string solvers, and high-performance automata architectures. 
Next, we design a domain-specific programming language to aid programmers writing pattern-searching algorithms and develop compilation algorithms to produce finite automata, which supports efficient execution on a wide variety of processing architectures. Then, we develop an interactive debugger for our new language, which allows developers to accurately identify the locations of bugs in software while maintaining support for high-throughput data processing. Finally, we develop two new automata-derived accelerator architectures to support additional applications, including the detection of security attacks and the parsing of recursive and tree-structured data. Using empirical studies, logical reasoning, and statistical analyses, we demonstrate that our prototype artifacts scale to real-world applications, maintain manageable overheads, and support developers' use of hardware accelerators. Collectively, the research efforts detailed in this dissertation help ease the adoption and use of hardware accelerators for data analysis applications, while supporting high-performance computation.
... Operators that are most resource-consuming can receive optimized implementations. Constrained repetition, e.g., can create long NFAs (as seen in states 2 through 5 in Figure 6d) and is thus typically converted to counters [54]-[56] or shift registers [57]. Subsequent work was mostly devoted to addressing specific scenarios and constraints, mainly area or performance. ...
... Multi-character input is the typical approach to increase throughput for regular expression matching [57]- [60], as observed for exact string matching as well. Although it substantially increases area, throughput above 10 Gbps is reported in [57]- [59]. In [60] the authors propose a pre-processing stage to improve area efficiency, while still handling up to 7.27 Gbps. ...
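Multi-character input (multi-striding), the throughput technique these excerpts describe, can be sketched by precomposing a DFA's transition function so that one table lookup consumes k symbols: the table grows roughly with |alphabet|^k, but the number of steps (clock cycles) drops by a factor of k. The DFA, function names, and k=2 choice below are our own illustrative assumptions.

```python
# Sketch of 2-stride DFA construction: compose the single-symbol transition
# function with itself so each lookup consumes two input characters.
from itertools import product

def stride(delta, states, alphabet, k=2):
    """Build a k-stride transition table: one lookup per k input symbols."""
    wide = {}
    for s in states:
        for syms in product(alphabet, repeat=k):
            t = s
            for c in syms:            # walk k single-symbol transitions
                t = delta[(t, c)]
            wide[(s, ''.join(syms))] = t
    return wide

# DFA over {a,b} accepting strings that contain "ab" (state 2 is sticky/accepting)
delta = {(0, 'a'): 1, (0, 'b'): 0,
         (1, 'a'): 1, (1, 'b'): 2,
         (2, 'a'): 2, (2, 'b'): 2}
wide = stride(delta, [0, 1, 2], 'ab', k=2)

def run2(text):
    s = 0
    for i in range(0, len(text), 2):  # assumes even-length input for simplicity
        s = wide[(s, text[i:i + 2])]
    return s == 2

assert run2("ab")
assert run2("baab")
assert not run2("bbaa")
```

The area/throughput trade-off the excerpts mention shows up directly here: the strided table has |states| * |alphabet|^k entries, which is what "substantially increases area" refers to.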
Article
Network Functions Virtualization (NFV) has received considerable attention in the past few years, both from industry and academia, due to its potential for reducing capital and operational expenditures, thus enabling faster innovation in networks. NFV proposes decoupling network functions from fixed hardware platforms and implementing them as virtual machines on off-the-shelf servers. Although potentially able to provide the mentioned benefits, software-based implementations of compute-intensive network functions still struggle to perform at the desired speed, especially when requiring line-rate processing for ever-faster communication links. To address this, hardware acceleration can be used to improve the throughput and latency of virtualized network functions (VNFs). In order to provide the advantages foreseen for NFV, however, such accelerators cannot be fixed and application-specific hardware, since they need to cope with new VNFs as well as fluctuations in demand. In this context, Field-Programmable Gate Arrays (FPGAs) are particularly suitable, since they are able to provide high-performance implementations of network functions and they are completely reprogrammable, thereby being able to implement different VNFs even after deployment. There have been many recent efforts to enable the use of FPGAs in the NFV context, including efficient implementations of network functions on FPGAs, platforms to manage the integration and coexistence of multiple VNFs on an FPGA, and high-level synthesis tools especially tailored to ease the programming of VNFs for FPGAs. In this work we survey previous work covering these aspects, and discuss the main open research challenges that must be addressed before FPGA adoption in NFV infrastructures becomes effectively seamless and efficient.
... Note that the previously described design of REMU is one of the many methods of implementing regex search using FPGA. It can be replaced by other designs, such as NFA [47,61], DFA [14,27], B-FSM [33,56], bit-split automata [17,51], and so forth. Each method has its advantage and best applicable field. ...
... Extensive research has been reported in accelerating regex search over the past decade. Some researchers take advantage of GPU [30,62], SIMD [8,31,44], and multicore architectures [38], whereas others focus on FPGA/ASIC-based solutions [7,14,17,23,27,43,47,48,52,61]. ...
... Early work discusses regex search in FPGA/ASIC by mapping non-deterministic finite automata (NFA) [47,61] and deterministic finite automata (DFA) [7,14,27] to programmable logic. A recent study by Gogte et al. [17] proposed HARE, extended from Tandon et al. [52], which compiles regexes into subexpressions and runs bit-split automata [51] on each subexpression in parallel to achieve high throughput. ...
Article
This article presents REGISTOR, a platform for regular expression grabbing inside storage. The main idea of Registor is accelerating regular expression (regex) search inside storage, where large datasets are stored, eliminating the I/O bottleneck problem. A special hardware engine for regex search is designed and augmented inside a flash SSD that processes data on-the-fly during data transmission from NAND flash to host. To match the speed of regex search to the internal bus speed of a modern SSD, a deep pipeline structure is designed in Registor hardware consisting of a file semantics extractor, matching candidates finder, regex matching units (REMUs), and results organizer. Furthermore, each stage of the pipeline exploits maximal parallelism. To make Registor readily usable by high-level applications, we have developed a set of APIs and libraries in Linux allowing Registor to process files in the SSD by recombining separate data blocks into files efficiently. A working prototype of Registor has been built in our newly designed NVMe-SSD. Extensive experiments and analyses have been carried out to show that Registor achieves high throughput, reduces the I/O bandwidth requirement by up to 97%, and reduces CPU utilization by as much as 82% for regex search in large datasets.
... Helios [5] is another accelerator that processes regexps for network packet inspection at line rate. In addition, several works [189,215,156,36,214] propose mechanisms to match regexps on FPGAs. They focus on building a finite automaton and encode it in the logic of the FPGA. ...
Thesis
Emerging persistent memory (PM) technologies promise the performance of DRAM with the durability of disk. However, several challenges remain in existing hardware, programming, and software systems that inhibit wide-scale PM adoption. This thesis focuses on building efficient mechanisms that span hardware and operating systems, and programming languages for integrating PMs in future systems. First, this thesis proposes a mechanism to solve the low-endurance problem in PMs. PMs suffer from limited write endurance---PM cells can be written only 10^7-10^9 times before they wear out. Without any wear management, PM lifetime might be as low as 1.1 months. This thesis presents Kevlar, an OS-based wear-management technique for PM that requires no new hardware. Kevlar uses existing virtual memory mechanisms to remap pages, enabling it to perform both wear leveling---shuffling pages in PM to even wear; and wear reduction---transparently migrating heavily written pages to DRAM. Crucially, Kevlar avoids the need for hardware support to track wear at fine grain. It relies on a novel wear-estimation technique that builds upon Intel's Precise Event Based Sampling to approximately track processor cache contents via a software-maintained Bloom filter and estimate write-back rates at fine grain. Second, this thesis proposes a persistency model for high-level languages to enable integration of PMs into future programming systems. Prior works extend language memory models with a persistency model prescribing semantics for updates to PM. These approaches require high-overhead mechanisms, are restricted to certain synchronization constructs, provide incomplete semantics, and/or may recover to state that cannot arise in fault-free program execution. This thesis argues for persistency semantics that guarantee failure atomicity of synchronization-free regions (SFRs) --- program regions delimited by synchronization operations. 
The proposed approach provides clear semantics for the PM state that recovery code may observe and extends C++11's "sequential consistency for data-race-free" guarantee to post-failure recovery code. To this end, this thesis investigates two designs for failure-atomic SFRs that vary in performance and the degree to which commit of persistent state may lag execution. Finally, this thesis proposes StrandWeaver, a hardware persistency model that minimally constrains ordering on PM operations. Several language-level persistency models have emerged recently to aid programming recoverable data structures in PM. The language-level persistency models are built upon hardware primitives that impose stricter ordering constraints on PM operations than the persistency models require. StrandWeaver manages PM order within a strand, a logically independent sequence of PM operations within a thread. PM operations that lie on separate strands are unordered and may drain concurrently to PM. StrandWeaver implements primitives under strand persistency to allow programmers to improve concurrency and relax ordering constraints on updates as they drain to PM. Furthermore, StrandWeaver proposes mechanisms that map persistency semantics in high-level language persistency models to the primitives implemented by StrandWeaver.
... It is important to note that patterns matched by regexps can easily [7] be matched by an automaton, such as Deterministic Finite Automata (DFA) or Nondeterministic Finite Automata (NFA). A variation of the former and the latter automata is called the hybrid DFA-NFA (hybrid-FA) [6]. ...
... The design by [7] is an automatic architectural optimisation approach which spatially stacks regexp matching circuits (REMs). It then forms multiple character matching (MCM) circuits. ...
... The structure is aimed at improving the overall design clock speed [31]. However, the challenge with the architecture designed in [7] is that the process of distributing and buffering the character matching signals was initially error-prone and difficult to implement manually. To address the problem, the approach proposed in [25] used a heuristic that automatically marshaled k-REMs with total N-states into p-pipelines. ...
Article
There are various kinds of network attacks, often identifiable by the patterns of data they contain. More complex regular expressions that express these patterns need to be matched at very high speed. Most hardware-based approaches build the equivalent automata using minimal hardware resources to detect pattern variations. This paper explains the design, structure, and suitability of an optimized hardware-based automata implementation called Equivalence Class Direct Table Synthesis Nondeterministic Finite Automata (ECD-RTS-NFA). The optimized approach described in this paper builds upon the earlier published version called Equivalence Class Descriptor Nondeterministic Finite Automata (ECD-NFA). The ECD-RTS-NFA also uses an Equivalence Classification (EC) technique. However, the ECD-RTS-NFA approach utilizes a newer form of table compression for its compressed ECs, called Equivalence Class Descriptors (ECDs). The ECDs are then used to match against multi-character strings rather than the initial single-character approach implemented in the ECD-NFA design. The optimized technique implemented in the ECD-RTS-NFA further improves the matching speed of the design while at the same time significantly reducing the overall resources required. This is achieved by taking full advantage of the Field Programmable Gate Array (FPGA) technology used for the hardware implementation. The design further provides higher throughput and support for quick updates, and clocks at 385.78 MHz, with a maximum throughput of 12.34 Gigabits per second (Gbps), representing a 3.35% improvement over the next best rival design in this paper.
... A lot of researchers tried to optimize hardware architectures in order to reduce FPGA logic utilization and support more REs, because RE matching has been primarily motivated by intrusion detection systems with a large set of REs. Throughput of the NFA architectures has been increased by multi-striding [1,3] and spatial-stacking [11,12]. Both these techniques increase the throughput by processing multiple input symbols at a single step. ...
... The circuit frequency drops because of the increasing complexity of transition logic, which limits the throughput to the order of tens of Gbps [1,3,7]. Spatial stacking suffers from the same frequency drop and throughput limitation in the order of tens of Gbps [11,12]. Both techniques are able to scale the throughput only to tens of Gbps. ...
Conference Paper
Regular expression matching (RE matching) is a widely used operation in network security monitoring applications. With the speed of network links increasing to 100 Gbps and 400 Gbps, it is necessary to speed up packet processing and provide RE matching at such high speeds. Although many RE matching algorithms and architectures have been designed, none of them supports 100 Gbps throughput together with fast updates of an RE set. Therefore, this paper focuses on the design of a new hardware architecture that addresses both these requirements. The proposed architecture uses multiple highly memory-efficient Delayed Input DFAs (D²FAs), which are organized into a processing pipeline. As all D²FAs in the pipeline have only local communication, the proposed architecture is able to operate at high frequency even for a large number of parallel engines, which allows scaling throughput to hundreds of gigabits per second. The paper also analyses how to scale the number of engines and the capacity of buffers to achieve desired throughput. Using the parameters obtained while matching a sample RE set represented by a D²FA in real network traffic, the architecture can be tuned for wire-speed throughput of 400 Gbps.
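The memory efficiency of the Delayed Input DFA (D²FA) mentioned in this abstract comes from default transitions: states that mostly agree share their transitions through a "default" edge, which is followed without consuming input until the symbol resolves. The tiny two-state automaton below is our own example of that compression, not the paper's design.

```python
# Hedged sketch of D2FA compression: state 1 stores no transitions of its
# own and defers everything to its default state 0, which it agrees with.

# Full DFA over {a,b,c}: states 0 and 1 have identical outgoing transitions.
full = {
    0: {'a': 1, 'b': 0, 'c': 0},
    1: {'a': 1, 'b': 0, 'c': 0},
}

# D2FA: only differences from the default state are stored explicitly.
d2fa = {
    0: {'trans': {'a': 1, 'b': 0, 'c': 0}, 'default': None},
    1: {'trans': {}, 'default': 0},   # all of state 1's moves deferred to 0
}

def d2fa_next(state, sym):
    """Follow default edges (consuming no input) until sym resolves."""
    while sym not in d2fa[state]['trans']:
        state = d2fa[state]['default']
    return d2fa[state]['trans'][sym]

def run(step, text):
    s = 0
    for c in text:
        s = step(s, c)
    return s

# The compressed table answers every query exactly like the full DFA.
assert all(d2fa_next(s, c) == full[s][c] for s in full for c in 'abc')
assert run(d2fa_next, "abca") == 1
```

The cost of the saved memory is the extra default-edge hops per input symbol, which is exactly the "delayed input" behavior the pipeline in the architecture above has to absorb.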