Figure 2 - uploaded by Jonathan A Shidal
A high level architecture diagram of the hardware scheduler along with the custom instruction interface. 


Source publication
Article
Full-text available
In Dynamic Data-Driven Application Systems, applications must dynamically adapt their behavior in response to objectives and conditions that change while deployed. One approach to achieve dynamic adaptation is to offer middleware that facilitates component migration between modalities in response to such dynamic changes. The triggering, planning, a...

Contexts in source publication

Context 1
... check is accomplished using a single custom instruction that returns a preempt flag set by the hardware scheduler, based on which the processor can then choose to continue executing the current task or to run another. A high-level block diagram of the hardware scheduler is shown in Figure 2. ...
Context 2
... affects processor utilization. To ensure that tasks meet their deadlines, the scheduler's worst-case execution times are often overestimated. This can cause a system to be underutilized and wastes CPU resources. In this paper, we examine how the scheduler overhead and its variation can be reduced by migrating scheduling functionality (along with time-tick processing) to hardware logic. The expected results of our efforts are increased CPU utilization and better system predictability. Another benefit is that the hardware clock provides accurate high-resolution timing. The rest of the paper is organized as follows. Section 2 presents related work on hardware schedulers. Section 3 describes the scalable hardware scheduler architecture and implementation details. The evaluation methodology and results are discussed in Sections 4 and 5. Section 6 describes a software approach to an adaptive ordered-set data structure. Conclusions and future work are presented in Section 7.

Many architectures [3], [4], [5], [6], [7], [8] have been proposed to improve the performance of schedulers using hardware accelerators. A real-time kernel called FASTHARD has been implemented in hardware [3]. The FASTHARD scheduler can handle 256 tasks and 8 priority levels. The Spring scheduling coprocessor [4] was built to accelerate the scheduling algorithms used in the Spring kernel [9] and to perform feasibility analysis of the schedule. Mooney et al. [5] implemented a configurable hardware scheduler that supports three scheduling disciplines, selectable at runtime. A slack-stealing scheduling algorithm was implemented in hardware [6] to support scheduling of both periodic and aperiodic tasks and to reduce scheduling overhead. A hardware scheduler for a multiprocessor system-on-chip that implements the Pfair scheduling algorithm is presented in [7].
A real-time task manager (RTM) that implements scheduling, time management, and event management in hardware is presented in [8]. The RTM supports static priority-based scheduling and is implemented as an on-chip peripheral that communicates with the processor through a memory-mapped interface. Most of the schedulers listed above implement some kind of priority-based scheduling algorithm that requires a priority queue to keep tasks sorted by priority. Many hardware priority queue architectures have been implemented in the past, most of them in the realm of real-time networks for packet scheduling [10, 11, 12]. Moon et al. [10] compared four scalable priority queue architectures: FIFO-, binary-tree-, shift-register-, and systolic-array-based. The shift-register architecture suffers from bus loading, as new tasks must be broadcast to all the queue cells. The systolic-array architecture overcomes the problem of bus loading at the cost of doubling the hardware storage requirements. The hardware complexity of both the shift-register and systolic-array architectures increases linearly with the number of elements, as each cell requires a separate comparator. This makes these architectures expensive to scale in terms of hardware resources. Bhagwan and Lin [11] proposed a pipelined priority queue architecture based on the p-heap (a data structure similar to the binary heap). A pipelined heap manager was proposed in [12] to pipeline conventional heap operations. Both of these pipelined priority queue implementations are scalable and designed for high throughput, but at the expense of increased hardware complexity.

In this paper we present a scalable hardware priority queue architecture that implements a conventional binary heap in hardware. The priority queue is used as part of the scheduler to improve system performance and predictability.
The hardware priority queue supports enqueue operations in constant time and dequeue operations in O(log n) time. The hardware utilization of our priority queue increases logarithmically with the queue size, and the design avoids complex pipelining logic. The hardware scheduler architecture we propose is designed to reduce the time-tick processing and scheduling overhead of the system. The design also uses concurrency in hardware to make operations on the priority queue more predictable. The instruction set architecture of the processor is correspondingly extended with a set of custom instructions to communicate with the scheduler. The hardware scheduler executes the scheduling algorithm and returns control to the processor along with the next task to execute; context switching is then done in software. A software timer periodically generates interrupts to check for the availability of a higher-priority task. The check is accomplished using a single custom instruction that returns a preempt flag set by the hardware scheduler, based on which the processor can then choose to continue executing the current task or to run another. A high-level block diagram of the hardware scheduler is shown in Figure 2. The controller is the central processing unit of the scheduler and is responsible for executing the scheduling algorithm. It processes instruction calls from the processor and monitors the task queues. The timer unit keeps track of the time elapsed since the start of the scheduler, providing accurate high-resolution timing; the resolution of the timer tick can be configured at runtime. The interface to the scheduler is provided through a set of custom instructions as an extension to the processor's instruction set architecture, which removes bus dependencies for data transfer. Basic scheduler operations such as run, configure, add task, and preempt task are supported.
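The decision logic described above — a periodic timer interrupt polling a hardware-set preempt flag — can be sketched as a small software model. The `Scheduler` class and `read_preempt_flag` name below are illustrative stand-ins for the hardware scheduler and its custom instruction, not the paper's actual interface:

```python
class Scheduler:
    """Minimal software stand-in for the hardware scheduler's state.

    In the real design this state lives in FPGA logic; the processor
    only sees it through the custom instruction interface.
    """

    def __init__(self):
        self.current_priority = 0        # priority of the running task
        self.highest_ready_priority = 0  # top of the hardware ready queue

    def read_preempt_flag(self):
        # Models the single custom instruction that returns the preempt
        # flag: set when a higher-priority task is ready to run.
        return self.highest_ready_priority > self.current_priority


def on_timer_tick(sched):
    """Periodic check: keep running the current task or switch."""
    if sched.read_preempt_flag():
        return "context_switch"  # switch is performed in software
    return "continue"
```

The point of the single-instruction check is that the software interrupt handler does no queue inspection of its own; it reads one flag and branches.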
The ready queue stores active tasks ordered by priority. The sleep queue stores sleeping tasks until their activation time; the task with the earliest activation time is at the front. At the core of the scheduler are these task queues, which are implemented as priority queues that keep the tasks sorted by priority (ready queue) or activation time (sleep queue). One of the most common software data structures for implementing a priority queue is the binary heap, which supports enqueue and dequeue operations in O(log n) time. The binary heap is stored as a linear array in which the first element corresponds to the root. Given the index i of an element, ⌊i/2⌋, 2i, and 2i+1 are the indices of its parent, left child, and right child, respectively. Here we present a hardware implementation of the conventional binary heap that supports enqueue and peek operations in O(1) time and dequeue operations in O(log n) time. Although a dequeue operation takes O(log n) time to complete, the top-priority task can be returned immediately, so the dequeue overlaps its work with that of the rest of the scheduler and the application. A high-level architecture diagram of the priority queue is shown in Figure 3. Central to the priority queue is the queue manager, which provides the necessary interface and executes operations on the queue. The elements of each level of the heap are stored in separate on-chip memories called block RAMs (BRAMs) to enable parallel access to heap elements, similar to [11, 12]. The address decoder generates addresses and control signals for the BRAM blocks. The queue operations are explained in detail below. An enqueue operation on a binary heap inserts the new element at the bottom of the heap and performs compare-and-swap operations with successive parents until the priority of the new element is less than that of its parent.
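The 1-based array layout can be made concrete with a few lines of index arithmetic. This is a software model only; in the hardware design each heap level resides in its own BRAM, so the level number doubles as a memory-block select:

```python
# 1-based binary-heap array layout: index 1 is the root.

def parent(i):
    return i // 2          # parent of element i

def left(i):
    return 2 * i           # left child of element i

def right(i):
    return 2 * i + 1       # right child of element i

def level(i):
    # Heap level of index i (root is level 0). Each level maps to a
    # separate BRAM block, which is what permits parallel access to
    # one element per level.
    return i.bit_length() - 1

# Example: index 5 has parent 2, children 10 and 11, and sits on level 2.
assert (parent(5), left(5), right(5), level(5)) == (2, 10, 11, 2)
```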
The worst-case behavior occurs when the priority of the new element is greater than that of every other node in the heap; in this case, the new element bubbles up all the way to the root. We first calculate the path from a leaf node to the root. The index of this leaf node is always one more than the current size of the queue, and the path includes all of its ancestors up to the root. The heap property ensures that the elements along this path are in sorted order. The shift-register mechanism shown in Figure 3 inserts a new element in constant time; it is similar to the shift-register priority queue described in [10]. Each level of the heap is mapped to an enqueue cell, which consists of a comparator, a multiplexer, and a register. The element to be inserted is broadcast to all the cells during an enqueue operation. The enqueue operation is then completed in the three steps shown in Figure 4. In the first step, all the elements on the path from the leaf node to the root are loaded into the corresponding enqueue cells; the address for each BRAM block is generated by the address decoder. In the second step, the comparator in each enqueue cell compares the priority of the new element with the locally stored element and decides whether to latch the current element, the new element, or the element from the cell above it. In the final step, the elements, along with the new entry, are stored back into the heap. The dequeue operation can be divided into two parts: removing the root element from the queue (as the value to be returned by the dequeue call) and reconstructing the heap. The root element is removed by replacing it with the last element of the queue, which keeps the heap balanced. The new root element is then compared with the larger of its children and swapped if its priority is less than that of the child. This operation is repeated until the priority of the new root element is no less than that of its children.
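As a rough software model of the enqueue just described, assuming a 1-based max-heap stored in a Python list (slot 0 unused): the hardware loads the whole leaf-to-root path and lets every enqueue cell compare in parallel, which a sequential model reduces to a bubble-up along that precomputed path. The function names here are illustrative, not the paper's interface:

```python
def insertion_path(size):
    """Indices from the new leaf slot (size + 1) up to the root (1).

    One index per heap level — this is the set of elements loaded
    into the enqueue cells in step 1 of the hardware operation.
    """
    i = size + 1
    path = []
    while i >= 1:
        path.append(i)
        i //= 2
    return path


def enqueue(heap, key):
    """Sequential model of the 3-step hardware enqueue (max-heap)."""
    size = len(heap) - 1
    path = insertion_path(size)  # step 1: path elements -> enqueue cells
    heap.append(None)            # reserve the new leaf slot
    # Steps 2-3: each cell keeps its element, takes the new key, or
    # takes its parent's element; sequentially, a bubble-up. The path
    # is already sorted (heap property), so one comparison per level
    # suffices.
    i = path[0]
    while i > 1 and heap[i // 2] < key:
        heap[i] = heap[i // 2]
        i //= 2
    heap[i] = key
```

Because every level has its own cell and BRAM, all the loads, compares, and stores happen once per step rather than once per level, which is what makes the hardware enqueue constant-time.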
An example of a dequeue operation is shown in Figure 5. Note that the highest-priority value is obtained in constant time, and because the priority queue is managed in hardware, the processor is not required to wait for the operation to complete. The worst-case execution time of a dequeue operation is O(log n), which affects the rate at which consecutive operations can be performed on the queue. However, ...
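The two-part dequeue can likewise be modeled in software. This sequential sketch (a behavioral model of a 1-based max-heap, not the RTL) mirrors only the functional result; in hardware the root is returned immediately and the sift-down proceeds concurrently with the application:

```python
def dequeue(heap):
    """Model of dequeue on a 1-based max-heap (heap[0] unused).

    Part 1: remove and return the root (constant time in hardware).
    Part 2: rebuild the heap by moving the last element to the root
    and sifting it down — the O(log n) portion that the hardware
    overlaps with other work.
    """
    top = heap[1]              # value returned immediately in hardware
    last = heap.pop()          # take the last element to keep balance
    size = len(heap) - 1
    if size >= 1:
        heap[1] = last
        i = 1
        while True:
            l, r = 2 * i, 2 * i + 1
            largest = i
            # Compare with the larger existing child; swap if smaller.
            if l <= size and heap[l] > heap[largest]:
                largest = l
            if r <= size and heap[r] > heap[largest]:
                largest = r
            if largest == i:
                break
            heap[i], heap[largest] = heap[largest], heap[i]
            i = largest
    return top
```

The rate limit the text mentions follows directly from part 2: a new dequeue cannot start until the sift-down has restored the heap property.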

Similar publications

Article
Full-text available
In a real-time system, a series of jobs invoked by each task should finish its execution before its deadline, and EDF (Earliest Deadline First) is one of the most popular scheduling algorithms to meet such timing constraints of a set of given tasks. However, EDF is known to be ineffective in meeting timing constraints for non-preemptive tasks (whic...
Article
Full-text available
A new version of a robot operating system (ROS-2) has been developed to address the real-time and fault constraints of distributed robotics applications. However, current implementations lack strong real-time scheduling and the optimization of response time for various tasks and applications. This may lead to inconsistent system behavior and may af...
Article
Full-text available
Real-time scheduling plays an important role in the design of real-time systems. Every real-time task is characterized by its execution time, deadline, and slack time. In order to meet timing constraints in real-time systems, real-time scheduling is vital. Earliest deadline first (EDF) and Least Laxity First (LLF) are dynamic real-time schedul...
Article
Full-text available
One way to minimize resource requirements is through careful management and allocation, i.e., scheduling. Research on weakly hard real-time scheduling on multiprocessors has been extremely limited; most prior research on weakly hard real-time scheduling has been confined to uniprocessors. The need for multiprocessors is due t...
Article
Full-text available
In this paper, scheduling of dependent threads in distributed real-time systems is considered. We present a distributed real-time scheduling algorithm called EOE-DRTSA (end-to-end distributed real-time system scheduling algorithm). Nowadays, real-time systems are distributed. One of the least developed areas of real-time scheduling is distri...

Citations

... A similar methodology is proposed in [31]. The instruction set architecture of the processor is extended to interface it with the hardware scheduler. ...
Chapter
Operating systems for reconfigurable computing (RCOSes) facilitate the use of Field Programmable Gate Arrays (FPGAs). RCOSes abstract from hardware details, utilise virtualisation, and provide standardised functionality. They allow different applications to run hardware tasks concurrently on the same FPGA by managing shared resources like FPGA area, I/O, and memory. In addition to spatial partitioning, time-multiplexed sharing of the FPGA can be achieved via Dynamic Partial Reconfiguration (DPR). In this way, operating systems for reconfigurable computing help user applications to increase their performance and decrease their energy consumption without needing to know the underlying concepts. RCOSes thus pave the way for applications to exploit the advantages of FPGAs while accounting for their limitations, such as limited area and limited accessibility of configuration ports. Furthermore, an RCOS can benefit from outsourcing parts of the OS into the FPGA. This survey outlines key concepts and gives an overview of state-of-the-art operating systems for reconfigurable computing. It points out general and specific limitations of RCOSes. Finally, future trends are identified, which include a specialisation of RCOSes with respect to their applications' requirements, such as real-time processing, low energy consumption, reliability, safety, and security.
... However, it can be a very efficient structure for real-time systems. In this context, [22] demonstrates the usefulness of using both AVL and Red-Black trees in the priority queue in Dynamic Data-Driven Application Systems: when the system anticipates intensive search operations, it converts the tree to an AVL tree, while when it anticipates intensive update operations, it converts the tree to a Red-Black tree. This transformation can be done easily since the same code is used. ...
Article
Full-text available
We propose a unique representation of both AVL and Red-Black trees with the same time and space complexity. We describe all the maintenance operations and also the insertion and deletion algorithms. We give the implementation of the proposed tree and the results. We then make a comparison of the three structures. The simulation results confirm the performance of the new representation relatively to AVL and Red-Black trees.
... This author showed that the shift register architecture suffers from heavy bus loading, and that the systolic array overcomes this problem at the cost of doubling the hardware complexity. In [6,7], a hybrid priority queue architecture is proposed, based on a p-heap (which is similar to binary heap). However, the proposed priority queue supports en/dequeue operations in O(log n) time against a fixed time for the systolic array and shift register, where n is the number of keys. ...
Conference Paper
Full-text available
This paper presents a fast systolic priority queue architecture usable in a traffic manager. The purpose of the traffic manager is to schedule the departure of packets on egress ports in a network processing unit. In the context of this work, this scheduling should ensure that packets are sent in such a way to meet the allowed bandwidth quotas for each packet flow. Also, an important goal is to reduce latency to a minimum in order to best support the upcoming 5G wireless standards. The proposed hardware architecture of the systolic priority queue enables pipelined en/dequeue operations at constant time rate. Detailed description of this processing module is provided, together with the associated algorithm, and the architecture of the traffic manager. The implemented architecture is based on the C coding language and is synthesized with the Vivado High Level Synthesis tool. The obtained results are compared across a range of priority queue depths and performance metrics with existing approaches. A throughput improvement of 44% is claimed over best previously reported results. The proposed design of the traffic manager works at 118 MHz when implemented on a Kintex-7 FPGA from Xilinx.
... Our new structure can be applied in many real applications where AVL or Red-Black trees are used, since it is equivalent to both of them; it can be a very efficient structure for real-time systems. In this context, the authors of [28] demonstrate the usefulness of using both AVL and Red-Black trees in the priority queue for Dynamic Data-Driven Application Systems: when the system anticipates intensive search operations, it converts the tree to an AVL tree; when it anticipates intensive update operations, it converts the tree to a Red-Black tree. ...
Conference Paper
Full-text available
Since the invention of AVL trees in 1962 and Red-black trees in 1978, researchers were divided in two separated communities, AVL supporters and Red-Black ones. Indeed, AVL trees are commonly used for retrieval applications whereas Red Black trees are used in updates operations, so, the choice of a structure must be done firstly even if the operations are not known to be searches or updates. That is the main reason why we propose a common tree with the same complexity and memory space, representing both an AVL and a Red-Black tree, this new tree allows to gather together the two communities on one hand, and to expand the scope of AVL and Red-Black tree applications on the other hand.
... hardware accelerators) may have large preemption overhead and this needs to be accounted for accurate schedulability analysis. For example, consider the hardware priority queue described in [16]. A high-level hardware architecture diagram of the priority queue is shown in Figure 4. ...
Article
Full-text available
In Dynamic Data-Driven Application Systems (DDDAS), applications must dynamically adapt their behavior in response to objectives and conditions that change while deployed. Often these applications may be safety-critical or tightly resource-constrained, with a need for graceful degradation when introduced to unexpected conditions. This paper begins by motivating and providing a vision for a dynamically adaptable mixed-criticality computing platform to support DDDAS applications. We then focus specifically on the need for advancements in task models and scheduling algorithms to manage the resources of such a platform. We discuss the shortcomings of existing task models for capturing important attributes of our envisioned computing platform, and identify challenges that must be addressed when developing scheduling algorithms that act upon our proposed extended task model.
... The Spring Scheduling Coprocessor (SSCoP) [6] is one of the first examples of a hardware task scheduler and introduces simple queues for the set of scheduled tasks. Others have implemented hardware scheduling using some form of custom logic and a HWPQ [22,12,11,5,13]. In contrast to the prior work, which focuses on hardware support for a single fixed-size PQ, our work demonstrates how arbitrarily-large PQs can share a HWPQ. ...
Conference Paper
Hardware support can reduce the time spent operating on data structures by exploiting circuit-level parallelism. Such hardware data structures (HWDSs) can reduce the latency and jitter of data structure operations, which can benefit real-time systems by reducing worst-case execution times (WCETs). For example, a hardware priority queue (HWPQ) can enqueue and dequeue prioritized items in constant time with low variance; the best software implementations take logarithmic time for at least one of the enqueue or dequeue operations. The main problems with HWDSs are the limited size of hardware and the complexity of sharing it. In this paper we show that software support can help circumvent the size and sharing limitations of hardware so that applications can benefit from a HWDS. We evaluate our work by showing how the choice of software or hardware affects the schedulability of task sets that use multiple priority queues of varying sizes. We model task behavior on two applications that are important in real-time and embedded domains: the grey-weighted distance transform for topology mapping and Dijkstra's algorithm for GPS navigation. Our results indicate that HWDSs can reduce the WCET of applications even when a HWDS is shared by multiple data structures or when data structure sizes exceed HWDS size constraints.
Article
In this paper, we address a key challenge in designing flow-based traffic managers (TMs) for next-generation networks. One key functionality of a TM is to schedule the departure of packets on egress ports. This scheduling ensures that packets are sent in a way that meets the allowed bandwidth quotas for each flow. A TM handles policing, shaping, scheduling, and queuing. The latter is a core function in traffic management and is a bottleneck in the context of high-speed network devices. Aiming at high throughput and low latency, we propose a single-instruction-multiple-data (SIMD) hardware priority queue (PQ) to sort packets in real time, supporting independently the three basic operations of enqueuing, dequeuing, and replacing in a single clock cycle. A proof of validity of the proposed hardware PQ data structure is presented. The implemented PQ architecture is coded in C++. Vivado high-level synthesis is used to generate synthesizable register transfer logic from the C++ model. This implementation on a ZC706 field-programmable gate array (FPGA) shows the scalability of the proposed solution for various queue depths with almost constant performance. It offers a 10× throughput improvement when compared to prior works, and it supports links operating at 100 Gb/s.