Figure 2 - uploaded by Jonathan A Shidal
A high level architecture diagram of the hardware scheduler along with the custom instruction interface. 


Source publication
Article
Full-text available
In Dynamic Data-Driven Application Systems, applications must dynamically adapt their behavior in response to objectives and conditions that change while deployed. One approach to achieve dynamic adaptation is to offer middleware that facilitates component migration between modalities in response to such dynamic changes. The triggering, planning, a...

Contexts in source publication

Context 1
... check is accomplished using a single custom instruction that returns a preempt flag set by the hardware scheduler, based on which the processor can then choose to continue executing the current task or to run another. A high-level block diagram of the hardware scheduler is shown in Figure 2. ...
Context 2
... affects processor utilization. To ensure that tasks meet their deadlines, the scheduler's worst-case execution times are often overestimated. This can cause a system to be underutilized and wastes CPU resources. In this paper, we examine how the scheduler overhead and its variation can be reduced by migrating scheduling functionality (along with time-tick processing) to hardware logic. The expected results of our efforts are increased CPU utilization and better system predictability. Another benefit is that the hardware clock provides accurate high-resolution timing. The rest of the paper is organized as follows. Section 2 presents related work on hardware schedulers. Section 3 describes the scalable hardware scheduler architecture and implementation details. The evaluation methodology and results are discussed in Sections 4 and 5. Section 6 describes a software approach to an adaptive ordered-set data structure. Conclusions and future work are presented in Section 7.

Many architectures [3], [4], [5], [6], [7], [8] have been proposed to improve the performance of schedulers using hardware accelerators. A real-time kernel called FASTHARD has been implemented in hardware [3]. The FASTHARD scheduler can handle 256 tasks and 8 priority levels. The Spring scheduling coprocessor [4] was built to accelerate the scheduling algorithms used in the Spring kernel [9] and to perform feasibility analysis of the schedule. Mooney et al. [5] implemented a configurable hardware scheduler that supports three scheduling disciplines, selectable at runtime. A slack-stealing scheduling algorithm was implemented in hardware [6] to support scheduling of both periodic and aperiodic tasks and to reduce scheduling overhead. A hardware scheduler for a multiprocessor system-on-chip that implements the Pfair scheduling algorithm is presented in [7].
A real-time task manager (RTM) that implements scheduling, time management, and event management in hardware is presented in [8]. The RTM supports static priority-based scheduling and is implemented as an on-chip peripheral that communicates with the processor through a memory-mapped interface. Most of the schedulers listed above implement some kind of priority-based scheduling algorithm that requires a priority queue to keep tasks sorted by priority. Many hardware priority queue architectures have been implemented in the past, most of them in the realm of real-time networks for packet scheduling [10, 11, 12]. Moon et al. [10] compared four scalable priority queue architectures: FIFO-, binary-tree-, shift-register-, and systolic-array-based. The shift-register architecture suffers from bus loading, as new tasks must be broadcast to all the queue cells. The systolic-array architecture overcomes the problem of bus loading at the cost of doubling the hardware storage requirements. The hardware complexity of both the shift-register and systolic-array architectures increases linearly with the number of elements, as each cell requires a separate comparator. This makes these architectures expensive to scale in terms of hardware resources. Bhagwan and Lin [11] proposed a pipelined priority queue architecture based on the p-heap (a data structure similar to the binary heap). A pipelined heap manager was proposed in [12] to pipeline conventional heap operations. Both of these pipelined priority queue implementations are scalable and designed for high throughput, but at the expense of increased hardware complexity.

In this paper we present a scalable hardware priority queue architecture that implements a conventional binary heap in hardware. The priority queue is used as part of the scheduler to improve system performance and predictability.
The hardware priority queue supports enqueue operations in constant time and dequeue operations in O(log n) time. The hardware utilization of our priority queue increases logarithmically with the queue size, and the design avoids complex pipelining logic. The hardware scheduler architecture we propose is designed to reduce the time-tick processing and scheduling overhead of the system. The design also uses concurrency in hardware to make operations on the priority queue more predictable. The instruction set architecture of the processor is correspondingly extended with a set of custom instructions to communicate with the scheduler. The hardware scheduler executes the scheduling algorithm and returns control to the processor along with the next task to execute; context switching is then done in software. A software timer periodically generates interrupts to check for the availability of a higher-priority task. The check is accomplished using a single custom instruction that returns a preempt flag set by the hardware scheduler, based on which the processor can then choose to continue executing the current task or to run another. A high-level block diagram of the hardware scheduler is shown in Figure 2. The controller is the central processing unit of the scheduler and is responsible for executing the scheduling algorithm. It processes instruction calls from the processor and monitors the task queues. The timer unit keeps track of the time elapsed since the start of the scheduler, providing accurate high-resolution timing; the resolution of the timer tick can be configured at runtime. The interface to the scheduler is provided through a set of custom instructions as an extension to the processor's instruction set architecture, which removes bus dependencies for data transfer. Basic scheduler operations such as run, configure, add task, and preempt task are supported.
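The decision logic described above — a periodic timer interrupt polling a hardware-set preempt flag — can be sketched as a small software model. The `Scheduler` class and `read_preempt_flag` name below are illustrative stand-ins for the hardware scheduler and its custom instruction, not the paper's actual interface:

```python
class Scheduler:
    """Minimal software stand-in for the hardware scheduler's state.

    In the real design this state lives in FPGA logic; the processor
    only sees it through the custom instruction interface.
    """

    def __init__(self):
        self.current_priority = 0        # priority of the running task
        self.highest_ready_priority = 0  # top of the hardware ready queue

    def read_preempt_flag(self):
        # Models the single custom instruction that returns the preempt
        # flag: set when a higher-priority task is ready to run.
        return self.highest_ready_priority > self.current_priority


def on_timer_tick(sched):
    """Periodic check: keep running the current task or switch."""
    if sched.read_preempt_flag():
        return "context_switch"  # switch is performed in software
    return "continue"
```

The point of the single-instruction check is that the software interrupt handler does no queue inspection of its own; it reads one flag and branches.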
The ready queue stores active tasks ordered by priority. The sleep queue stores sleeping tasks until their activation time; the task with the earliest activation time is at the front. At the core of the scheduler are these task queues, which are implemented as priority queues that keep the tasks sorted by priority (ready queue) or activation time (sleep queue). One of the most common software data structures for implementing a priority queue is the binary heap, which supports enqueue and dequeue operations in O(log n) time. The binary heap is stored as a linear array in which the first element corresponds to the root. Given the index i of an element, ⌊i/2⌋, 2i, and 2i+1 are the indices of its parent, left child, and right child, respectively. Here we present a hardware implementation of the conventional binary heap that supports enqueue and peek operations in O(1) time and dequeue operations in O(log n) time. Although a dequeue operation takes O(log n) time to complete, the top-priority task can be returned immediately, so the dequeue overlaps its work with that of the rest of the scheduler and the application. A high-level architecture diagram of the priority queue is shown in Figure 3. Central to the priority queue is the queue manager, which provides the necessary interface and executes operations on the queue. The elements of each level of the heap are stored in separate on-chip memories called block RAMs (BRAMs) to enable parallel access to heap elements, similar to [11, 12]. The address decoder generates addresses and control signals for the BRAM blocks. The queue operations are explained in detail below. An enqueue operation on a binary heap inserts the new element at the bottom of the heap and performs compare-and-swap operations with successive parents until the priority of the new element is less than that of its parent.
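The 1-based array layout can be made concrete with a few lines of index arithmetic. This is a software model only; in the hardware design each heap level resides in its own BRAM, so the level number doubles as a memory-block select:

```python
# 1-based binary-heap array layout: index 1 is the root.

def parent(i):
    return i // 2          # parent of element i

def left(i):
    return 2 * i           # left child of element i

def right(i):
    return 2 * i + 1       # right child of element i

def level(i):
    # Heap level of index i (root is level 0). Each level maps to a
    # separate BRAM block, which is what permits parallel access to
    # one element per level.
    return i.bit_length() - 1

# Example: index 5 has parent 2, children 10 and 11, and sits on level 2.
assert (parent(5), left(5), right(5), level(5)) == (2, 10, 11, 2)
```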
The worst-case behavior occurs when the priority of the new element is greater than that of every other node in the heap; in this case, the new element bubbles up all the way to the root. We first calculate the path from a leaf node to the root. The index of this leaf node is always one more than the current size of the queue, and the path includes all of its ancestors up to the root. The heap property ensures that the elements along this path are in sorted order. The shift-register mechanism shown in Figure 3 inserts a new element in constant time; it is similar to the shift-register priority queue described in [10]. Each level of the heap is mapped to an enqueue cell, which consists of a comparator, a multiplexer, and a register. The element to be inserted is broadcast to all the cells during an enqueue operation. The enqueue operation is then completed in the three steps shown in Figure 4. In the first step, all the elements on the path from the leaf node to the root are loaded into the corresponding enqueue cells; the address for each BRAM block is generated by the address decoder. In the second step, the comparator in each enqueue cell compares the priority of the new element with the locally stored element and decides whether to latch the current element, the new element, or the element from the cell above it. In the final step, the elements, along with the new entry, are stored back into the heap. The dequeue operation can be divided into two parts: removing the root element from the queue (as the value to be returned by the dequeue call) and reconstructing the heap. The root element is removed by replacing it with the last element of the queue, which keeps the heap balanced. The new root element is then compared with the larger of its children and swapped if its priority is less than that of the child. This operation is repeated until the priority of the new root element is no less than that of its children.
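As a rough software model of the enqueue just described, assuming a 1-based max-heap stored in a Python list (slot 0 unused): the hardware loads the whole leaf-to-root path and lets every enqueue cell compare in parallel, which a sequential model reduces to a bubble-up along that precomputed path. The function names here are illustrative, not the paper's interface:

```python
def insertion_path(size):
    """Indices from the new leaf slot (size + 1) up to the root (1).

    One index per heap level — this is the set of elements loaded
    into the enqueue cells in step 1 of the hardware operation.
    """
    i = size + 1
    path = []
    while i >= 1:
        path.append(i)
        i //= 2
    return path


def enqueue(heap, key):
    """Sequential model of the 3-step hardware enqueue (max-heap)."""
    size = len(heap) - 1
    path = insertion_path(size)  # step 1: path elements -> enqueue cells
    heap.append(None)            # reserve the new leaf slot
    # Steps 2-3: each cell keeps its element, takes the new key, or
    # takes its parent's element; sequentially, a bubble-up. The path
    # is already sorted (heap property), so one comparison per level
    # suffices.
    i = path[0]
    while i > 1 and heap[i // 2] < key:
        heap[i] = heap[i // 2]
        i //= 2
    heap[i] = key
```

Because every level has its own cell and BRAM, all the loads, compares, and stores happen once per step rather than once per level, which is what makes the hardware enqueue constant-time.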
An example of a dequeue operation is shown in Figure 5. Note that the highest-priority value is obtained in constant time, and because the priority queue is managed in hardware, the processor is not required to wait for the operation to complete. The worst-case execution time of a dequeue operation is O(log n), which affects the rate at which consecutive operations can be performed on the queue. However, ...
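The two-part dequeue can likewise be modeled in software. This sequential sketch (a behavioral model of a 1-based max-heap, not the RTL) mirrors only the functional result; in hardware the root is returned immediately and the sift-down proceeds concurrently with the application:

```python
def dequeue(heap):
    """Model of dequeue on a 1-based max-heap (heap[0] unused).

    Part 1: remove and return the root (constant time in hardware).
    Part 2: rebuild the heap by moving the last element to the root
    and sifting it down — the O(log n) portion that the hardware
    overlaps with other work.
    """
    top = heap[1]              # value returned immediately in hardware
    last = heap.pop()          # take the last element to keep balance
    size = len(heap) - 1
    if size >= 1:
        heap[1] = last
        i = 1
        while True:
            l, r = 2 * i, 2 * i + 1
            largest = i
            # Compare with the larger existing child; swap if smaller.
            if l <= size and heap[l] > heap[largest]:
                largest = l
            if r <= size and heap[r] > heap[largest]:
                largest = r
            if largest == i:
                break
            heap[i], heap[largest] = heap[largest], heap[i]
            i = largest
    return top
```

The rate limit the text mentions follows directly from part 2: a new dequeue cannot start until the sift-down has restored the heap property.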

Similar publications

Article
Full-text available
In a real-time system, a series of jobs invoked by each task should finish its execution before its deadline, and EDF (Earliest Deadline First) is one of the most popular scheduling algorithms to meet such timing constraints of a set of given tasks. However, EDF is known to be ineffective in meeting timing constraints for non-preemptive tasks (whic...
Article
Full-text available
A new version of a robot operating system (ROS-2) has been developed to address the real-time and fault constraints of distributed robotics applications. However, current implementations lack strong real-time scheduling and the optimization of response time for various tasks and applications. This may lead to inconsistent system behavior and may af...
Article
Full-text available
Real-time scheduling plays an important role in the design of real-time systems. Every real-time task is characterized by its execution time, deadline, and slack time. In order to meet timing constraints in real-time systems, real-time scheduling is vital. Earliest deadline first (EDF) and Least Laxity First (LLF) are dynamic real-time schedul...
Article
Full-text available
One way to minimize resource requirements is through careful management and allocation, i.e., scheduling. Research on weakly hard real-time scheduling on multiprocessors has been extremely limited; most prior research on weakly hard real-time scheduling has been confined to uniprocessors. The need for multiprocessors is due t...
Article
Full-text available
In this paper, scheduling of dependent threads in distributed real-time systems is considered. We present a distributed real-time scheduling algorithm called EOE-DRTSA (end-to-end distributed real-time system scheduling algorithm). Nowadays, real-time systems are distributed. One of the least developed areas of real-time scheduling is distri...

Citations

... A similar methodology is proposed in [31]. The instruction set architecture of the processor is extended to interface it with the hardware scheduler. ...
Chapter
Operating systems for reconfigurable computing (RCOSes) facilitate the use of Field Programmable Gate Arrays (FPGAs). RCOSes abstract from hardware details, utilise virtualisation, and provide standardised functionality. They allow different applications to run hardware tasks concurrently on the same FPGA by managing shared resources like FPGA area, I/O, and memory. In addition to spatial partitioning, time-multiplexed sharing of the FPGA can be achieved via Dynamic Partial Reconfiguration (DPR). In this way, operating systems for reconfigurable computing help user applications to increase their performance and decrease their energy consumption without needing to know the underlying concepts. RCOSes thus pave the way for applications to exploit the advantages of FPGAs while accounting for their limitations, such as limited area and limited accessibility of configuration ports. Furthermore, an RCOS can benefit from outsourcing parts of the OS into the FPGA. This survey outlines key concepts and gives an overview of state-of-the-art operating systems for reconfigurable computing. It points out general and specific limitations of RCOSes. Finally, future trends are identified, which include a specialisation of RCOSes with respect to their applications' requirements, such as real-time processing, low energy consumption, reliability, safety, and security.
... However, it can be a very efficient structure for real-time systems. In this context, [22] demonstrates the usefulness of using both AVL and Red-Black trees in the priority queue in Dynamic Data-Driven Application Systems: when the system anticipates intensive search operations, it converts the tree to an AVL tree, while when it anticipates intensive update operations, it converts the tree to a Red-Black tree. This transformation can be done easily since the same code is used. ...
Article
Full-text available
We propose a unique representation of both AVL and Red-Black trees with the same time and space complexity. We describe all the maintenance operations and also the insertion and deletion algorithms. We give the implementation of the proposed tree and the results. We then make a comparison of the three structures. The simulation results confirm the performance of the new representation relatively to AVL and Red-Black trees.
... This author showed that the shift register architecture suffers from heavy bus loading, and that the systolic array overcomes this problem at the cost of doubling the hardware complexity. In [6,7], a hybrid priority queue architecture is proposed, based on a p-heap (which is similar to binary heap). However, the proposed priority queue supports en/dequeue operations in O(log n) time against a fixed time for the systolic array and shift register, where n is the number of keys. ...
Conference Paper
Full-text available
This paper presents a fast systolic priority queue architecture usable in a traffic manager. The purpose of the traffic manager is to schedule the departure of packets on egress ports in a network processing unit. In the context of this work, this scheduling should ensure that packets are sent in such a way to meet the allowed bandwidth quotas for each packet flow. Also, an important goal is to reduce latency to a minimum in order to best support the upcoming 5G wireless standards. The proposed hardware architecture of the systolic priority queue enables pipelined en/dequeue operations at constant time rate. Detailed description of this processing module is provided, together with the associated algorithm, and the architecture of the traffic manager. The implemented architecture is based on the C coding language and is synthesized with the Vivado High Level Synthesis tool. The obtained results are compared across a range of priority queue depths and performance metrics with existing approaches. A throughput improvement of 44% is claimed over best previously reported results. The proposed design of the traffic manager works at 118 MHz when implemented on a Kintex-7 FPGA from Xilinx.
... Our new structure can be applied in many real applications where AVL or Red-Black trees are used, since it is equivalent to both of them; it can be a very efficient structure for real-time systems. In this context, the authors of [28] demonstrate the usefulness of using both AVL and Red-Black trees in the priority queue for Dynamic Data-Driven Application Systems: when the system anticipates intensive search operations, it converts the tree to an AVL tree; when it anticipates intensive update operations, it converts the tree to a Red-Black tree. ...
Conference Paper
Full-text available
Since the invention of AVL trees in 1962 and Red-black trees in 1978, researchers were divided in two separated communities, AVL supporters and Red-Black ones. Indeed, AVL trees are commonly used for retrieval applications whereas Red Black trees are used in updates operations, so, the choice of a structure must be done firstly even if the operations are not known to be searches or updates. That is the main reason why we propose a common tree with the same complexity and memory space, representing both an AVL and a Red-Black tree, this new tree allows to gather together the two communities on one hand, and to expand the scope of AVL and Red-Black tree applications on the other hand.
... hardware accelerators) may have large preemption overhead and this needs to be accounted for accurate schedulability analysis. For example, consider the hardware priority queue described in [16]. A high-level hardware architecture diagram of the priority queue is shown in Figure 4. ...
Article
Full-text available
In Dynamic Data-Driven Application Systems (DDDAS), applications must dynamically adapt their behavior in response to objectives and conditions that change while deployed. Often these applications may be safety-critical or tightly resource-constrained, with a need for graceful degradation when introduced to unexpected conditions. This paper begins by motivating and providing a vision for a dynamically adaptable mixed-criticality computing platform to support DDDAS applications. We then focus specifically on the need for advancements in task models and scheduling algorithms to manage the resources of such a platform. We discuss the shortcomings of existing task models for capturing important attributes of our envisioned computing platform, and identify challenges that must be addressed when developing scheduling algorithms that act upon our proposed extended task model.
... The Spring Scheduling Coprocessor (SSCoP) [6] is one of the first examples of a hardware task scheduler and introduces simple queues for the set of scheduled tasks. Others have implemented hardware scheduling using some form of custom logic and a HWPQ [22,12,11,5,13]. In contrast to the prior work, which focuses on hardware support for a single fixed-size PQ, our work demonstrates how arbitrarily-large PQs can share a HWPQ. ...
Conference Paper
Hardware support can reduce the time spent operating on data structures by exploiting circuit-level parallelism. Such hardware data structures (HWDSs) can reduce the latency and jitter of data structure operations, which can benefit real-time systems by reducing worst-case execution times (WCETs). For example, a hardware priority queue (HWPQ) can enqueue and dequeue prioritized items in constant time with low variance; the best software implementations take logarithmic time for at least one of the enqueue or dequeue operations. The main problems with HWDSs are the limited size of hardware and the complexity of sharing it. In this paper we show that software support can help circumvent the size and sharing limitations of hardware so that applications can benefit from a HWDS. We evaluate our work by showing how the choice of software or hardware affects the schedulability of task sets that use multiple priority queues of varying sizes. We model task behavior on two applications that are important in real-time and embedded domains: the grey-weighted distance transform for topology mapping and Dijkstra's algorithm for GPS navigation. Our results indicate that HWDSs can reduce the WCET of applications even when a HWDS is shared by multiple data structures or when data structure sizes exceed HWDS size constraints.
Article
In this paper, we address a key challenge in designing flow-based traffic managers (TMs) for next-generation networks. One key functionality of a TM is to schedule the departure of packets on egress ports. This scheduling ensures that packets are sent in a way that meets the allowed bandwidth quotas for each flow. A TM handles policing, shaping, scheduling, and queuing. The latter is a core function in traffic management and is a bottleneck in the context of high-speed network devices. Aiming at high throughput and low latency, we propose a single-instruction-multiple-data (SIMD) hardware priority queue (PQ) to sort packets in real time, supporting independently the three basic operations of enqueuing, dequeuing, and replacing in a single clock cycle. A proof of validity of the proposed hardware PQ data structure is presented. The implemented PQ architecture is coded in C++. Vivado high-level synthesis is used to generate synthesizable register transfer logic from the C++ model. This implementation on a ZC706 field-programmable gate array (FPGA) shows the scalability of the proposed solution for various queue depths with almost constant performance. It offers a 10× throughput improvement when compared to prior works, and it supports links operating at 100 Gb/s.