Figure 18 - uploaded by Lambert Schaelicke
Physical Memory Layout for 2 Processes

Source publication
Article
Full-text available
ML-RSIM is an execution-driven computer system simulator that combines detailed models of modern computer hardware, including the I/O subsystem, with a fully-functional operating system kernel. These features make the simulation environment particularly attractive for studies involving applications with significant I/O or operating system activity.

Similar publications

Conference Paper
Full-text available
Given a list of filtering rules with individual hitting probabilities, it is known that the average processing time of a linear-search based firewall can be minimized by searching rules in some appropriate order. This paper proposes a new yet simple technique called the linear-tree structure. It utilizes an advanced feature of modern firewalls,...
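The ordering intuition behind such rule-reordering schemes can be sketched in a few lines (the rule list and probabilities below are hypothetical, an illustration of the general principle rather than code from the paper): when every rule costs the same to test, searching rules in decreasing order of hit probability minimizes the expected number of comparisons in a linear scan.

```python
# Expected number of rule comparisons for a linear-search firewall:
# a packet matching rule i (probability p_i) costs i + 1 comparisons.
def expected_comparisons(hit_probs):
    return sum(p * (i + 1) for i, p in enumerate(hit_probs))

# Hypothetical hit probabilities for four rules, in their original order.
probs = [0.1, 0.5, 0.1, 0.3]

original = expected_comparisons(probs)
reordered = expected_comparisons(sorted(probs, reverse=True))
assert reordered <= original  # descending hit probability never does worse
```

Here the original order costs 2.6 expected comparisons while the descending order costs 1.8; with unequal per-rule costs the optimal order instead sorts by probability-to-cost ratio.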
Conference Paper
Full-text available
Multi-constrained Vehicle Routing Problems are gaining steadily in importance. In particular, the dynamic version of the problem has received more emphasis due to modern service requirements, such as short-term or express delivery. With a growing number of dedicated solution approaches for these problems, we investigate a simulation-based supervised lea...
Article
Full-text available
A simulation-based project is designed which can be practically implemented in a workspace (in this case, a barber shop). The design algorithm provides the user with different time-varying features such as the number of people arriving, the number of people being served, the number of people waiting in the queue, etc., depending on input criteria and system capabilit...
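A single-server queue of this kind can be sketched with a short discrete-event loop (the rates, seed, and function name below are illustrative assumptions, not taken from the paper):

```python
import random

def simulate_barber(arrival_rate, service_rate, n_customers, seed=0):
    """Single-chair barber shop with exponential interarrival and service
    times. Returns (customers who had to wait, mean waiting time)."""
    rng = random.Random(seed)
    t_arrival = 0.0          # arrival time of the current customer
    server_free_at = 0.0     # time at which the barber becomes idle
    waits = []
    for _ in range(n_customers):
        t_arrival += rng.expovariate(arrival_rate)
        start = max(t_arrival, server_free_at)   # wait if the barber is busy
        waits.append(start - t_arrival)
        server_free_at = start + rng.expovariate(service_rate)
    return sum(1 for w in waits if w > 0), sum(waits) / len(waits)

waited, mean_wait = simulate_barber(1.0, 2.0, 100)
```

Extending the loop to track queue length over time or to cap the waiting area yields the kind of time-varying statistics the abstract describes.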
Article
Full-text available
The specific features of detecting signals and estimating their parameters in the far-field zone of an antenna in the presence of intense interferences in the near-field zone are considered using modern adaptive algorithms. An imitative simulation shows the possibilities of the adaptive methods for localizing sources in the far- and near-field ante...

Citations

... I/O devices needed to boot an OS are also simulated. Gem5 [8], ML-RSim [23], MARSSx86 [24] and PTLSim [10] are some examples of full system timing simulators. ...
... As a result, full-system simulation is a necessity. This research requires additional capabilities beyond previously existing full-system simulators [26, 27, 28], such as a detailed and accurate model of the I/O subsystem, including the network interface controller (NIC), and the ability to simulate multiple networked systems in a controlled and timing-accurate fashion. A key goal of M5 is modularity. ...
Article
Many important workloads today, such as web-hosted services, are limited not by processor core performance but by interactions among the cores, the memory system, I/O devices, and the complex software layers that tie these components together. Architects who optimize system designs for these workloads are challenged to identify performance bottlenecks before the systems are built. This identification is challenging because, as in any concurrent system, overheads in one component may be hidden due to overlapping with other operations. These overlaps span the user/kernel and software/hardware boundaries, making traditional tools inadequate. Common software profiling techniques cannot account for hardware bottlenecks or situations in which software overheads are hidden due to overlapping with hardware operations. This thesis presents a methodology for identifying true end-to-end critical paths in systems composed of multiple layers of hardware and software, particularly in the domain of high-speed networking. The state machines that implicitly or explicitly govern the behavior of all the layers are modeled and their local interactions captured to build an end-to-end dependence graph that can be used to locate bottlenecks. This is done incrementally, with modest effort and only local understanding. Furthermore, it is shown that queue-based interactions are necessary and sufficient to capture information from complex protocols, multiple connections and multiple processors. The resulting dependence graph is created and analyzed, distilling the huge amount of collected data into a set of bottleneck locations, including where the most un-overlapped time is spent and locations where the addition of some buffering could improve the system's performance without any other optimizations. Additionally, this technique provides accurate quantitative predictions of the benefit of eliminating bottlenecks.
The end result of this analysis, minutes after the data is gathered, is: 1) the identity of the component that causes the bottleneck; 2) the extent to which a component must be improved before it is no longer the bottleneck; 3) the next bottleneck that will be exposed in the system; and 4) the performance improvement that will occur before the next bottleneck is reached. The analysis can be repeated for successive bottlenecks and is far faster than the available alternatives.
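The critical-path idea can be illustrated with a toy dependence graph (the node names and edge weights below are invented for illustration; the thesis derives such graphs automatically from state-machine interactions): the bottleneck lies along the longest weighted path from source to sink.

```python
# Toy end-to-end dependence DAG: edges (src -> dst) weighted by
# un-overlapped time. Names and weights are illustrative only.
edges = {
    "app_send":    [("kernel_copy", 3.0)],
    "kernel_copy": [("nic_dma", 5.0), ("tcp_stack", 2.0)],
    "tcp_stack":   [("nic_dma", 1.0)],
    "nic_dma":     [("wire", 4.0)],
    "wire":        [],
}

def critical_path(node):
    """Longest weighted path from node to a sink, via a recursive DAG walk."""
    best, best_path = 0.0, [node]
    for nxt, weight in edges[node]:
        cost, path = critical_path(nxt)
        if weight + cost > best:
            best, best_path = weight + cost, [node] + path
    return best, best_path
```

For this graph, `critical_path("app_send")` reports a 12.0-unit path through `kernel_copy` and `nic_dma`, so speeding up `tcp_stack` alone would not improve end-to-end time; memoizing the walk keeps it linear in the graph size for larger graphs.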
... Some simulators have been developed to enable specific functionality. For example, SimOS [8] and ML-RSIM [20] support the execution of an OS. Simulators such as simg4 [16], were developed to model a specific processor (the PowerPC 7400) in detail. ...
Conference Paper
Full-text available
Supercomputers are increasingly complex systems merging conventional microprocessors with system-on-a-chip level designs that provide the network interface and router. At Sandia National Labs, we are developing a simulator to explore the complex interactions that occur at the system level. This paper presents an overview of the simulation framework with a focus on the enhancements needed to transform traditional simulation tools into a simulator capable of modeling system level hardware interactions and running native software. Initial validation results demonstrate simulated performance that matches the Cray Red Storm system installed at Sandia. In addition, we include a "what if" study of performance implications on the Red Storm network interface.
... A prototype of the user-level I/O architecture described here has been implemented in the ML-RSIM simulation environment [18]. Based on RSIM [19], this simulation system combines detailed models of a dynamically scheduled CPU with caches, a memory controller and several I/O devices with a fully-functional Unix-like operating system. ...
Article
Full-text available
To address the growing I/O bottleneck, next-generation distributed I/O architectures employ scalable point-to-point interconnects and minimize operating system overhead by providing user-level access to the I/O subsystem. Reduced I/O overhead allows I/O intensive applications to efficiently employ latency hiding techniques for improved throughput. This paper presents the design of a novel scalable user-level I/O architecture and evaluates the impact of various architectural mechanisms in terms of overall performance improvement. Results demonstrate that eliminating data movement across protection domains is the dominant contributor to improved scalability. Eliminating system call and interrupt overhead only has a small additional benefit that may not justify the additional hardware support required. While this evaluation is based on one specific design, the conclusions can be generalized to other user-level I/O architectures.
... Conventional architectural simulators, which execute only user-level application code and functionally emulate kernel behavior, do not provide meaningful results for these workloads. Of the handful of existing full-system simulators [16,22,24], none provide detailed network I/O modeling nor are easily extendable to do so. As a result, we developed our own simulator, called M5, to meet our specific needs [2]. ...
Conference Paper
Current high-performance computer systems are unable to saturate the latest available high-bandwidth networks such as 10 Gigabit Ethernet. A key obstacle in achieving 10 gigabits per second is the high overhead of communication between the CPU and network interface controller (NIC), which typically resides on a standard I/O bus with high access latency. Using several network-intensive benchmarks, we investigate the impact of this overhead by analyzing the performance of hypothetical systems in which the NIC is more closely coupled to the CPU, including integration on the CPU die. We find that systems with high-latency NICs spend a significant amount of time in the device driver. NIC integration can substantially reduce this overhead, providing significant throughput benefits when other CPU processing is not a bottleneck. NIC integration also enables cache placement of DMA data. This feature has tremendous benefits when payloads are touched quickly, but potentially can harm performance in other situations due to cache pollution.
... Thus conventional architectural simulators, which execute only user-level application code with functional emulation of kernel interactions, would not provide meaningful results for networking workloads. While a few full-system simulators exist [14, 15, 21, 22], none provided the detailed network I/O modeling we required, and none seemed easy to modify to provide this feature. We thus decided to extend our existing application-only architectural simulator, which executes the Alpha ISA, to support full-system simulation. ...
Article
Full-text available
High-bandwidth TCP/IP networking is a core component of current and future computer systems. Though networking is central to computing today, the vast majority of end-host networking research focuses on the current paradigm of the network interface being merely a peripheral device. Most optimizations focus solely on software changes or on moving some of the computation from the primary CPU to the off-chip network interface controller (NIC). We present an alternative approach for achieving high performance networking. Rather than increasing the complexity of the NIC, we directly integrate a conventional NIC on the CPU die. To evaluate this approach, we have developed a simulation environment specifically targeted for networked systems. It simulates server and client systems along with a network in a single process. Full-system simulation captures the execution of both application and OS code. Our model includes a detailed out-of-order CPU, event-driven memory hierarchy, and Ethernet interface device. Using this simulator, we find that tighter integration of the network interface can provide benefits in TCP/IP throughput and latency. We also see that the interaction of the NIC with the on-chip memory hierarchy has a greater impact on performance than the raw improvements in bandwidth and latency that come from integration.
... ML-RSIM [21] is an event-driven cycle-accurate simulator that integrates detailed processor and cache models with a complete I/O subsystem. Combined with the Unix-compatible Lamix operating system, ML-RSIM provides a unique tool that allows researchers to study the interaction of computer architecture, I/O activity, system software and applications. ...
Conference Paper
Full-text available
The constant increase in levels of integration and the reduction of the time-to-market have led to the definition of new methodologies stressing reuse. This involves not only the reuse of pre-designed processing components in the form of intellectual properties (IPs) but also that of pre-designed architectures. For such architectures to be reused for various applications they have to be heavily parameterized. Several manufacturers, in fact, produce pre-packed solutions for various classes of applications, in the form of parameterized system-on-a-chip (SOC) platforms. In this paper we present EPIC-Explorer, a framework to simulate a parameterized VLIW-based platform that will allow an embedded system designer to evaluate any instance of the platform in terms of performance, area and power consumption. The results obtained show that the framework can be effectively used to explore the space of possible configurations to evaluate the area/performance/power trade-off. The increase in levels of integration forecast for the coming decade indicate an enormous increase in the number of transistors as compared with the previous decade and the implementation of a whole system on a single chip. Unfortunately,
... ML-RSIM (Schaelicke and Parker 2002) is a derivative of URSIM (Zhang 2001), which is based on the original RSIM. It models the entire input/output subsystem and includes a functional operating system kernel called Lamix, which is System V compatible. ...
Article
Full-text available
In this paper we present RSIM x86, a port of the widely used RSIM performance simulator for cc-NUMA multiprocessors to GNU/Linux and x86 hardware. Then, we evaluate the simulation throughput obtained by RSIM on several platforms with respect to the hardware cost of each platform. We show that this port of RSIM obtains much better execution times using cheaper and more easily available hardware than the original RSIM, allowing a more efficient usage of our research resources.
... As a result, fullsystem simulation is a necessity. However, previously existing full-system simulators [5], [6], [7] lack other necessary capabilities, such as a detailed and accurate model of the I/O subsystem and network interface controller (NIC) and the ability to simulate multiple networked systems in a controlled and timing-accurate fashion. The difficulty of adapting an existing full-system simulator to meet our needs seemed larger than that of developing our own system, so we embarked on the development of M5. ...
Article
Full-text available
Performance accuracy is a critical but often neglected aspect of architectural performance simulators. One approach to evaluating performance accuracy is to attempt to reproduce observed performance results from a real machine. In this paper, we attempt to model the performance of a Compaq Alpha XP1000 workstation using the M5 full-system simulator. There are two novel aspects to this work. First, we simulate complex TCP/IP networking workloads and use network bandwidth as our primary performance metric. Unlike conventional CPU-intensive applications, these workloads spend most of their time in the operating system kernel and include significant interactions with platform hardware such as the interrupt controller and network interface device. Second, we attempt to achieve performance accuracy without extremely precise modeling of the reference hardware. Instead, we use simple generic component models and tune them to achieve appropriate bandwidths and latencies. Overall, we were able to achieve reasonable accuracy even with our relatively imprecise model, matching the bandwidth of the real system within 15% in most cases. We also used profiling to break CPU time down into categories, and found that the simulation results correlated well with the real machine.