Figure 2 - uploaded by József Kovács
Process recovering with AltixC/R package 

Source publication
This is the second document in a series of reports describing work that aims to integrate SZTAKI's high-level checkpointing tool, Total Checkpoint (TCKPT), with one of PSNC's low-level checkpointers. The low-level checkpointer considered in this report is AltixC/R; the high-level checkpointer remains TCKPT. As the functionality of the...

Context in source publication

Context 1
... AltixC/R is a kernel-level checkpointing package designed for Altix [18] systems equipped with IA64 [17] processors and running the ProPack [16] environment (a Linux-based environment prepared by SGI). Versions of the package are currently available both for ProPack based on Linux kernel 2.4 and for the more recent ProPack based on Linux kernel 2.6 [19] [20]. The package has all the features typical of the kernel-level approach. It is easy to use, and it makes no assumptions about the availability of source code or about the programming language in which the programs to be checkpointed were written. The package has to be deployed by the system administrator. It can checkpoint multi-process programs that communicate through System V IPC objects. Additionally, the package virtualizes some system-wide keys and identifiers. As a result, a recovered program is led to believe that its identifiers have not changed [8] (even though, for technical reasons, it is very likely that they have). The package was developed as part of the SGIGrid project [9]. The workflow of checkpointing and recovery is essential from the point of view of integration, so the way AltixC/R takes a checkpoint and recovers a process is presented below. AltixC/R can checkpoint any process that adheres to the imposed limitations, i.e. any process that uses only resources supported by AltixC/R. The language in which the checkpointed process was written does not matter. The checkpointed process is entirely independent of the checkpointing tools: it has no hooks or other elements in common with the AltixC/R package. In fact, the process being checkpointed is completely unaware of the existence of the checkpointing tool.
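The virtualization of identifiers mentioned in this excerpt can be pictured with a small translation table. The Python sketch below is only an illustration of the general idea, not AltixC/R's kernel implementation: the class name `IdVirtualizer`, its methods, and the numeric identifiers are all hypothetical. The process keeps using the identifier it saw before the checkpoint, while the table maps that value to whatever identifier the system assigns after recovery.

```python
class IdVirtualizer:
    """Toy table: identifier seen by the process -> identifier currently in use.

    Hypothetical illustration only; not the AltixC/R data structure.
    """

    def __init__(self):
        self._virtual_to_real = {}

    def register(self, real_id):
        # When a resource is first created, the process sees the real identifier.
        self._virtual_to_real[real_id] = real_id
        return real_id

    def rebind(self, virtual_id, new_real_id):
        # After recovery the system hands out a new identifier;
        # the value visible to the process stays the same.
        if virtual_id not in self._virtual_to_real:
            raise KeyError(f"unknown virtual id: {virtual_id}")
        self._virtual_to_real[virtual_id] = new_real_id

    def to_real(self, virtual_id):
        # Translate the identifier used by the process to the current real one.
        return self._virtual_to_real[virtual_id]


table = IdVirtualizer()
seen_by_process = table.register(1001)   # identifier before the checkpoint
table.rebind(seen_by_process, 2042)      # identifier assigned after recovery
print(seen_by_process, "->", table.to_real(seen_by_process))
```

The recovered program continues to pass the old value, and every lookup is redirected to the new one, which is why the program does not notice the change.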
The tools, resources and sequence of steps involved in taking a checkpoint are presented in figure 1. When the user decides to take a checkpoint, he or she executes the "chkpnt <PID>" command (where <PID> specifies the process being checkpointed). With the help of system calls, the procfs filesystem and special proprietary kernel modules (the syscover and ckpt modules), the chkpnt command saves the image of the process identified by PID to a file. Importantly, the whole operation is transparent to the process being checkpointed, and there is no way to force the process to perform any special procedures before or just after the checkpoint is taken. It means that the process being checkpointed releases and reallocates non-checkpointable resources when necessary. The recovery procedure is also completely transparent to the process being recovered. The resources, tools and sequence of operations involved in recovery are presented in figure 2. To recover the process, the user has to execute the "resume <image>" command, where <image> is the path to the image of the checkpointed process. To obtain the "frame" of the process being recovered, the resume command forks itself. Then, using the available system calls, the procfs filesystem and the proprietary kernel modules, the resume command replaces the memory space and other resources of the newly forked process with those saved in the checkpoint image. Finally, the resume command finishes execution and the recovered process resumes execution from the point where the checkpoint was taken. The recovered process has no way to interact with the recovery command, so registering and invoking callbacks is not possible (from TCKPT point of view it is very important ...
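The two user-facing commands named in this excerpt can be sketched as a thin wrapper that only builds their argument lists. This is a minimal sketch under stated assumptions: the command names `chkpnt` and `resume` and their single arguments come from the text, while the helper names `checkpoint_command` and `resume_command`, the example PID, and the example image path are hypothetical. The excerpt does not say where the image file is written, so no location is assumed here.

```python
from typing import List


def checkpoint_command(pid: int) -> List[str]:
    """Build the argument list for 'chkpnt <PID>' as described in the text."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["chkpnt", str(pid)]


def resume_command(image_path: str) -> List[str]:
    """Build the argument list for 'resume <image>' as described in the text."""
    if not image_path:
        raise ValueError("image_path must not be empty")
    return ["resume", image_path]


# Hypothetical values, used only to show the shape of the two commands.
print(" ".join(checkpoint_command(4321)))
print(" ".join(resume_command("/tmp/example.image")))
```

On a system where the package is installed, such argument lists could be handed to `subprocess.run`. Nothing is needed on the side of the checkpointed program itself, which matches the transparency described above.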

Citations

Article
This paper introduces a way to transform the existing parallel checkpointing techniques to be applied for software-heterogeneous ClusterGrid infrastructures. While existing solutions are aiming at providing application transparency by building special middleware, this paper aims at targeting both application and middleware transparency at the same time by inserting checkpoint functionality into the application. The compatibility and integrity requirements are identified and corresponding conditions are established. Some of the available checkpointing systems are checked against the conditions in order to examine their conformity. Based on the conditions, a novel checkpointing method is defined and the TotalCheckpoint tool is adapted for ClusterGrid.
Article
This paper introduces a combination of the existing parallel checkpointing techniques for software heterogeneous ClusterGrid infrastructures. Most of the existing solutions are aiming at supporting application transparency (no checkpoint related code development in application), but some others build middleware transparent (no service modification) solutions. The main contribution of this paper is to introduce a solution providing both application and middleware transparency at the same time. Compatibility and integrity requirements are identified and corresponding conditions are established using Abstract State Machines. The most relevant checkpointing systems are checked against the conditions in order to examine their conformity. Based on the conditions, a novel checkpointing method is defined and a proof of concept checkpointing tool, called TotalCheckpoint (TCKPT) is introduced.