Figure 4: Parallel architecture

Source publication
Article
Full-text available
Telemedicine is an important area of the medical field that is expanding daily, driven by many researchers interested in improving medical applications. A project started in Brazil in 2005, in the State of Santa Catarina, has developed a server called the CyclopsDCMServer, whose purpose is to adopt HDF for the manipulation of medical images (DICO...

Context in source publication

Context 1
... supporting random access, number encoding in native format, data compression, individual dataset encryption, and storage strategies for parallel I/O and multidimensional data structures; the current version is HDF5. There are two important features of HDF. The first is that files can contain binary data as multi-dimensional arrays and allow direct access to parts of the file without first parsing the entire contents [1]. The second is support for standard parallel I/O interfaces, known as the Parallel Hierarchical Data Format 5 (Parallel HDF5); working with it requires an MPI-IO interface, which is supported through MPICH ROMIO [9]. Nowadays ROMIO supports most common file systems, but PVFS is the only distributed file system it supports. The purpose of Parallel HDF5 is to make the library easy to use while remaining compatible with serial HDF5 files. One approach is to read and write data by hyperslab [1], i.e., a multidimensional array that can be spread by rows, columns, patterns and chunks; a hyperslab selection can be a logically contiguous collection of points, or a regular pattern of points or blocks, depending on the type used. When working with parallel I/O, the properties of the communication that will perform the I/O operation are important for synchronising the nodes. The Parallel HDF5 library provides two types of data-access properties (collective and independent) and four hyperslab models (Contiguous Hyperslab, Regularly Spaced Data, Pattern and Chunk) [1].

In HDF5 there are two essential structures which form the base of the library: the dataset and the group. A dataset is a multi-dimensional array of a datatype; HDF stores and organizes all kinds of data, from atomic types to compound types similar to the C structure construct. Special array operations, such as chunking, compression and extendibility, are available through the HDF library and can be applied to a dataset. Another important structure is the dataspace. Through a dataspace the components of a dataset, or even its attributes, are defined, as well as array ranks, sizes and types. The group is similar to a UNIX directory, though cycles are allowed. Every file starts with a root group, represented as /, which can be followed by the name of another group or a dataset, as shown in figure 2.
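As a rough illustration of the collective hyperslab access described above, the following is a minimal C sketch of a Parallel HDF5 write in which each MPI rank stores a contiguous block of rows of a shared dataset. It is not taken from the CyclopsDCMServer code; the file name, the dataset path /pixels and the array sizes are purely illustrative.

```c
/* Minimal sketch of a collective hyperslab write with Parallel HDF5.
 * Each MPI rank writes a contiguous block of rows ("Contiguous Hyperslab"
 * model) into a shared dataset; names and sizes are illustrative only. */
#include <mpi.h>
#include <hdf5.h>

#define ROWS_PER_RANK 4
#define COLS          8

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file through the MPI-IO virtual file driver (ROMIO). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Dataset shared by all ranks: (nprocs * ROWS_PER_RANK) x COLS. */
    hsize_t dims[2] = { (hsize_t)nprocs * ROWS_PER_RANK, COLS };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "/pixels", H5T_NATIVE_INT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own slab of rows in the file dataspace. */
    hsize_t start[2] = { (hsize_t)rank * ROWS_PER_RANK, 0 };
    hsize_t count[2] = { ROWS_PER_RANK, COLS };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    int data[ROWS_PER_RANK][COLS];
    for (int i = 0; i < ROWS_PER_RANK; i++)
        for (int j = 0; j < COLS; j++)
            data[i][j] = rank;            /* dummy pixel values */

    /* Collective data transfer: all ranks take part in the same write. */
    hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, xfer, data);

    H5Pclose(xfer); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

A program of this kind would typically be compiled with the parallel HDF5 wrapper h5pcc and launched through mpirun, as required for the parallel application discussed below.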
The related works are similar to ours, being concerned with the use of parallel I/O as a solution to the I/O bottleneck of accessing large amounts of stored data. The first related work, presented by Nikhil Laghave [10], is very similar to our work. It focuses on the use of a parallel I/O library for scalability issues involving fermion dynamics for nuclear structure (MFDn). That work used the parallel version of HDF5 for parallel I/O, testing both collective and independent models. As a result, once file sizes grow above 20 GB, parallel HDF5 becomes more cost-effective than sequential binary I/O for sufficiently large datasets. The work of A. Adelmann [11] focused on using parallel I/O for particle-based accelerator simulations, which involve vast quantities of data and large dimensional arrays. He evaluated parallel I/O performance for MPI code as well as parallel HDF5 through a developed API called H5part, a portable high-performance parallel data interface for particle simulations. He compared read and write performance in simulations between H5part, MPI-IO and one file per process. HDF5 showed good writing performance, though MPI-IO showed better results. Finally, H. Yu [12], even though not using parallel HDF as a solution, demonstrated an interesting work dealing with large earthquake simulations. He faced scalability issues and an I/O bottleneck, since earthquake simulations require large files for storage. To solve this problem, he developed an application using parallel I/O strategies through MPI I/O to address his needs. The results were considerable, removing the I/O bottleneck and also hiding pre-processing costs.

Our work proposes a new architecture for Macedo's approach, in order to avoid I/O bottlenecks and achieve better performance by using parallel data access to HDF files stored in the PVFS distributed file system. To accomplish this, we configured the MPI environment to work with Parallel HDF5. As cited in section 2.3, the Parallel HDF5 library requires a parallel MPI-IO interface through ROMIO, and when working with MPI it is necessary to design the application to run in a cluster environment. It is important to note the requirement to use the mpirun shell script to run any MPI application, which attempts to hide from the user the differences in starting jobs on various devices [13].

We created an additional procedure to work with the CyclopsDCMServer. This procedure is called every time some medical information needs to be retrieved or stored. For now, we built an application concerned only with I/O access, responsible solely for directly accessing a file to read and write a dataset. Figure 4 illustrates how the architecture works. It was crucial to modify the H5WL functionality: instead of having the H5WL responsible for reading and writing the binary information created by the PACS, it is treated as a new parallel application. When a client requests to store a DICOM file, the parallel application is initiated by calling the mpirun shell script; after that, all communication between them is made through socket connections. The communication takes place between the master process (represented by MPI process zero) and the H5WL, and the messages represent function parameters, such as the location that is the target of an operation (group path), the image buffer and the number of MPI processes. In write functions, for example, the H5WL first receives the DICOM file, creates a new hierarchy for the image based on the DICOM file layers, gets the path location for the new image (JPEG image) and then calls the mpirun procedure to start the MPI application. The master process handles the communication with the H5WL to retrieve the function to perform, the location (group) of the image in the HDF5 structure, the arguments for the job and the stream of images to be stored. After receiving the stream, the master node distributes the memory buffer to the processes in smaller buffers, one per node. Finally, once the jobs have executed, the main process returns to the wrapper the status of reading or writing the buffer.

Our experiments are based on the Parallel HDF5 architecture, adapted to use PVFS, MPI and the sequential CyclopsDCMServer. The parallel environment consists of a four-node cluster, as specified in Table 1, and one dedicated computer for the DICOM server. Unfortunately, the environment is a non-dedicated cluster that belongs to the Telemedicine Laboratory and shares a 100 Mb/s Ethernet network connection. The operating system installed on all nodes is CentOS with kernel 2.6.32, and there is only one PVFS metadata node. Each node has a PVFS client for access to the PVFS file system and MPI compiled.
It is important to note that our results do not take into consideration external factors, such as other computer users sharing the same ...
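To make the write path described in the excerpt a little more concrete, the following is a minimal sketch, not taken from the paper, of the buffer-distribution step: rank 0 plays the role of the master process that has received the image stream from the H5WL and splits it among the MPI processes with MPI_Scatterv. The buffer size and all variable names are assumptions made for the example; in the real architecture the stream would arrive over the socket connection to the H5WL.

```c
/* Sketch of the buffer-distribution step: rank 0 holds the image stream
 * (in the real server it would come from the H5WL socket) and splits it
 * evenly among the MPI processes before the parallel HDF5 write.
 * Names and sizes are assumed, not taken from the original code. */
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    long total = 0;
    unsigned char *stream = NULL;
    if (rank == 0) {
        total = 1 << 20;                 /* 1 MiB of dummy image data */
        stream = malloc(total);
        memset(stream, 0, total);
    }
    MPI_Bcast(&total, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    /* Per-rank chunk sizes and displacements; the last rank absorbs the
     * remainder when the buffer does not divide evenly. */
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    long base = total / nprocs;
    for (int p = 0; p < nprocs; p++) {
        counts[p] = (int)(p == nprocs - 1 ? total - base * p : base);
        displs[p] = (int)(base * p);
    }

    unsigned char *chunk = malloc(counts[rank]);
    MPI_Scatterv(stream, counts, displs, MPI_UNSIGNED_CHAR,
                 chunk, counts[rank], MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    /* ... each rank would now write its chunk as a hyperslab selection,
     * as in the earlier sketch, and report its status back to rank 0 ... */

    free(chunk); free(counts); free(displs);
    if (rank == 0) free(stream);
    MPI_Finalize();
    return 0;
}
```

Splitting the stream this way matches the description of the master node handing each process a smaller buffer; the master would then collect the per-rank statuses and return the overall result to the wrapper.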

Similar publications

Article
Full-text available
The Hierarchical Data Format (HDF) is an interesting approach for developing scientific applications where a large amount of data must be stored and accessed. A telemedicine project underway in the State of Santa Catarina (SC), in Brazil, has developed a server called the CyclopsDCMServer, which adopts the HDF for the manipulation of medical ima...

Citations

... Improvements using combined techniques based on enhanced data models (e.g., the Hierarchical Data Format, HDF, and the Network Common Data Format, NetCDF) and enhanced file systems (e.g., the Parallel Virtual File System, PVFS) bring distribution and partitioning to the file system strategy. [14][15][16] The drawback of such an approach remains the search for specific attribute values, which still demands file-content parsing. ...
Article
Full-text available
To design, build, and evaluate a storage model able to manage heterogeneous digital imaging and communications in medicine (DICOM) images. The model must be simple, but flexible enough to accommodate variable content without structural modifications; must be effective in answering query/retrieval operations according to the DICOM standard; and must provide performance gains in querying/retrieving content to justify its adoption by image-related projects. The proposal adapts the original decomposed storage model, incorporating structural and organizational characteristics present in DICOM image files. Tag values are stored according to their data types/domains, in a schema built on top of a standard relational database management system (RDBMS). Evaluation includes storing heterogeneous DICOM images, querying metadata using a variable number of predicates, and retrieving full-content images for different hierarchical levels. When compared to a well-established DICOM image archive, the proposal is 0.6-7.2 times slower in storing content; however, in querying individual tags, it is about 48.0% faster. In querying groups of tags, the DICOM decomposed storage model (DCMDSM) is outperformed in scenarios with a large number of tags and low selectivity (being 66.5% slower); however, when the number of tags is balanced with better selectivity predicates, the performance gains are up to 79.1%. In executing full-content retrieval, in turn, the proposal is about 48.3% faster. DCMDSM is a model built for the storage of heterogeneous DICOM content, based on a straightforward database design. The results obtained through its evaluation attest to its suitability as a storage layer for projects where DICOM images are stored once and queried/retrieved whenever necessary.
... In this paper we present experiments with four different distributed file systems, considering an enhancement of the CyclopsDCMServer architecture called PH5WRAP [10], which was designed and implemented to improve the parallel (or sequential) reading and writing of the binary part of the data. The differential of this contribution is showing this component supported by distinct high-performance distributed file systems. ...
... Future work for the present research considers experiments with the parallel part of PH5Wrap and a mix of tests with both sequential and concurrent requests. The studies [20] and [10] show some preliminary results using the right-hand part of PH5Wrap on PVFS. Even though the experimental results do not apply to all processes (e.g. ...
... Even though the experimental results do not apply to all processes (e.g. communication between H5Wrap and the MPI process) and in [10] the experiments were performed in a networkless environment (virtual machines), the results demonstrated an advantage on the parallel side. ...
Conference Paper
Full-text available
The new trend in data-intensive management indicates the importance of a distributed file system for both large-scale Internet services and cloud computing environments. I/O latency and application buffering sizes are two of a number of issues that are essential to analyse for different classes of distributed file systems. In this paper, a research work is presented comparing four different high-performance distributed file systems. Those systems were employed to support a medical image server application in a private storage environment. Experimental results highlight the importance of an appropriate distributed file system to provide a differential level of performance considering application-specific characteristics.
Thesis
Full-text available
The use of digital images in the medical diagnosis process can be observed at different scales and in different application scenarios, and has evolved in terms of the volume of acquired data and the number of examination modalities supported. The organization of this digital content, commonly represented by sets of images in the DICOM (Digital Imaging and Communications in Medicine) standard, is usually delegated to PACS (Picture Archiving and Communication System) systems based on the aggregation of heterogeneous hardware and software components. Part of these components interact to compose the PACS storage layer, responsible for the persistence of every digital image that, at some point, was acquired or viewed/manipulated through the system. Although they employ highly specialized resources such as DBMSs (Database Management Systems), current PACS storage layers are viewed and used as simple data repositories, assuming a passive behavior (that is, without the aggregation of business rules) when compared to other system components. This work proposes a new, simplified PACS architecture based on changes to its storage layer. The proposed changes replace the passive profile currently assumed by this layer with an active profile, making use of extensibility and data-distribution resources (not employed today) made available by its components. The proposed architecture concentrates on data communication and storage, using DBMS extensions and heterogeneous structures for storing conventional and non-conventional data, providing high performance in terms of scalability, support for large volumes of content and decentralized query processing. Structurally, the proposed architecture consists of a set of modules designed to explore the extensibility options present in DBMSs, incorporating characteristics and functionalities originally distributed among other PACS components (in the form of business rules). At the prototype level, results obtained from experiments indicate the viability of the proposed architecture, showing performance gains in metadata search and in DICOM image retrieval when compared to conventional PACS architectures. The flexibility of the proposal regarding the adoption of heterogeneous storage technologies is also evaluated positively, allowing the PACS storage layer to be extended in terms of scalability, processing power, fault tolerance and content representation.
Thesis
Full-text available
The amount of digital data generated daily has increased significantly. Consequently, applications need to handle increasing volumes of data, in a variety of formats and from a variety of sources, with high velocity, namely the Big Data problem. Since storage devices have not followed the performance evolution observed in processors and main memories, they have become the bottleneck of these applications. Parallel file systems are software solutions that have been widely adopted to mitigate the input and output (I/O) limitations found in current computing platforms. However, the efficient utilization of these storage solutions depends on an understanding of their behavior in different conditions of use. This is a particularly challenging task because of the multivariate nature of the problem, namely the fact that the overall performance of the system depends on the relationship and the influence of a large set of variables. This dissertation proposes an analytical multivariate model to represent storage performance behavior in parallel file systems for different configurations and workloads. An extensive set of experiments, executed in four real computing environments, was conducted in order to identify a significant number of relevant variables, to determine the influence of these variables on overall system performance, and to build and evaluate the proposed model. As a result of the characterization effort, the effect of three factors not explored in previous works is presented. Results of the model evaluation, comparing the behavior and values estimated by the model with the behavior and values measured in real environments for different usage scenarios, showed that the proposed model was successful in representing system performance. Although some deviations were found in the values estimated by the model, considering the significantly higher number of usage scenarios evaluated in this research work compared to previous proposals found in the literature, the accuracy of prediction was considered acceptable.
Conference Paper
This research work investigates the performance impact of operating systems' (OS) caching parameters on a parallel file system (PFS). Through an extensive experimental analysis, an analytical performance model is proposed in order to reflect caching effects on the performance of file write operations. A qualitative and quantitative evaluation of 855 test cases over 3 different platforms was performed. Results indicate that the proposed model is effective in representing caching effects, identifying both the occurrence and the intensity of performance degradation. The observed mean absolute percentage error (MAPE) of predicted values is less than 36%.