Valerie Taylor
Texas A&M University | TAMU · Department of Computer Science and Engineering

PhD

About

145 Publications
14,294 Reads
3,371 Citations

Publications (145)
Article
Energy-efficient scientific applications require insight into how high-performance computing system features impact the applications' power and performance. This insight results from the development of performance and power models. When used with an earthquake simulation and an aerospace application, a proposed modeling framework reduces energy con...
Conference Paper
The CORAL Scalable Science Benchmarks are full applications that are expected to test the full scale of future CORAL systems. It is important to better understand the power and performance characteristics of these benchmarks before using them to test future CORAL systems. In this paper we present an in-depth analysis of the power and performance characteristi...
Article
Understanding workload behavior plays an important role in performance studies. The growing complexity of applications and architectures has increased the gap among application developers, performance engineers, and hardware designers. To reduce this gap, we propose SKOPE, a SKeleton framework for Performance Exploration, that produces a descriptiv...
Article
Full-text available
Many/multi-core supercomputers provide a natural programming paradigm for hybrid MPI/OpenMP scientific applications. In this paper, we investigate the performance characteristics of five hybrid MPI/OpenMP scientific applications (two NAS Parallel benchmarks Multi-Zone SP-MZ and BT-MZ, an earthquake simulation PEQdyna, an aerospace application PMLB...
Article
Full-text available
The academic performance and engagement of youth from under-represented ethnic groups (African American, Latino, and Indigenous) in science, technology, engineering, and mathematics (STEM) show statistically large gaps in comparison with their White and Asian peers. Some of these differences can be attributed to the direct impact of economic forces...
Article
In this paper, we present the Energy-Aware Modeling and Optimization Methodology (E-AMOM) framework, which develops models of runtime and power consumption based upon performance counters and uses these models to identify energy-based optimizations for scientific applications. E-AMOM utilizes predictive models to employ run-time Dynamic Voltage and...
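For illustration, a counter-based power model of the kind E-AMOM builds can be sketched as a least-squares fit; the counter rates and wattages below are hypothetical, not E-AMOM's actual inputs or coefficients:

```python
import numpy as np

# Hypothetical training data: rows are application runs, columns are
# per-second hardware counter rates (e.g., cache misses, FLOPs, TLB misses).
counters = np.array([
    [1.2e7, 3.4e9, 2.1e5],
    [0.8e7, 2.9e9, 1.7e5],
    [2.3e7, 4.1e9, 3.0e5],
    [1.9e7, 3.8e9, 2.6e5],
])
measured_power = np.array([212.0, 198.5, 240.3, 231.1])  # watts

# Fit power ~ b0 + b1*c1 + b2*c2 + b3*c3 by ordinary least squares.
X = np.hstack([np.ones((counters.shape[0], 1)), counters])
coeffs, *_ = np.linalg.lstsq(X, measured_power, rcond=None)

# Predict power for a new run from its counter rates.
new_run = np.array([1.5e7, 3.6e9, 2.4e5])
predicted = coeffs[0] + coeffs[1:] @ new_run
print(f"predicted power: {predicted:.1f} W")
```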
Conference Paper
MuMMI (Multiple Metrics Modeling Infrastructure) is an environment that facilitates systematic measurement, modeling, and prediction of performance, power consumption, and performance-power tradeoffs for parallel systems. MuMMI builds upon three existing frameworks: Prophesy for performance modeling and prediction of parallel applicat...
Conference Paper
The MuMMI (Multiple Metrics Modeling Infrastructure) project provides an infrastructure that facilitates systematic measurement, modeling, and prediction of performance, power consumption, and performance-power tradeoffs for parallel systems. In this paper, we present the MuMMI framework, which consists of an Instrumentor, Databases, and an Analyzer. The MuM...
Conference Paper
In this paper, we investigate the performance characteristics of five hybrid MPI/OpenMP scientific applications (two NAS Parallel benchmarks Multi-Zone SP-MZ and BT-MZ, an earthquake simulation PEQdyna, an aerospace application PMLB and a 3D particle-in-cell application GTC) on a large-scale multithreaded Blue Gene/Q supercomputer at Argonne Nation...
Article
Earthquakes are one of the most destructive natural hazards on our planet Earth. Huge earthquakes striking offshore may cause devastating tsunamis, as evidenced by the 11 March 2011 Japan (moment magnitude Mw 9.0) and the 26 December 2004 Sumatra (Mw 9.1) earthquakes. Earthquake prediction (in terms of the precise time, place, and magnitude of a...
Article
In this paper, we integrate a 3D mesh generator into the simulation and use MPI to parallelize it, illustrate an element-based partitioning scheme for explicit finite element methods, and, based on this partitioning scheme and what we learned from our previous work, implement our hybrid MPI/OpenMP finite element earthquake simu...
Conference Paper
Full-text available
Surrogate-based Workload Application Performance Projection (SWAPP) is a framework for performance projections of High Performance Computing (HPC) applications using benchmark data. Performance projections of HPC applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in th...
Article
Full-text available
Modern large-scale scientific computation problems must execute in a parallel computational environment to achieve acceptable performance. Target parallel environments range from the largest tightly-coupled supercomputers to heterogeneous clusters of workstations. Grid technologies make Internet execution more likely. Hierarchical and heterogeneous...
Article
Seeking a comprehensive view of minority student demographics to determine what programs and policies are needed to promote diversity.
Article
Full-text available
Predictive models enable a better understanding of the performance characteristics of applications on multicore systems. Previous work has utilized performance counters in a system-centered approach to model power consumption for the system, CPU, and memory components. Often, these approaches use the same group of counters across different applicat...
Article
Full-text available
Energy consumption is a major concern with high-performance multicore systems. In this paper, we explore the energy consumption and performance (execution time) characteristics of different parallel implementations of scientific applications. In particular, the experiments focus on message-passing interface (MPI)-only versus hybrid MPI/OpenMP imple...
Conference Paper
In this paper, we present a performance modeling framework based on memory bandwidth contention time and a parameterized communication model to predict the performance of OpenMP, MPI and hybrid applications with weak scaling on three large-scale multicore clusters: IBM POWER4, POWER5+ and BlueGene/P, and analyze the performance of these MPI, OpenMP...
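As a rough sketch of such a model (with made-up parameters rather than the paper's calibrated ones), runtime can be decomposed into compute time, memory time under bandwidth contention, and a latency/bandwidth communication term:

```python
# Minimal bandwidth-contention runtime model; all parameters hypothetical.

def predicted_time(bytes_moved, peak_bw_gbs, cores_sharing,
                   compute_time_s, msgs, latency_s, msg_bytes, net_bw_gbs):
    """Runtime = compute + contended memory time + communication time."""
    # Effective per-core bandwidth degrades as more cores share the bus.
    effective_bw = peak_bw_gbs / cores_sharing
    memory_time = bytes_moved / (effective_bw * 1e9)
    # Parameterized communication model: latency term + volume / bandwidth.
    comm_time = msgs * latency_s + msg_bytes / (net_bw_gbs * 1e9)
    return compute_time_s + memory_time + comm_time

print(predicted_time(bytes_moved=4e10, peak_bw_gbs=25.6, cores_sharing=4,
                     compute_time_s=12.0, msgs=1e4, latency_s=2e-6,
                     msg_bytes=8e8, net_bw_gbs=5.0))
```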
Article
Full-text available
Chip multiprocessors (CMP) are widely used for high performance computing and are being configured in a hierarchical manner to compose a CMP compute node in a CMP system. Such a CMP system provides a natural programming paradigm for hybrid MPI/OpenMP applications. In this paper, we use OpenMP to parallelize a sequential earthquake simulation code f...
Article
The NAS Parallel Benchmarks (NPB) are well-known applications with fixed algorithms for evaluating parallel systems and tools. Multicore supercomputers provide a natural programming paradigm for hybrid programs, whereby OpenMP can be used for data sharing among the cores that comprise a node and MPI can be used for the communication b...
Article
Introducing CMD-IT, a new center focused on synergistic activities related to ethnic minorities and people with disabilities.
Chapter
Contents: Introduction; Cosmology SAMR Applications; Design of DistDLB; Experiments; Conclusion and Future Work; Acknowledgments; References
Chapter
Contents: Introduction; Major Issues Related to Mesh Partitioning for Distributed Systems; Description of PART; Parallel Simulated Annealing; Experiments; Previous Work; Conclusion and Future Work; References
Conference Paper
Chip multiprocessors (CMP) are widely used for high performance computing and are being configured in a hierarchical manner to compose a CMP compute node in a parallel system. OpenMP parallel programming within such a CMP node can take advantage of the globally shared address space and on-chip high inter-core bandwidth and low inter-core latency. I...
Conference Paper
Full-text available
Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems, enable them to compare the application performance across different existing and future systems, and help HPC users with syst...
Article
Full-text available
Chip multiprocessors (CMP) are widely used for high performance computing. Further, these CMPs are being configured in a hierarchical manner to compose a node in a cluster system. A major challenge to be addressed is efficient use of such cluster systems for large-scale scientific applications. In this paper, we quantify the performance gap result...
Conference Paper
Full-text available
Resource sharing and implementation of software stack for emerging multicore processors introduce performance and scaling challenges for large-scale scientific applications, particularly on systems with thousands of processing elements. Traditional performance optimization, tuning and modeling techniques that rely on uniform representation of compu...
Conference Paper
Full-text available
Chip multiprocessors (CMP) are widely used for high performance computing. Further, these CMPs are being configured in a hierarchical manner to compose a node in a cluster system. A major challenge to be addressed is efficient use of such cluster systems for large-scale scientific applications. In this paper, we quantify the performance gap resulti...
Conference Paper
One major challenge for grid environments is how to efficiently utilize geographically distributed resources given the large communication latency introduced by the wide area networks interconnecting different sites. In this paper, we use optical networks to connect four clusters from three different sites: Texas A&M University, University of Illinois at C...
Article
In this work, we study the performance and reliability of a crossbar molecular switch nanomemory demultiplexer and present the results. In particular, we investigate the impact on the performance of a crossbar nanomemory demultiplexer of implementing a combination of error correction coding and multi-switch junction fault tolerance schemes...
Article
Full-text available
Nanoscale elements are fabricated using bottom-up processes, and as such are prone to high levels of defects. Therefore, fault-tolerance is crucial for the realization of practical nanoscale devices. In this paper, we investigate a fault-tolerance scheme that utilizes redundancies in the rows and columns of a nanoscale crossbar molecular switch mem...
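A minimal sketch of the underlying idea, with a hypothetical interface: logical rows that test defective are remapped onto spare physical rows (the paper's scheme covers both rows and columns):

```python
# Remap defective rows of a crossbar memory onto spare rows; the function
# and its arguments are illustrative, not the paper's implementation.

def build_row_map(defective_rows, num_rows, num_spares):
    """Map each logical row to a working physical row, using spares."""
    spares = list(range(num_rows, num_rows + num_spares))
    row_map = {}
    for row in range(num_rows):
        if row in defective_rows:
            if not spares:
                raise RuntimeError("not enough spare rows to repair device")
            row_map[row] = spares.pop(0)   # substitute a spare row
        else:
            row_map[row] = row             # healthy row maps to itself
    return row_map

print(build_row_map(defective_rows={2, 5}, num_rows=8, num_spares=2))
```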
Article
Currently, clusters of shared memory symmetric multiprocessors (SMPs) are one of the most common parallel computing systems, for which some existing environments have between 8 and 32 processors per node. Examples of such environments include some supercomputers: DataStar p655 (P655 and P655m) and P690 at the San Diego Supercomputing Center, and Sea...
Conference Paper
This paper presents a performance and reliability analysis of a scaled crossbar molecular switch memory and demultiplexer. In particular, we compare our multi-switch junction fault tolerance scheme with a banking defect tolerance scheme. Results indicate that delay and power scale linearly with the number of redundant molecular switch junctions...
Article
Full-text available
We report on some of the interactions between two SciDAC projects: The National Computational Infrastructure for Lattice Gauge Theory (USQCD), and the Performance Engineering Research Institute (PERI). Many modern scientific programs consistently report the need for faster computational resources to maintain global competitiveness. However, as the...
Article
Full-text available
As part of the Performance Engineering Research Institute (PERI) effort, the Performance Database Working Group, which involves PERI researchers as well as outside researchers at the University of Oregon, Portland State University, and Texas A&M University, has developed technology for storing performance data collected by a number of performance m...
Conference Paper
The Lattice Boltzmann method is widely used in simulating fluid flows. In this paper, we present the performance analysis, modeling and prediction of a parallel multiblock Lattice Boltzmann application on up to 512 processors on three SMP clusters: two IBM SP systems at San Diego Supercomputing Center (DataStar - p655 and p690) and one IBM SP syste...
Chapter
Full-text available
We present a technique for deriving predictions for the run times of parallel applications from the run times of “similar” applications that have executed in the past. The novel aspect of our work is the use of search techniques to determine those application characteristics that yield the best definition of similarity for the purpose of making pre...
Conference Paper
Nanoscale elements are fabricated using bottom-up processes, and as such are prone to high levels of defects. Therefore, fault-tolerance is crucial for the realization of practical nanoscale devices. In this paper, we investigate a fault tolerance scheme that utilizes redundancies in the rows and columns of a nanoscale crossbar molecular switch mem...
Conference Paper
Nanoscale elements are fabricated using bottom-up processes, and as such they are prone to high levels of defects. Defect-tolerance will play a crucial role in the realization of practical nanoscale devices. In this paper we investigate the performance impact of combining a molecular switch junction with an ECC demultiplexer to allow for enhanced f...
Article
Cosmology SAMR simulations have played a prominent role in the field of astrophysics. The emerging distributed computing systems provide an economic alternative to the traditional parallel machines, and enable scientists to conduct cosmological simulations that require vast computing power. An important issue of conducting distributed cosmological...
Article
The Prophesy system is a performance analysis and modeling infrastructure that allows users to record many different parameters relevant to an application's performance. A key component of the Prophesy system is the web-based automated performance modeling system, which allows a developer to quickly gain insight into the performance of an application code
Conference Paper
Full-text available
It is anticipated that self assembled ultra-dense nanomemories will be more susceptible to manufacturing defects and transient faults than conventional CMOS-based memories, thus the need exists for fault-tolerant memory architectures. The development of such architectures will require intense analysis in terms of achievable performance measures-pow...
Conference Paper
Full-text available
Distributed systems are available and provide vast compute and data resources to users. With the availability of multiple resources, one of the major issues to be addressed is site selection. Users have access to many resource sites from which to select for execution of applications. In this paper, we quantify the advantages of using performanc...
Chapter
We present two novel dynamic load balancing schemes for SAMR applications: one is for parallel systems denoted as parallel DLB and the other is for distributed systems denoted as distributed DLB. Parallel DLB scheme divides the load balancing process into two steps: moving-grid phase and splitting-grid phase. Distributed DLB scheme takes into consi...
Article
It is anticipated that self assembled ultra-dense nanomemories will be more susceptible to manufacturing defects and transient faults than conventional CMOS-based memories, thus the need exists for fault-tolerant memory architectures. The development of such architectures will require intense analysis in terms of achievable performance measures - p...
Article
We present a technique for predicting the run times of parallel applications based upon the run times of “similar” applications that have executed in the past. The novel aspect of our work is the use of search techniques to determine those application characteristics that yield the best definition of similarity for the purpose of making predictions...
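A minimal sketch of the approach with made-up job records: search over subsets of job characteristics for the definition of "similarity" that minimizes prediction error, then predict a new job's run time from the matching historical runs. Exhaustive search and a leave-one-in error estimate stand in for the paper's search techniques here:

```python
from itertools import combinations

history = [
    {"user": "a", "app": "cfd", "nodes": 64,  "runtime": 3600},
    {"user": "a", "app": "cfd", "nodes": 64,  "runtime": 3500},
    {"user": "b", "app": "cfd", "nodes": 128, "runtime": 2100},
    {"user": "b", "app": "qcd", "nodes": 128, "runtime": 7400},
    {"user": "a", "app": "qcd", "nodes": 64,  "runtime": 7100},
]
features = ["user", "app", "nodes"]

def predict(job, keys):
    """Mean runtime of historical jobs matching `job` on all `keys`."""
    similar = [h["runtime"] for h in history
               if all(h[k] == job[k] for k in keys)]
    return sum(similar) / len(similar)

def error(keys):
    # Simplified (leave-one-in) mean absolute error over history.
    return sum(abs(predict(h, keys) - h["runtime"]) for h in history) / len(history)

best = min((keys for r in range(1, len(features) + 1)
            for keys in combinations(features, r)), key=error)
print("best similarity definition:", best)
print("prediction:", predict({"user": "a", "app": "cfd", "nodes": 64}, best))
```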
Conference Paper
Full-text available
Summary form only given. Kernel coupling quantifies the interaction between adjacent kernels and chains of kernels in an application. A kernel can be a loop, procedure or file. In our previous work, we used the kernel coupling values to identify how to combine the execution times of the individual kernels that compose the application to predict the executi...
Article
A typical cosmological simulation requires a large amount of compute power, which is hard to satisfy with a single machine. Distributed systems provide the opportunity to execute such large-scale applications. As part of the iGrid Research Demonstration 2002, we explored a large-scale cosmology application on a distributed system composed of two su...
Article
Performance is an important issue with any application, especially grid applications. Efficient execution of applications requires insight into how the system features impact the performance of the applications. This insight generally results from significant experimental analysis and possibly the development of performance models. This paper prese...
Article
Over the last decade, processors have made enormous gains in speed. But increases in the speed of secondary and tertiary storage devices have not kept pace with these gains. The result is that secondary and tertiary storage access times dominate the execution time of data-intensive computations. Therefore, in scientific computations, efficient data...
Article
With the increasing number of scientific applications manipulating huge amounts of data, effective high-level data management is an increasingly important problem. Unfortunately, so far the solutions to the high-level data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance...
Article
Full-text available
Performance is an important issue with any application, especially grid applications. Efficient execution of applications requires insight into how the system features impact the performance of the applications. This insight generally results from significant experimental analysis and possibly the development of performance models. This paper prese...
Conference Paper
Full-text available
Kernel coupling refers to the effect that kernel i has on kernel j in relation to running each kernel in isolation. The two kernels can correspond to adjacent kernels or a chain of three or more kernels in the control flow of an application. In previous work, we used kernel coupling to provide insights on where further algorithm and code implementati...
Article
Adaptive mesh refinement (AMR) is a type of multiscale algorithm that achieves high resolution in localized regions of dynamic, multidimensional numerical simulations. One of the key issues related to AMR is dynamic load balancing (DLB), which allows large-scale adaptive applications to run efficiently on parallel systems. In this paper, we present...
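A toy sketch of a moving-grid rebalancing step in this spirit (not the paper's actual scheme): grids migrate from the most loaded processor to the least loaded one as long as the move reduces imbalance:

```python
# grids_per_proc: list of lists of grid workloads, one inner list per processor.

def rebalance(grids_per_proc):
    def load(p):
        return sum(grids_per_proc[p])
    while True:
        procs = range(len(grids_per_proc))
        src = max(procs, key=load)   # most loaded processor
        dst = min(procs, key=load)   # least loaded processor
        # Candidate grids: moving them must not overshoot the balance point.
        movable = [g for g in grids_per_proc[src]
                   if g > 0 and load(src) - g >= load(dst) + g]
        if not movable:
            return grids_per_proc
        g = max(movable)             # move the largest grid that still helps
        grids_per_proc[src].remove(g)
        grids_per_proc[dst].append(g)

print(rebalance([[9, 7, 5], [2], [3, 1]]))
```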
Article
Full-text available
Large distributed systems such as Computational and Data Grids require that a substantial amount of monitoring data be collected for a variety of tasks such as fault detection, performance analysis, performance tuning, performance prediction, and scheduling. Some tools are currently available and others are being developed for collecting and forwarding...
Article
Traditional performance optimization techniques have focused on finding the kernel in an application that is the most time consuming and attempting to optimize it. In this paper, we focus on an optimization technique with a more global perspective of the application. In particular, we present a methodology for measuring the interaction, or coupling...
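One common formulation, shown here with made-up timings: the coupling of kernels i and j is the ratio of their measured combined time to the sum of their isolated times:

```python
# c_ij = T(i,j) / (T(i) + T(j)), where T(i) and T(j) are times of kernels
# i and j run in isolation and T(i,j) is the time of the pair run together.
# c < 1 suggests constructive interaction (e.g., cache reuse across
# kernels); c > 1 suggests destructive interference.

def coupling(t_i, t_j, t_ij):
    return t_ij / (t_i + t_j)

t_i, t_j, t_ij = 4.2, 3.1, 6.5   # seconds; illustrative measurements
c = coupling(t_i, t_j, t_ij)
print(f"coupling = {c:.3f} ({'constructive' if c < 1 else 'destructive'})")
```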
Article
Mesh partitioning for homogeneous systems has been studied extensively; however, mesh partitioning for distributed systems is a relatively new area of research. To ensure efficient execution on a distributed system, the heterogeneities in the processor and network performance must be taken into consideration in the partitioning process; equal size...
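A minimal sketch of the heterogeneity-aware idea (the speeds are hypothetical, and the actual partitioner also weighs network performance): partition sizes are made proportional to measured processor speeds rather than kept equal:

```python
def partition_sizes(num_elements, proc_speeds):
    """Assign each processor a share of elements proportional to its speed."""
    total = sum(proc_speeds)
    sizes = [int(num_elements * s / total) for s in proc_speeds]
    sizes[0] += num_elements - sum(sizes)  # hand rounding remainder to one proc
    return sizes

# Four processors, two of them twice as fast as the others.
print(partition_sizes(100000, [2.0, 2.0, 1.0, 1.0]))
```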
Conference Paper
Full-text available
Performance models provide significant insight into the performance relationships between an application and the system used for execution. The major obstacle to developing performance models is the lack of knowledge about the performance relationships between the different functions that compose an application. This paper addresses the issue by us...
Conference Paper
In this paper we investigate the data access patterns and file I/O behaviors of a production cosmology application that uses the adaptive mesh refinement (AMR) technique for its domain decomposition. This application was originally developed using the Hierarchical Data Format (HDF version 4) I/O library, and since HDF4 does not provide parallel I/O faci...
Article
Full-text available
Dynamic load balancing (DLB) for parallel systems has been studied extensively; however, DLB for distributed systems is relatively new. To efficiently utilize computing resources provided by distributed systems, an underlying DLB scheme must address both heterogeneous and dynamic features of distributed systems. In this paper, we propose a DLB schem...
Article
Shortest path algorithms are required by several transportation applications; furthermore, the shortest path computation in these applications can account for a large percentage of the total execution time. Since these algorithms are very computationally intense, parallel processing can provide the compute power and memory required to solve large p...
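For reference, the sequential label-correcting kernel that such parallel algorithms distribute across processors looks roughly like this (the graph and weights are made up):

```python
from collections import deque

# Graph as an adjacency list: {node: [(neighbor, edge_weight), ...]}.

def label_correcting(graph, source):
    dist = {v: float("inf") for v in graph}
    dist[source] = 0.0
    queue = deque([source])
    while queue:                      # in parallel versions, each processor
        u = queue.popleft()           # scans its own queue, and termination
        for v, w in graph[u]:         # detection decides when all are idle
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                queue.append(v)
    return dist

g = {"a": [("b", 2.0), ("c", 5.0)], "b": [("c", 1.0)], "c": []}
print(label_correcting(g, "a"))
```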
Article
Large distributed systems such as Computational and Data Grids require that a substantial amount of monitoring data be collected for a variety of tasks such as fault detection, performance analysis, performance tuning, performance prediction, and scheduling. Some tools are currently available and others are being developed for collecting and forwarding...
Article
Full-text available
This document presents a simple case study of a Grid performance system based on the Grid Monitoring Architecture (GMA) being developed by the Grid Forum Performance Working Group. It describes how the various system components would interact for a very basic monitoring scenario, and is intended to introduce people to the terminology and concepts p...
Article
Mesh partitioning is an important step for parallel scientific applications, in particular finite element analyses. A good partitioner will minimize both the time spent on local computation and on interprocessor communication. It is often the case that these two goals cannot be satisfied simultaneously. In this paper, we use analytical and experime...
Article
Full-text available
Computation of shortest paths is an integral component of many applications such as transportation planning and VLSI design. Frequently, a shortest path algorithm is selected for a given application based on the performance of the algorithm for a set of test networks. The performance of this algorithm, however, can be significantly different for ne...
Article
Shortest path computation is required by a large number of applications such as VLSI, transportation and communication networks. These applications, which are often very complex and have sparse networks, generally use parallel labeling shortest path algorithms. Such algorithms, when implemented on a distributed memory machine, require termination d...
Article
Shortest path computation is required by a large number of applications such as VLSI, transportation and communication networks. These applications, which often use parallel processing, require an efficient parallel shortest path algorithm. The experimental work related to parallel shortest path algorithms has focused on the development of efficien...
Conference Paper
Full-text available
Dynamic load balancing (DLB) for parallel systems has been studied extensively; however, DLB for distributed systems is relatively new. To efficiently utilize computing resources provided by distributed systems, an underlying DLB scheme must address both heterogeneous and dynamic features of distributed systems. In this paper, we propose a DLB sche...
Conference Paper
Full-text available
Performance models provide significant insight into the performance relationships between an application and the system, either parallel or distributed, used for execution. The development of models often requires significant time, sometimes months; this is especially the case for detailed models. This paper presents ou...
Conference Paper
Full-text available
Adaptive Mesh Refinement (AMR) is a type of multiscale algorithm that achieves high resolution in localized regions of dynamic, multidimensional numerical simulations. One of the key issues related to AMR is dynamic load balancing (DLB), which allows large-scale adaptive applications to run efficiently on parallel systems. In this paper we present...
Article
Full-text available
Adaptive Mesh Refinement (AMR) is a type of multiscale algorithm that dynamically achieves high resolution in localized regions of multidimensional numerical simulations. A dynamic load balancing (DLB) scheme for structured AMR applications was proposed in (19). Unfortunately, the overhead introduced by this DLB scheme is significant. Further, a p...
Article
Computer architecture and programming are disciplines that require extensive experimentation with computer tools such as simulators and compilers. At the authors' universities, several tools are being incorporated in courses at the junior and senior levels by using a powerful, web-based network-computing system as a computational and educational re...
Article
Some computational grid applications have very large resource requirements and need simultaneous access to resources from more than one parallel computer. Current scheduling systems do not provide mechanisms to gain such simultaneous access without the help of human administrators of the computer systems. In this work, we propose and evaluate sever...
Article
Full-text available
In this paper we develop a performance model for analyzing the end-to-end lag in a combined supercomputer/virtual environment. We first present a general model and then use this model to analyze the lag of an interactive, immersive visualization of a scientific application. This application consists of a finite element simulation executed on an IBM...
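A minimal additive sketch of such a lag model (the components and numbers are hypothetical, not the paper's calibrated model): end-to-end lag is the sum of simulation time per update, network transfer of the results, and rendering time:

```python
def end_to_end_lag(sim_time_s, result_bytes, net_bw_bytes_s,
                   net_latency_s, render_time_s):
    # Transfer time = fixed latency + payload volume / bandwidth.
    transfer = net_latency_s + result_bytes / net_bw_bytes_s
    return sim_time_s + transfer + render_time_s

lag = end_to_end_lag(sim_time_s=0.50, result_bytes=2e6,
                     net_bw_bytes_s=12.5e6, net_latency_s=0.002,
                     render_time_s=0.033)
print(f"end-to-end lag: {lag * 1e3:.0f} ms")
```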
Conference Paper
On many computers, a request to run a job is not serviced immediately but instead is placed in a queue and serviced only when resources are released by preceding jobs. In this paper, we build on runtime prediction techniques that we developed in previous research to explore two problems. The first problem is to predict how long applications will wa...
Article
Full-text available
Efficient execution of applications requires insight into how the system features affect the performance of the application. For distributed systems, the task of gaining this insight is complicated by the complexity of the system features. This insight generally results from significant experimental analysis and possibly the development of performa...
Article
Full-text available
With the increasing number of scientific applications manipulating huge amounts of data, effective high-level data management is an increasingly important problem. Unfortunately, so far the solutions to the high-level data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance...
Article
Computer architecture and programming are disciplines that require extensive experimentation with computer tools, such as simulators and compilers. At the authors' universities, several tools are being incorporated in courses at the junior and senior levels by using a powerful, web-based network-computing system as a computational and educational r...
Article
Over the last decade, processors have made enormous gains in speed. But increases in the speed of secondary and tertiary storage devices have not kept pace with these gains. The result is that secondary and tertiary storage access times dominate the execution time of data-intensive computations. Therefore, in scientific computations, efficient data...
Conference Paper
The authors present a technique for deriving predictions for the run times of parallel applications from the run times of similar applications that have executed in the past. The novel aspect of the work is the use of search techniques to determine those application characteristics that yield the best definition of similarity for the purpose of mak...
Article
Full-text available
Traditional performance optimization techniques have focused on finding the kernel in an application that is the most time consuming and attempting to optimize it. In this paper we focus on optimization techniques with a more global perspective of the application. In particular, we present a methodology for measuring the interaction or coupling bet...
Article
Full-text available
Virtual prototyping involves a synthesis of engineering methodology and immersive, three-dimensional visualization technology. Ideally, this is a process in which computational models are used in place of physical models in the development of a new product or design concept. If used successfully, virtual prototyping can lead to more rapid product d...
Conference Paper
With the increasing number of scientific applications manipulating huge amounts of data, effective data management is an increasingly important problem. Unfortunately, so far the solutions to this data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance file systems) or pro...
Article
Mesh partitioning for distributed systems differs from partitioning for homogeneous systems in that both system and application heterogeneities need to be taken into consideration. In this paper, we focus on the issue of optimal number of partitions with local and remote communication. This is an important issue due to the fact that local and remot...
Article
Full-text available
Traditional performance optimization techniques have focused on finding the kernel in a program that is the most time consuming and attempting to optimize it. We introduce a methodology for measuring and representing the interaction, or coupling, between kernels that improves upon the accuracy of the traditional method. Then we demonstrate the bene...
Article
Shortest path computation is required by a large number of applications such as VLSI, transportation and communication networks. These applications, which are often very complex and have sparse networks, generally use parallel labeling shortest path algorithms. Such algorithms, when implemented on a distributed memory machine, require termination d...
Article
In this paper we present a model of the communication of parallel, 2-D finite element problems implemented on the Intel Delta. Our communication model consists of start-up time, transmission latency, and processor wait time due to synchronization. We find that the wait time can account for up to 25% of the total communication time...
Article
Full-text available
The computationally-intensive step of the finite element method is the solution of a linear system of equations. Very large and very sparse system matrices result from three-dimensional finite-element applications. The sparsity must be exploited for efficient use of memory and computational components (e.g., floating-point units) in executing the s...
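To illustrate what exploiting sparsity means in this setting: a matrix-vector product in compressed sparse row (CSR) form, the kernel at the heart of iterative solvers, touches only the stored nonzeros (the example matrix is made up):

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in CSR form."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        # Row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example: [[4, 0, 1], [0, 3, 0], [1, 0, 5]] stored as CSR.
values  = np.array([4.0, 1.0, 3.0, 1.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(csr_matvec(values, col_idx, row_ptr, np.array([1.0, 2.0, 3.0])))
```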
Article
The use of parallel processors for implementing the finite element method has made feasible the analyses of large applications, especially three-dimensional applications. The speedup, however, is limited by the interprocessor communication requirements. In this paper we analyze the effects of interprocessor communications on the resultant speedup of...
