Hybrid Parallel Computation of Integration in GRACE
Yuasa, F; Kawabata, S; Perret-Gallix, D; Itakura, K; Hotta, Y; Okuda, M; Yuasa, Fukuko; Ishikawa, Tadashi; Kawabata, Setsuya; Perret-Gallix, Denis; Itakura, Kazuhiro; Hotta, Yukihiko; Okuda, Motoi
2000-01-01
With the integrated software package GRACE, it is possible to generate Feynman diagrams, calculate the total cross section and generate physics events automatically. We outline the hybrid method of parallel computation of the multi-dimensional integration in GRACE. We used MPI (Message Passing Interface) as the parallel library and, to improve performance, embedded a dynamic load-balancing mechanism. The reduction rate of the practical execution time was studied.
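The dynamic load balancing mentioned in this abstract can be illustrated with a small sketch. This is not the GRACE implementation (which uses MPI); it is a minimal Python illustration, assuming a hypothetical test integrand and a thread pool standing in for the parallel workers. The integration domain is split into many small chunks, and each worker picks up the next chunk as soon as it is free.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def integrand(x):
    """Hypothetical test integrand: product of the coordinates on the unit cube."""
    result = 1.0
    for xi in x:
        result *= xi
    return result


def sample_chunk(seed, n_points, dim):
    """One work unit: sum the integrand over n_points uniform random points."""
    rng = random.Random(seed)  # per-chunk seed keeps the result reproducible
    total = 0.0
    for _ in range(n_points):
        total += integrand([rng.random() for _ in range(dim)])
    return total


def integrate(dim=4, n_chunks=64, points_per_chunk=2000, workers=4):
    """Monte Carlo estimate of the integral over [0, 1]^dim.

    There are many more chunks than workers. A worker that finishes early
    simply takes the next pending chunk, so slow workers end up processing
    fewer chunks (dynamic load balancing through a shared task queue).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(sample_chunk, seed, points_per_chunk, dim)
                   for seed in range(n_chunks)]
        total = sum(f.result() for f in futures)
    return total / (n_chunks * points_per_chunk)


estimate = integrate()
print(f"estimate = {estimate:.4f}, exact = {0.5 ** 4:.4f}")
```

Python threads do not give a real speedup for CPU-bound work, so the sketch only shows the scheduling idea; in an MPI setting the same pattern would typically be a coordinating rank handing out chunks on request.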
A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems
Blazewicz, Marek; Diener, Peter; Koppelman, David M; Kurowski, Krzysztof; Löffler, Frank; Schnetter, Erik; Tao, Jian
2012-01-01
Heterogeneous systems are becoming more common on High Performance Computing (HPC) systems. Even using tools like CUDA and OpenCL it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge (a library based framework for heterogeneous multi-core systems), Zippy (a framework for parallel execution of codes on multiple GPUs), BSGP (a new programming language for general purpose computation on the GPU) and CUDA-lite (an enhancement to CUDA that transforms code based on annotations). In addition, efforts are underway to improve compiler tools for automatic parallelization and optimization of affine loop nests for GPUs and for automatic translation of OpenMP parallelized codes to CUDA. In this paper we present an alternative approach: a new computational framework for the development of massively data parallel scientific applications suitable for use on such petascale/exascale hybrid systems, built upon the highly scalable Cactus framework. As the first...
Parallel Computing Characteristics of CUPID code under MPI and Hybrid environment
Energy Technology Data Exchange (ETDEWEB)
Lee, Jae Ryong; Yoon, Han Young [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of); Jeon, Byoung Jin; Choi, Hyoung Gwon [Seoul National Univ. of Science and Technology, Seoul (Korea, Republic of)
2014-05-15
In this paper, the characteristics of a parallel algorithm for solving an elliptic-type equation of CUPID via the domain decomposition method using MPI are presented, and the parallel performance is estimated in terms of scalability, expressed as the speedup ratio. In addition, the time-consuming pattern of major subroutines is studied. Two different grid systems are taken into account: 40,000 meshes for the coarse system and 320,000 meshes for the fine system. Since the matrix of the CUPID code differs according to whether the flow is single-phase or two-phase, the effect of matrix shape is evaluated. The effect of the preconditioner for the matrix solver is also investigated. Finally, the hybrid (OpenMP+MPI) parallel algorithm is introduced and discussed in detail for the pressure solver. The component-scale thermal-hydraulics code CUPID has been developed for two-phase flow analysis; it adopts a three-dimensional, transient, three-field model and has been parallelized to meet the recent demand for long-transient and highly resolved multi-phase flow simulations. In this study, the parallel performance of the CUPID code was investigated in terms of scalability. The CUPID code was parallelized with the domain decomposition method. The MPI library was adopted to communicate information between neighboring domains. To manage the sparse matrix effectively, the CSR storage format is used. To take into account the characteristics of the pressure matrix, which becomes asymmetric for two-phase flow, both single-phase and two-phase calculations were run. In addition, the effect of the matrix size and preconditioning was also investigated. The fine mesh calculation shows better scalability than the coarse mesh because the coarse mesh is too small to benefit from decomposing the computational domain excessively. The fine mesh can show good scalability when the geometry is divided with consideration of the ratio between computation and communication time. For a given mesh, single-phase flow
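The CSR (compressed sparse row) storage format named in this abstract can be sketched as follows. This is a generic illustration, not code from CUPID; the small tridiagonal test matrix and the function names `to_csr` and `csr_matvec` are assumptions made for the example. CSR keeps only the nonzero values, their column indices, and a row pointer array marking where each row starts.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) into the three CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        # row_ptr[i + 1] is the index one past the last nonzero of row i
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y


# Hypothetical 4x4 tridiagonal matrix, similar in shape to a 1D pressure stencil
dense = [
    [4.0, -1.0, 0.0, 0.0],
    [-1.0, 4.0, -1.0, 0.0],
    [0.0, -1.0, 4.0, -1.0],
    [0.0, 0.0, -1.0, 4.0],
]
values, col_idx, row_ptr = to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
print(row_ptr)  # [0, 2, 5, 8, 10]
print(y)        # [2.0, 4.0, 6.0, 13.0]
```

Because each row's product is independent, the outer loop over rows is the natural place to split work between processes (by subdomain) or threads (within a subdomain).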
Imachi, Hiroto
2015-01-01
Optimally hybrid numerical solvers were constructed for the massively parallel generalized eigenvalue problem (GEP). The strong scaling benchmark was carried out on the K computer and other supercomputers for electronic structure calculation problems with matrix sizes of M = 10^4-10^6 and up to 10^5 cores. The procedure of GEP is decomposed into two subprocedures: the reducer to the standard eigenvalue problem (SEP) and the solver of SEP. A hybrid solver is constructed when a routine is chosen for each subprocedure from the three parallel solver libraries ScaLAPACK, ELPA and EigenExa. The hybrid solvers with the two newer libraries, ELPA and EigenExa, give better benchmark results than the conventional ScaLAPACK library. The detailed analysis of the results implies that the reducer can be a bottleneck in next-generation (exa-scale) supercomputers, which gives guidance for future research. The code was developed as a middleware and a mini-application and will appear online.
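The two subprocedures described in this abstract can be written out as a short derivation. The sketch below assumes the common Cholesky-based reduction with a real symmetric matrix A and a symmetric positive definite matrix B; the abstract does not state which reduction each library routine uses, so the symbols here are illustrative.

```latex
\begin{align*}
  % Generalized eigenvalue problem (GEP)
  A\,y &= \lambda\,B\,y \\
  % Reducer, step 1: Cholesky factorization of B (assumed positive definite)
  B &= L\,L^{\mathsf{T}} \\
  % Reducer, step 2: transformed matrix and eigenvector
  A' &= L^{-1} A\,L^{-\mathsf{T}}, \qquad z = L^{\mathsf{T}} y \\
  % Standard eigenvalue problem (SEP), passed to the SEP solver
  A'\,z &= \lambda\,z
\end{align*}
```

The eigenvalues are unchanged by the reduction, and the original eigenvectors are recovered from y = L^{-T} z.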
Institute of Scientific and Technical Information of China (English)
Guo-Liang Chen; Guang-Zhong Sun; Yun-Quan Zhang; Ze-Yao Mo
2006-01-01
In this paper, we present a general survey on parallel computing. The main contents include the parallel computer system, which is the hardware platform of parallel computing; the parallel algorithm, which is the theoretical base of parallel computing; and parallel programming, which is the software support of parallel computing. After that, we also introduce some parallel applications and enabling technologies. We argue that parallel computing research should form an integrated methodology of "architecture - algorithm - programming - application". Only in this way can parallel computing research develop continuously and become more realistic.
Al Jarro, Ahmed
2011-08-01
A hybrid MPI/OpenMP scheme for efficiently parallelizing the explicit marching-on-in-time (MOT)-based solution of the time-domain volume (Volterra) integral equation (TD-VIE) is presented. The proposed scheme equally distributes tested field values and operations pertinent to the computation of tested fields among the nodes using the MPI standard; while the source field values are stored in all nodes. Within each node, OpenMP standard is used to further accelerate the computation of the tested fields. Numerical results demonstrate that the proposed parallelization scheme scales well for problems involving three million or more spatial discretization elements. © 2011 IEEE.
Fox, Geoffrey C; Messina, Guiseppe C
2014-01-01
A clear illustration of how parallel computers can be successfully applied to large-scale scientific computations. This book demonstrates how a variety of applications in physics, biology, mathematics and other sciences were implemented on real parallel computers to produce new scientific results. It investigates issues of fine-grained parallelism relevant for future supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configure different massively parallel machines, design and implement basic system software, and develop
Energy Technology Data Exchange (ETDEWEB)
1991-10-23
An account of the Caltech Concurrent Computation Program (C³P), a five-year project that focused on answering the question: "Can parallel computers be used to do large-scale scientific computations?" As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C³P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C³P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Parallelism in matrix computations
Gallopoulos, Efstratios; Sameh, Ahmed H
2016-01-01
This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded, Vandermonde, Toeplitz, and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...
Computing Parallelism in Discourse
Gardent, C; Gardent, Claire; Kohlhase, Michael
1997-01-01
Although much has been said about parallelism in discourse, a formal, computational theory of parallelism structure is still outstanding. In this paper, we present a theory which, given two parallel utterances, predicts which are the parallel elements. The theory consists of a sorted, higher-order abductive calculus and we show that it reconciles the insights of discourse theories of parallelism with those of Higher-Order Unification approaches to discourse semantics, thereby providing a natural framework in which to capture the effect of parallelism on discourse semantics.
Morse, H Stephen
1994-01-01
Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment. Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi
A fully parallel, high precision, N-body code running on hybrid computing platforms
Capuzzo-Dolcetta, R; Punzo, D
2012-01-01
We present a new implementation of the numerical integration of the classical, gravitational, N-body problem based on a high-order Hermite integration scheme with block time steps, with a direct evaluation of the particle-particle forces. The main innovation of this code (called HiGPUs) is its full parallelization, exploiting both OpenMP and MPI in the use of the multicore Central Processing Units as well as either Compute Unified Device Architecture (CUDA) or OpenCL for the hosted Graphic Processing Units. We tested both performance and accuracy of the code using up to 256 GPUs in the supercomputer IBM iDataPlex DX360M3 Linux Infiniband Cluster provided by the Italian supercomputing consortium CINECA, for values of N up to 8 million. We were able to follow the evolution of a system of 8 million bodies for a few crossing times, a task previously unreached by direct summation codes. The code is freely available to the scientific community.
Algorithmically specialized parallel computers
Snyder, Lawrence; Gannon, Dennis B
1985-01-01
Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer. This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster
Algorithms and parallel computing
Gebali, Fayez
2011-01-01
There is a software gap between the hardware potential and the performance that can be attained using today's software parallel program development tools. The tools need manual intervention by the programmer to parallelize the code. Programming a parallel computer requires closely studying the target algorithm or application, more so than in the traditional sequential programming we have all learned. The programmer must be aware of the communication and data dependencies of the algorithm or application. This book provides the techniques to explore the possible ways to
SKIRT: Hybrid parallelization of radiative transfer simulations
Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.
2017-07-01
We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
Energy Technology Data Exchange (ETDEWEB)
Rosa, Massimiliano [Los Alamos National Laboratory; Warsa, James S [Los Alamos National Laboratory; Perks, Michael [Los Alamos National Laboratory
2010-12-14
We have implemented a cell-wise, block-Gauss-Seidel (bGS) iterative algorithm for the solution of the S_n transport equations on the Roadrunner hybrid, parallel computer architecture. A compute node of this massively parallel machine comprises AMD Opteron cores that are linked to a Cell Broadband Engine™ (Cell/B.E.). LAPACK routines have been ported to the Cell/B.E. in order to make use of its parallel Synergistic Processing Elements (SPEs). The bGS algorithm is based on the LU factorization and solution of a linear system that couples the fluxes for all S_n angles and energy groups on a mesh cell. For every cell of a mesh that has been parallel decomposed on the higher-level Opteron processors, a linear system is transferred to the Cell/B.E. and the parallel LAPACK routines are used to compute a solution, which is then transferred back to the Opteron, where the rest of the computations for the S_n transport problem take place. Compared to standard parallel machines, a hundred-fold speedup of the bGS was observed on the hybrid Roadrunner architecture. Numerical experiments with strong and weak parallel scaling demonstrate the bGS method is viable and compares favorably to full parallel sweeps (FPS) on two-dimensional, unstructured meshes when it is applied to optically thick, multi-material problems. As expected, however, it is not as efficient as FPS in optically thin problems.
Energy Technology Data Exchange (ETDEWEB)
DeHart, Mark D [ORNL; Williams, Mark L [ORNL; Bowman, Stephen M [ORNL
2010-01-01
The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement
On the Parallel Computation of Characteristic Set
Institute of Scientific and Technical Information of China (English)
WU Yongwei; YANG Guangwen; LIN Dongdai; HUANG Qifeng; ZHENG Weimin
2004-01-01
As one of the efficient approaches for computing various zero decompositions of any set of multivariable polynomials, Characteristic set (CS) computation is very compute intensive. In this paper, our purpose is to build an effective parallel computation model for CS. We first present the parallel algorithm for CS computation. Then the term Polynomial complexity grade (PCG) is defined, and our load balance strategy is put forward based on it. Based on an analysis of two basic transmission methods, we construct an efficient hybrid method for the data communication of parallel CS computation. Finally, experiments and timing data demonstrate the stability and high performance of our parallel computation model for the CS method.
Computer Assisted Parallel Program Generation
Kawata, Shigeo
2015-01-01
Parallel computation is widely employed in scientific research, engineering activities and product development. Writing parallel programs is not always a simple task, depending on the problems solved. Large-scale scientific computing, huge data analyses and precise visualizations, for example, require parallel computations, and parallel computing needs parallelization techniques. In this chapter, parallel program generation support is discussed, and a computer-assisted parallel program generation system, P-NCAS, is introduced. Computer-assisted problem solving is one of the key methods to promote innovations in science and engineering, and contributes to enriching our society and our life toward a programming-free environment in computing science. Problem solving environment (PSE) research activities started in the 1970s to enhance programming power. P-NCAS is one of the PSEs; the PSE concept provides an integrated human-friendly computational software and hardware system to solve a target ...
Applied Parallel Computing Industrial Computation and Optimization
DEFF Research Database (Denmark)
Madsen, Kaj; Olesen, Dorte
Proceedings and the Third International Workshop on Applied Parallel Computing in Industrial Problems and Optimization (PARA96)
Parallel Computers in Signal Processing
Directory of Open Access Journals (Sweden)
Narsingh Deo
1985-07-01
Signal processing often requires a great deal of raw computing power for which it is important to take a look at parallel computers. The paper reviews various types of parallel computer architectures from the viewpoint of signal and image processing.
MASSIVE HYBRID PARALLELISM FOR FULLY IMPLICIT MULTIPHYSICS
Energy Technology Data Exchange (ETDEWEB)
Cody J. Permann; David Andrs; John W. Peterson; Derek R. Gaston
2013-05-01
As hardware advances continue to modify the supercomputing landscape, traditional scientific software development practices will become more outdated, ineffective, and inefficient. The process of rewriting/retooling existing software for new architectures is a Sisyphean task, and results in substantial hours of development time, effort, and money. Software libraries which provide an abstraction of the resources provided by such architectures are therefore essential if the computational engineering and science communities are to continue to flourish in this modern computing environment. The Multiphysics Object Oriented Simulation Environment (MOOSE) framework enables complex multiphysics analysis tools to be built rapidly by scientists, engineers, and domain specialists, while also allowing them to both take advantage of current HPC architectures, and efficiently prepare for future supercomputer designs. MOOSE employs a hybrid shared-memory and distributed-memory parallel model and provides a complete and consistent interface for creating multiphysics analysis tools. In this paper, a brief discussion of the mathematical algorithms underlying the framework and the internal object-oriented hybrid parallel design is given. Representative massively parallel results from several application areas are presented, and a brief discussion of future areas of research for the framework is provided.
Introduction to Parallel Computing
1992-05-01
Topology: 2D mesh of node boards; each node board has 1 application processor. C, Ada, C++, Data-parallel FORTRAN, FORTRAN-90 (late 1992). Development Tools ... parallel machines become the wave of the present, tools are increasingly needed to assist programmers in creating parallel tasks and coordinating ... their activities. Linda was designed to be such a tool. Linda was designed with three important goals in mind: to be portable, efficient, and easy to use
Heterogeneous Parallel Computing
2013-01-01
With processor core counts doubling every 18-24 months and penetrating all markets from high-end servers in supercomputers to desktops and laptops down to even mobile phones, we sit at the dawn of a world of ubiquitous parallelism, one where extracting performance via parallelism is paramount. That is, the "free lunch" to better performance, where programmers could rely on substantial increases in single-threaded performance to improve software, is over. The burden falls on developers to expl...
Hyndman, D E
2013-01-01
Analog and Hybrid Computing focuses on the operations of analog and hybrid computers. The book first outlines the history of computing devices that influenced the creation of analog and digital computers. The types of problems to be solved on computers, computing systems, and digital computers are discussed. The text looks at the theory and operation of electronic analog computers, including linear and non-linear computing units and use of analog computers as operational amplifiers. The monograph examines the preparation of problems to be deciphered on computers. Flow diagrams, methods of ampl
Zhang, Xiaohua; Wong, Sergio E; Lightstone, Felice C
2013-04-30
A mixed parallel scheme that combines message passing interface (MPI) and multithreading was implemented in the AutoDock Vina molecular docking program. The resulting program, named VinaLC, was tested on the petascale high performance computing (HPC) machines at Lawrence Livermore National Laboratory. To exploit the typical cluster-type supercomputers, thousands of docking calculations were dispatched by the master process to run simultaneously on thousands of slave processes, where each docking calculation takes one slave process on one node, and within the node each docking calculation runs via multithreading on multiple CPU cores and shared memory. Input and output of the program and the data handling within the program were carefully designed to deal with large databases and ultimately achieve HPC on a large number of CPU cores. Parallel performance analysis of the VinaLC program shows that the code scales up to more than 15K CPUs with a very low overhead cost of 3.94%. One million flexible compound docking calculations took only 1.4 h to finish on about 15K CPUs. The docking accuracy of VinaLC has been validated against the DUD data set by the re-docking of X-ray ligands and an enrichment study; 64.4% of the top-scoring poses have RMSD values under 2.0 Å. The program has been demonstrated to have good enrichment performance on 70% of the targets in the DUD data set. An analysis of the enrichment factors calculated at various percentages of the screening database indicates VinaLC has very good early recovery of actives.
Massively parallel quantum computer simulator
De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.
2007-01-01
We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray
HOPSPACK: Hybrid Optimization Parallel Search Package.
Energy Technology Data Exchange (ETDEWEB)
Gray, Genetha Anne.; Kolda, Tamara G.; Griffin, Joshua; Taddy, Matt; Martinez-Canales, Monica L.
2008-12-01
In this paper, we describe the technical details of HOPSPACK (Hybrid Optimization Parallel Search Package), a new software platform which facilitates combining multiple optimization routines into a single, tightly-coupled, hybrid algorithm that supports parallel function evaluations. The framework is designed such that existing optimization source code can be easily incorporated with minimal code modification. By maintaining the integrity of each individual solver, the strengths and code sophistication of the original optimization package are retained and exploited.
Parallel Hybrid Vehicle Optimal Storage System
Bloomfield, Aaron P.
2009-01-01
A paper reports the results of a Hybrid Diesel Vehicle Project focused on a parallel hybrid configuration suitable for diesel-powered, medium-sized, commercial vehicles commonly used for parcel delivery and shuttle buses, as the missions of these types of vehicles require frequent stops. During these stops, electric hybridization can effectively recover the vehicle's kinetic energy during the deceleration, store it onboard, and then use that energy to assist in the subsequent acceleration.
Preconditioned method in parallel computation
Institute of Scientific and Technical Information of China (English)
Wu Ruichan; Wei Jianing
2006-01-01
The grid equations in the decomposed domain are solved by parallel computation, and a local orthogonalization method for large-scale numerical computation is presented. It constructs a preconditioned iteration matrix by combining predigesting LU decomposition with local orthogonalization, and the convergence of the solution is proved. The example indicates that this algorithm can effectively increase the computation speed and that it is quite stable.
Parallel algorithms and cluster computing
Hoffmann, Karl Heinz
2007-01-01
This book presents major advances in high performance computing as well as major advances due to high performance computing. It contains a collection of papers in which results achieved in the collaboration of scientists from computer science, mathematics, physics, and mechanical engineering are presented. From the science problems to the mathematical algorithms and on to the effective implementation of these algorithms on massively parallel and cluster computers we present state-of-the-art methods and technology as well as exemplary results in these fields. This book shows that problems which seem superficially distinct become intimately connected on a computational level.
Massive Parallel Quantum Computer Simulator
De Raedt, K; De Raedt, H; Ito, N; Lippert, T; Michielsen, K; Richter, M; Trieu, B; Watanabe, H; Lippert, Th.
2006-01-01
We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as a benchmark for testing high-end parallel computers.
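The figures of 36 qubits and 1 TB quoted in this abstract are consistent with simple state-vector arithmetic. The sketch below assumes each complex amplitude is stored as two 64-bit floating-point numbers (16 bytes); the abstract does not state the storage format, so `bytes_per_amplitude` is an assumption, and `state_vector_bytes` is a hypothetical helper name.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory needed to hold all 2**n_qubits complex amplitudes.

    bytes_per_amplitude=16 assumes double-precision real and imaginary parts.
    """
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (30, 33, 36):
    size = state_vector_bytes(n)
    print(f"{n} qubits: {size / 2**40:.3f} TiB")
```

Under this assumption, 36 qubits need 2^36 × 16 = 2^40 bytes, and every additional qubit doubles the requirement, which is why memory rather than arithmetic speed tends to set the limit for this kind of simulation.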
The science of computing - Parallel computation
Denning, P. J.
1985-01-01
Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concomitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.
Parallel computation of rotating flows
DEFF Research Database (Denmark)
Lundin, Lars Kristian; Barker, Vincent A.; Sørensen, Jens Nørkær
1999-01-01
is that of solving a singular, large, sparse, over‐determined linear system of equations, and the iterative method CGLS is applied for this purpose. We discuss some of the mathematical and numerical aspects of this procedure and report on the performance of our software on a wide range of parallel computers. Darbe...
An Energy Conserving Parallel Hybrid Plasma Solver
Holmstrom, M
2010-01-01
We investigate the performance of a hybrid plasma solver on the test problem of an ion beam. The parallel solver is based on cell centered finite differences in space, and a predictor-corrector leapfrog scheme in time. The implementation is done in the FLASH software framework. It is shown that the solver conserves energy well over time, and that the parallelization is efficient (it exhibits weak scaling).
Parallel computing and quantum chromodynamics
Bowler, K C
1999-01-01
The study of Quantum Chromodynamics (QCD) remains one of the most challenging topics in elementary particle physics. The lattice formulation of QCD, in which space-time is treated as a four-dimensional hypercubic grid of points, provides the means for a numerical solution from first principles but makes extreme demands upon computational performance. High Performance Computing (HPC) offers us the tantalising prospect of a verification of QCD through the precise reproduction of the known masses of the strongly interacting particles. It is also leading to the development of a phenomenological tool capable of disentangling strong interaction effects from weak interaction effects in the decays of one kind of quark into another, crucial for determining parameters of the standard model of particle physics. The 1980s saw the first attempts to apply parallel architecture computers to lattice QCD. The SIMD and MIMD machines used in these pioneering efforts were the ICL DAP and the Cosmic Cube, respectively. These wer...
A HYBRID GRANULARITY PARALLEL ALGORITHM FOR PRECISE INTEGRATION OF STRUCTURAL DYNAMIC RESPONSES
Institute of Scientific and Technical Information of China (English)
Yuanyin Li; Xianlong Jin; Genguo Li
2008-01-01
Precise integration methods to solve structural dynamic responses and the corresponding time integration formula are composed of two parts: the multiplication of an exponential matrix with a vector and the integration term. The second term can be solved by the series solution. Two hybrid granularity parallel algorithms are designed, that is, the exponential matrix and the first term are computed by the fine-grained parallel algorithm and the second term is computed by the coarse-grained parallel algorithm. Numerical examples show that these two hybrid granularity parallel algorithms obtain higher speedup and parallel efficiency than two existing parallel algorithms.
Parallel computing in enterprise modeling.
Energy Technology Data Exchange (ETDEWEB)
Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.
2008-08-01
This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principle makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.
Learning Parallel Computations with ParaLab
Kozinov, E.; Shtanyuk, A.
2015-01-01
In this paper, we present the ParaLab teachware system, which can be used for learning the parallel computation methods. ParaLab provides the tools for simulating the multiprocessor computational systems with various network topologies, for carrying out the computational experiments in the simulation mode, and for evaluating the efficiency of the parallel computation methods. The visual presentation of the parallel computations taking place in the computational experiments is the key feature ...
Vibration Isolation for Parallel Hydraulic Hybrid Vehicles
Directory of Open Access Journals (Sweden)
The M. Nguyen
2008-01-01
In recent decades, several types of hybrid vehicles have been developed in order to improve fuel economy and reduce pollution. Hybrid electric vehicles (HEV) have shown a significant improvement in fuel efficiency for small and medium-sized passenger vehicles and SUVs. HEV has several limitations when applied to heavy vehicles; one is that larger vehicles demand more power, which requires significantly larger battery capacities. As an alternative solution, hydraulic hybrid technology has been found effective for heavy duty vehicles because of its high power density. The mechanical batteries used in hydraulic hybrid vehicles (HHV) can be charged and discharged remarkably faster than chemical batteries. This feature is essential for heavy vehicle hybridization. One of the main problems that should be solved for the successful commercialization of HHV is the excessive noise and vibration associated with the hydraulic systems. This study focuses on using magnetorheological (MR) technology to reduce the noise and vibration transmissibility from the hydraulic system to the vehicle body. In order to study the noise and vibration of HHV, a hydraulic hybrid subsystem in parallel design is analyzed. This research shows that the MR elements play an important role in reducing the noise and vibration transmitted to the vehicle body. Additionally, the locations and orientations of the isolation system also affect the efficiency of the noise and vibration mitigation. In simulations, a skyhook control algorithm is used to achieve the highest possible effectiveness of the MR isolation system.
Communication issues in parallel computation
Energy Technology Data Exchange (ETDEWEB)
Tsantilas, A.M.
1990-01-01
This thesis examines the problem of interprocessor communication in realistic parallel computers. In particular, the author considers the problem of permutation routing and its generalizations in the mesh, hypercube and butterfly networks. Building on previous research, he derives lower bounds for a wide class of deterministic routing algorithms which imply that such algorithms create heavy traffic congestion. In contrast, he shows that randomized routing algorithms result in both efficient and optimal upper bounds in the above networks. Experiments were also performed to test the behavior of the randomized algorithms. These experiments suggest interesting theoretical problems. He also examines the problem of efficient interprocessor communication in a model suggested by recent advances in optical computing. The main argument of this thesis is that communication can be made efficient if randomization is used in the routing algorithms.
Advances in randomized parallel computing
Rajasekaran, Sanguthevar
1999-01-01
The technique of randomization has been employed to solve numerous problems of computing both sequentially and in parallel. Examples of randomized algorithms that are asymptotically better than their deterministic counterparts in solving various fundamental problems abound. Randomized algorithms have the advantages of simplicity and better performance both in theory and often in practice. This book is a collection of articles written by renowned experts in the area of randomized parallel computing. A brief introduction to randomized algorithms: In the analysis of algorithms, at least three different measures of performance can be used: the best case, the worst case, and the average case. Often, the average case run time of an algorithm is much smaller than the worst case. For instance, the worst case run time of Hoare's quicksort is O(n^2), whereas its average case run time is only O(n log n). The average case analysis is conducted with an assumption on the input space. The assumption made to arrive at t...
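The worst-case versus average-case remark about quicksort in the abstract above can be illustrated with a small Python sketch. It counts pivot comparisons for a quicksort that always takes the first element as pivot and for one that picks the pivot at random. The function name, the counting convention, and the in-place partition scheme are assumptions made for this illustration and are not taken from the book.

```python
import random

def quicksort_comparisons(items, randomized, rng=None):
    """Sort a copy of items; return (sorted_list, number_of_pivot_comparisons)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    data = list(items)
    comparisons = 0
    # An explicit stack of (lo, hi) ranges avoids deep recursion on bad inputs.
    stack = [(0, len(data) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        if randomized:
            # Move a randomly chosen element to the pivot position.
            p = rng.randint(lo, hi)
            data[lo], data[p] = data[p], data[lo]
        pivot = data[lo]
        i = lo
        for j in range(lo + 1, hi + 1):
            comparisons += 1
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        # Put the pivot between the smaller and the larger elements.
        data[lo], data[i] = data[i], data[lo]
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return data, comparisons
```

On already-sorted input of length n, the fixed-pivot variant performs n(n-1)/2 comparisons, while the randomized variant typically needs far fewer, which is the gap between the O(n^2) worst case and the O(n log n) average case.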
Classical MD calculations with parallel computers
Energy Technology Data Exchange (ETDEWEB)
Matsumoto, Mitsuhiro [Nagoya Univ. (Japan)]
1998-03-01
We have developed parallel computation codes for a classical molecular dynamics (MD) method. In order to use them on work station clusters as well as parallel super computers, we use MPI (message passing interface) library for distributed-memory type computers. Two algorithms are compared: (1) particle parallelism technique: easy to install, effective for rather small number of processors. (2) region parallelism technique: take some time to install, effective even for many nodes. (J.P.N.)
A CLASS OF NEW PARALLEL HYBRID ALGEBRAIC MULTILEVEL ITERATIONS
Institute of Scientific and Technical Information of China (English)
Zhong-zhi Bai
2001-01-01
For the large sparse system of linear equations with a symmetric positive definite block coefficient matrix resulting from a suitable finite element discretization of the second-order self-adjoint elliptic boundary value problem, we use the algebraic multilevel iteration technique and the blocked preconditioning strategy to construct preconditioning matrices with a parallel computing function for the coefficient matrix, and we set up a class of parallel hybrid algebraic multilevel iteration methods for solving this kind of system of linear equations. Theoretical analyses show that, besides being well suited to implementation on high-speed parallel multiprocessor systems, these new methods are optimal-order methods. That is to say, their convergence rates are independent of both the sizes and the levels of the constructed matrix sequence, and their computational workloads are bounded by linear functions of the order of the considered system of linear equations.
Physics codes on parallel computers
Energy Technology Data Exchange (ETDEWEB)
Eltgroth, P.G.
1985-12-04
An effort is under way to develop physics codes which realize the potential of parallel machines. A new explicit algorithm for the computation of hydrodynamics has been developed which avoids global synchronization entirely. The approach, called the Independent Time Step Method (ITSM), allows each zone to advance at its own pace, determined by local information. The method, coded in FORTRAN, has demonstrated parallelism of greater than 20 on the Denelcor HEP machine. ITSM can also be used to replace current implicit treatments of problems involving diffusion and heat conduction. Four different approaches toward work distribution have been investigated and implemented for the one-dimensional code on the Denelcor HEP. They are "self-scheduled", an ASKFOR monitor, a "queue of queues" monitor, and a distributed ASKFOR monitor. The self-scheduled approach shows the lowest overhead but the poorest speedup. The distributed ASKFOR monitor shows the best speedup and the lowest execution times on the tested problems. 2 refs., 3 figs.
Parallel Computing Using Web Servers and "Servlets".
Lo, Alfred; Bloor, Chris; Choi, Y. K.
2000-01-01
Describes parallel computing and presents inexpensive ways to implement a virtual parallel computer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…
Broadcasting a message in a parallel computer
Berg, Jeremy E [Rochester, MN; Faraj, Ahmad A [Rochester, MN
2011-08-02
Methods, systems, and products are disclosed for broadcasting a message in a parallel computer. The parallel computer includes a plurality of compute nodes connected together using a data communications network. The data communications network is optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer. One compute node of the operational group is assigned to be a logical root. Broadcasting a message in a parallel computer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.
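The idea of broadcasting along a Hamiltonian path can be sketched in Python for a single plane of a two-dimensional mesh. The abstract does not say how the path is constructed or how the root forwards the message, so the serpentine ("snake") row ordering and the hop-by-hop forwarding toward both ends of the path are assumptions of this sketch, as are the function names `snake_path` and `broadcast_along_path`.

```python
def snake_path(rows, cols):
    """Hamiltonian path over a rows x cols mesh.

    Even rows run left to right and odd rows right to left, so every pair of
    consecutive nodes on the path is a pair of direct mesh neighbors.
    """
    path = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in columns)
    return path

def broadcast_along_path(path, root, message):
    """Forward message hop by hop from root toward both ends of the path.

    Returns (received, hops): the message held by each node and the list of
    (sender, receiver) point-to-point transfers that were performed.
    """
    start = path.index(root)
    received = {root: message}
    hops = []
    for step in (1, -1):  # walk forward along the path, then backward
        i = start
        while 0 <= i + step < len(path):
            sender, receiver = path[i], path[i + step]
            received[receiver] = received[sender]
            hops.append((sender, receiver))
            i += step
    return received, hops
```

For example, `broadcast_along_path(snake_path(3, 4), (1, 2), "hello")` delivers the message to all 12 nodes using 11 neighbor-to-neighbor transfers.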
Streamline Integration using MPI-Hybrid Parallelism on a Large Multi-Core Architecture
Energy Technology Data Exchange (ETDEWEB)
Camp, David; Garth, Christoph; Childs, Hank; Pugmire, Dave; Joy, Kenneth I.
2010-11-01
Streamline computation in a very large vector field data set represents a significant challenge due to the non-local and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multicore platform. With multi-core processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize-over-seeds and parallelize-over-blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, non-hybrid distributed implementation.
ADVANCES AT A GLANCE IN PARALLEL COMPUTING
Directory of Open Access Journals (Sweden)
RAJKUMAR SHARMA
2014-07-01
In the history of computing, sequential uniprocessor computers were used for years to solve scientific and business problems. To satisfy the demands of compute- and data-hungry applications, it was observed that better response time can be achieved only through parallelism. Large computational problems were partitioned and solved by using multiple CPUs in parallel. Computing performance was further improved by adopting multi-core architectures, which provide hardware parallelism through the use of multiple cores. Efficient resource utilization of a parallel computing environment by using software and hardware parallelism is a major research challenge. Present hardware technologies give algorithm developers the freedom to control and manage resources through software, such as threads-to-cores mapping in recent multi-core processors. In this paper, a survey is presented covering the period from the beginning of parallel computing up to the present state-of-the-art multi-core processors.
Directory of Open Access Journals (Sweden)
Kamatani Naoyuki
2011-05-01
Background: The use of missing genotype imputation and haplotype reconstruction is valuable in genome-wide association studies (GWASs). By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results: We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan, and Han Chinese in Beijing, China, samples in the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions: ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.
C++ and Massively Parallel Computers
Directory of Open Access Journals (Sweden)
Daniel J. Lickly
1993-01-01
Our goal is to apply the software engineering advantages of object-oriented programming to the raw power of massively parallel architectures. To do this we have constructed a hierarchy of C++ classes to support the data-parallel paradigm. Feasibility studies and initial coding can be supported by any serial machine that has a C++ compiler. Parallel execution requires an extended Cfront, which understands the data-parallel classes and generates C* code. (C* is a data-parallel superset of ANSI C developed by Thinking Machines Corporation.) This approach provides potential portability across parallel architectures and leverages the existing compiler technology for translating data-parallel programs onto both SIMD and MIMD hardware.
ENERGY MANAGEMENT STRATEGY FOR PARALLEL HYBRID ELECTRIC VEHICLES
Institute of Scientific and Technical Information of China (English)
Pu Jinhuan; Yin Chengliang; Zhang Jianwu
2005-01-01
The energy management strategy (EMS) is the core of the real-time control algorithm of the hybrid electric vehicle (HEV). A novel EMS using the logic threshold approach with incorporation of a stand-by optimization algorithm is proposed. Its aim is to minimize the engine fuel consumption and maintain the battery state of charge (SOC) in its operation range, while satisfying the vehicle performance and drivability requirements. A hybrid powertrain bench test is carried out to collect data on the engine, motor and battery pack, which are used in the EMS to control the powertrain. A computer simulation model of the HEV is established in the MATLAB/Simulink environment according to the bench test results. Simulation results are presented for the behavior of the engine, motor and battery. The proposed EMS is implemented for a real parallel hybrid car control system and validated by vehicle field tests.
Communication issues in parallel computation
Energy Technology Data Exchange (ETDEWEB)
Newman-Wolfe, R.E.
1986-01-01
This dissertation reviews models of parallel and distributed computation, with a focus on fundamental problems of communication between cooperating processes. Multi-stage switching networks are a relatively cheap and flexible medium of communication. New results are presented on embeddings and measure appropriate to multiprocessors connected by multi-stage switching networks, showing that such networks can efficiently simulate other topologies. An algebraic approach provides insight into the structure of the set of permutations that can be realized in two passes through the butterfly network in particular. In order to simulate another topology, or indeed, communicate at all, a communication network must somehow permit sharing of memory. New algorithms for controlling access to shared variables using weaker shared variables are presented. A trade off between the amount of space and the amount of waiting is exhibited in some of the algorithms. Another measure of the efficiency of communication is the size of the connecting network. The size of the communication medium required for a universal simulator is formulated here as a graph-theoretic problem.
Template based parallel checkpointing in a massively parallel computer system
Archer, Charles Jens; Inglett, Todd Alan
2009-01-13
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
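The template comparison described in the abstract above can be sketched in Python as follows. The sketch compares fixed-offset blocks by checksum and keeps only the blocks that differ from the template. This is a deliberate simplification: the rsync protocol itself uses rolling checksums to match blocks at arbitrary offsets, and the patent's parallel variant and broadcast step are not modeled. The block size, the use of SHA-256, and all function names are hypothetical choices for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; deliberately tiny so the example is easy to follow

def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a bytes object into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def block_checksums(data, block_size=BLOCK_SIZE):
    """Checksum of every block of the template checkpoint."""
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data, block_size)]

def checkpoint_delta(node_data, template_sums, block_size=BLOCK_SIZE):
    """Return {block_index: bytes} for blocks that differ from the template."""
    delta = {}
    for idx, block in enumerate(split_blocks(node_data, block_size)):
        digest = hashlib.sha256(block).hexdigest()
        if idx >= len(template_sums) or digest != template_sums[idx]:
            delta[idx] = block
    return delta

def restore(template, delta, length, block_size=BLOCK_SIZE):
    """Rebuild a node's checkpoint of the given length from template plus delta."""
    n_blocks = (length + block_size - 1) // block_size
    blocks = split_blocks(template, block_size)[:n_blocks]
    blocks += [b""] * (n_blocks - len(blocks))
    for idx, block in delta.items():
        blocks[idx] = block
    return b"".join(blocks)[:length]
```

Only the entries of the delta need to be transmitted and stored for each node, which is where the reduction in checkpoint traffic comes from when most blocks match the template.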
Reservoir Thermal Recover Simulation on Parallel Computers
Li, Baoyan; Ma, Yuanle
The rapid development of parallel computers has provided a hardware background for massive refine reservoir simulation. However, the lack of parallel reservoir simulation software has blocked the application of parallel computers on reservoir simulation. Although a variety of parallel methods have been studied and applied to black oil, compositional, and chemical model numerical simulations, there has been limited parallel software available for reservoir simulation. Especially, the parallelization study of reservoir thermal recovery simulation has not been fully carried out, because of the complexity of its models and algorithms. The authors make use of the message passing interface (MPI) standard communication library, the domain decomposition method, the block Jacobi iteration algorithm, and the dynamic memory allocation technique to parallelize their serial thermal recovery simulation software NUMSIP, which is being used in petroleum industry in China. The parallel software PNUMSIP was tested on both IBM SP2 and Dawn 1000A distributed-memory parallel computers. The experiment results show that the parallelization of I/O has great effects on the efficiency of parallel software PNUMSIP; the data communication bandwidth is also an important factor, which has an influence on software efficiency. Keywords: domain decomposition method, block Jacobi iteration algorithm, reservoir thermal recovery simulation, distributed-memory parallel computer
Compiler Technology for Parallel Scientific Computation
Directory of Open Access Journals (Sweden)
Can Özturan
1994-01-01
There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL). Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.
Streamline integration using MPI-hybrid parallelism on a large multicore architecture.
Camp, David; Garth, Christoph; Childs, Hank; Pugmire, Dave; Joy, Kenneth I
2011-11-01
Streamline computation in a very large vector field data set represents a significant challenge due to the nonlocal and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multicore platform. With multicore processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize over seeds and parallelize over blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, nonhybrid distributed implementation.
Application of parallel computing to robot dynamics
Schäfer, Peter; Schiehlen, Werner
1993-01-01
In this paper an approach for the application of parallel processing to the dynamic analysis of robots based on the multibody system method is presented. The inherent structure of the symbolic equations of motion is used for partitioning those into independent modules for concurrent evaluation. The applied strategies for parallelization include the parallel evaluation of subsystem equations and the parallel computation of the inertia matrix along with its factorization, and of the force vecto...
Remarks on parallel computations in MATLAB environment
Opalska, Katarzyna; Opalski, Leszek
2013-10-01
The paper attempts to summarize the authors' investigation of the parallel computation capability of the MATLAB environment in solving large systems of ordinary differential equations (ODEs). Two MATLAB versions and two parallelization techniques were tested: one used multiple processor cores, the other CUDA-compatible Graphics Processing Units (GPUs). A set of parameterized test problems was specially designed to expose different capabilities and limitations of the different variants of the parallel computation environment tested. The presented results clearly illustrate the superiority of the newer MATLAB version and the elapsed-time advantage of GPU-parallelized computations over multiple processor cores for problems of large dimensionality (with a speed-up factor strongly dependent on the problem structure).
A Parallel Genetic Simulated Annealing Hybrid Algorithm for Task Scheduling
Institute of Scientific and Technical Information of China (English)
SHU Wanneng; ZHENG Shijue
2006-01-01
This paper combines the advantages of the genetic algorithm and simulated annealing, puts forward a parallel genetic simulated annealing hybrid algorithm (PGSAHA), and applies it to the task scheduling problem in grid computing. The algorithm first generates a new group of individuals through genetic operations such as reproduction, crossover, and mutation, and then anneals all the generated individuals independently. When the temperature in the cooling process no longer falls, the result is taken as the overall optimal solution. Analysis and experimental results indicate that this algorithm is superior to both the genetic algorithm and simulated annealing.
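The two-stage loop described in the abstract above (genetic operations followed by independent annealing of each individual, under a falling temperature) can be sketched in Python. This is not the authors' PGSAHA: the makespan cost function, the one-point crossover, the mutation rate, the cooling factor, and every name and default value below are assumptions chosen for illustration, and the annealing step runs serially here although each individual could be annealed on a separate processor.

```python
import math
import random

def makespan(assign, durations, n_machines):
    """Finishing time of the busiest machine for a task-to-machine assignment."""
    loads = [0.0] * n_machines
    for task, machine in enumerate(assign):
        loads[machine] += durations[task]
    return max(loads)

def anneal(assign, durations, n_machines, temp, steps, rng):
    """Simulated annealing moves on one individual at a fixed temperature."""
    current = list(assign)
    cost = makespan(current, durations, n_machines)
    for _ in range(steps):
        cand = list(current)
        cand[rng.randrange(len(cand))] = rng.randrange(n_machines)
        cand_cost = makespan(cand, durations, n_machines)
        delta = cand_cost - cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cost = cand, cand_cost
    return current

def genetic_annealing(durations, n_machines, pop_size=20, generations=40,
                      temp=10.0, cooling=0.9, seed=0):
    """Return (best_assignment, best_makespan) found by the hybrid loop."""
    rng = random.Random(seed)
    n = len(durations)  # assumes at least two tasks

    def cost(a):
        return makespan(a, durations, n_machines)

    pop = [[rng.randrange(n_machines) for _ in range(n)] for _ in range(pop_size)]
    best = min(pop, key=cost)
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:pop_size // 2]      # reproduction: keep the fitter half
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)      # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:         # mutation
                child[rng.randrange(n)] = rng.randrange(n_machines)
            children.append(child)
        # Anneal every generated individual independently of the others.
        pop = [anneal(c, durations, n_machines, temp, 10, rng) for c in children]
        best = min(pop + [best], key=cost)
        temp *= cooling                    # cooling schedule
    return best, cost(best)
```

A call such as `genetic_annealing([5, 3, 8, 2, 7, 4, 6, 1], 3)` returns an assignment of the eight tasks to three machines together with its makespan, which can never be smaller than the total work divided by the number of machines.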
Hybrid Parallel Bidirectional Sieve based on SMP Cluster
Liao, Gang; Liu, Lei
2012-01-01
In this article, a hybrid parallel bidirectional sieve method is implemented on an SMP cluster, in which the individual computational units, joined together by the communication network, are usually shared-memory systems with one or more multicore processors. For efficiency, we propose dividing the data evenly among the nodes and generating double-ended queues (deques) for the sieve method, so that two cores can simultaneously sift out primes starting from the head and the tail. Each node also creates a FIFO queue as a dynamic data buffer to cache temporary data sent from other nodes. The approach obtains a large speedup and high efficiency on the SMP cluster.
Massively parallel evolutionary computation on GPGPUs
Tsutsui, Shigeyoshi
2013-01-01
Evolutionary algorithms (EAs) are metaheuristics that learn from natural collective behavior and are applied to solve optimization problems in domains such as scheduling, engineering, bioinformatics, and finance. Such applications demand acceptable solutions with high-speed execution using finite computational resources. Therefore, there have been many attempts to develop platforms for running parallel EAs using multicore machines, massively parallel cluster machines, or grid computing environments. Recent advances in general-purpose computing on graphics processing units (GPGPU) have opened u
Fuel optimal control of parallel hybrid electric vehicles
Institute of Scientific and Technical Information of China (English)
Jinhuan PU; Chenliang YIN; Jianwu ZHANG
2008-01-01
A mathematical model for fuel optimal control and its corresponding dynamic programming (DP) recursive equation were established for an existing parallel hybrid electric vehicle (HEV). Two augmented cost functions for gear shifting and engine stop-starting were designed to limit their frequency. To overcome the problem of numerical DP dimensionality, an algorithm to restrict the exploring region was proposed. The algorithm significantly reduced the computational complexity. The system model was converted into real-time simulation code by using MATLAB/RTW to improve computation efficiency. Comparison between the results of a chassis dynamometer test, simulation, and DP proves that the proposed method can compute the performance limitation of the HEV within an acceptable time period and can be used to evaluate and optimize the control strategy.
Hybrid Parallelism for Volume Rendering on Large, Multi- and Many-core Systems
Energy Technology Data Exchange (ETDEWEB)
Howison, Mark; Bethel, E. Wes; Childs, Hank
2011-01-01
With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
Collectively loading an application in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Miller, Samuel J.; Mundy, Michael B.
2016-01-05
Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
A Survey of Parallel Computing
1988-07-01
11 is being developed as a more powerful restructuring compiler that will produce executable code. It will be a multilanguage compiler allowing...IL 61821-8149. JOURNALS AND BOOKS. IEEE Computer: This journal is published by the Computer Society of the Institute of...Electrical and Electronics Engineers (IEEE). An annual subscription is included in society member dues. Computer Architecture News: This journal is
Institute of Scientific and Technical Information of China (English)
迟利华; 刘杰; 龚春叶; 徐涵; 蒋杰; 胡庆丰
2009-01-01
The parallel performance of solving the multi-group particle transport (Sn) equations on unstructured meshes is analyzed. Adapting to the characteristics of multi-core cluster systems, this paper designs an MPI/OpenMP hybrid parallel code. The spatial mesh is partitioned by domain decomposition, and MPI is used for communication between the multi-core CPU nodes. When an MPI process begins to compute the variables of the energy groups, it forks several OpenMP threads, which compute the groups simultaneously within the same multi-core CPU node. Using the MPI/OpenMP hybrid parallel code, we solve a 2D multi-group particle transport equation on a cluster with multi-core CPU nodes. The results show that the hybrid code matches the hardware structure of multi-core clusters well, has good scalability, and can be scaled to 1024 CPU cores.
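The two-level structure described here (domain decomposition across processes, threads over energy groups inside each process) can be mimicked in plain Python. This is a structural sketch only: the outer loop stands in for MPI ranks, `ThreadPoolExecutor` stands in for an OpenMP parallel region, and `sweep_group` is a hypothetical placeholder kernel, not the transport solver from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def sweep_group(cells, group):
    """Placeholder per-group kernel (hypothetical work, not a transport sweep)."""
    return sum(c * (group + 1) for c in cells)


def process_subdomain(cells, n_groups, n_threads):
    """One simulated MPI process: fork threads over the energy groups."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda g: sweep_group(cells, g), range(n_groups)))


def hybrid_solve(mesh, n_procs, n_groups, n_threads=4):
    """Outer level: split the mesh into subdomains; inner level: threads per group."""
    # Simple strided decomposition of the mesh cells (an assumption for brevity).
    subdomains = [mesh[r::n_procs] for r in range(n_procs)]
    partial = [process_subdomain(s, n_groups, n_threads) for s in subdomains]
    # Stand-in for an MPI reduction: sum each group's result over all subdomains.
    return [sum(p[g] for p in partial) for g in range(n_groups)]
```

Python threads do not give a real speedup for CPU-bound work, so the sketch shows only how the two levels nest, not their performance.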
Parallel computation with the spectral element method
Energy Technology Data Exchange (ETDEWEB)
Ma, Hong
1995-12-01
Spectral element models for the shallow water equations and the Navier-Stokes equations have been successfully implemented on a data parallel supercomputer, the Connection Machine model CM-5. The nonstaggered grid formulations for both models are described and are shown to be especially efficient in a data parallel computing environment.
Massively Parallel Computing: A Sandia Perspective
Energy Technology Data Exchange (ETDEWEB)
Dosanjh, Sudip S.; Greenberg, David S.; Hendrickson, Bruce; Heroux, Michael A.; Plimpton, Steve J.; Tomkins, James L.; Womble, David E.
1999-05-06
The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is beginning to realize its potential for enabling significant breakthroughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.
Parallel Computing of Ocean General Circulation Model
Institute of Scientific and Technical Information of China (English)
(no author listed)
2001-01-01
This paper discusses the parallel computing of the third-generation Ocean General Circulation Model (OGCM) from the State Key Laboratory of Numerical Modeling for Atmospheric Science and Geophysical Fluid Dynamics (LASG), Institute of Atmospheric Physics (IAP). Several optimization strategies for parallel computing of the OGCM (POGCM) on a Scalable Shared Memory Multiprocessor (S2MP) are also presented. Using the Message Passing Interface (MPI), we obtain superlinear speedup on the SGI Origin 2000 for the parallel OGCM (POGCM) after optimization.
Structured Parallel Programming Patterns for Efficient Computation
McCool, Michael; Robison, Arch
2012-01-01
Programming is now parallel programming. Much as structured programming revolutionized traditional serial programming decades ago, a new kind of structured programming, based on patterns, is relevant to parallel programming today. Parallel computing experts and industry insiders Michael McCool, Arch Robison, and James Reinders describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give detailed concrete examples using multiple programming models. Examples are primarily given using two of th
Hybrid Parallel Contour Trees, Version 1.0
Energy Technology Data Exchange (ETDEWEB)
2017-01-03
A common operation in scientific visualization is to compute and render a contour of a data set. Given a function of the form f : R^d -> R, a level set is defined as an inverse image f^-1(h) for an isovalue h, and a contour is a single connected component of a level set. The Reeb graph can then be defined to be the result of contracting each contour to a single point, and is well defined for Euclidean spaces or for general manifolds. For simple domains, the graph is guaranteed to be a tree, and is called the contour tree. Analysis can then be performed on the contour tree in order to identify isovalues of particular interest, based on various metrics, and render the corresponding contours, without having to know such isovalues a priori. This code is intended to be the first data-parallel algorithm for computing contour trees. Our implementation will use the portable data-parallel primitives provided by Nvidia’s Thrust library, allowing us to compile our same code for both GPUs and multi-core CPUs. Native OpenMP and purely serial versions of the code will likely also be included. It will also be extended to provide a hybrid data-parallel / distributed algorithm, allowing scaling beyond a single GPU or CPU.
Hybridity in Embedded Computing Systems
Institute of Scientific and Technical Information of China (English)
虞慧群; 孙永强
1996-01-01
An embedded system is one in which a computer is used as a component of a larger device. In this paper, we study hybridity in embedded systems and present an interval-based temporal logic to express and reason about the hybrid properties of such systems.
Structured grid generator on parallel computers
Energy Technology Data Exchange (ETDEWEB)
Muramatsu, Kazuhiro; Murakami, Hiroyuki; Higashida, Akihiro; Yanagisawa, Ichiro
1997-03-01
A general purpose structured grid generator on parallel computers, which generates a large-scale structured grid efficiently, has been developed. The generator is applicable to Cartesian, cylindrical and BFC (Boundary-Fitted Curvilinear) coordinates. For BFC grids, three topologies are available: L-type, O-type and multi-block type, the last of which enables any combination of L- and O-grids. Internal BFC grid points can be automatically generated and smoothed by either an algebraic supplemental method or a partial differential equation method. The partial differential equation solver is implemented on parallel computers because it consumes a large portion of the overall execution time. High-speed processing of large-scale grid generation can therefore be realized by using a parallel computer. The generated grid data can be adapted to domain decomposition for parallel analysis. (author)
Computer-Aided Parallelizer and Optimizer
Jin, Haoqiang
2011-01-01
The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
Parallel Algorithms for Computer Vision.
1989-01-01
developed algorithms for several early vision processes, such as edge detection, stere...stage at which they are used, for example by a navigation...system operates by receiving a stream of instructions from its front end computer. A microcontroller receives the instructions, expands each of them...instructions flow into the Connection Machine hardware from the front end. These macro-instructions are sent to a microcontroller, which expands them
Verifiable Computation with Massively Parallel Interactive Proofs
Thaler, Justin; Mitzenmacher, Michael; Pfister, Hanspeter
2012-01-01
As the cloud computing paradigm has gained prominence, the need for verifiable computation has grown increasingly urgent. The concept of verifiable computation enables a weak client to outsource difficult computations to a powerful, but untrusted, server. Protocols for verifiable computation aim to provide the client with a guarantee that the server performed the requested computations correctly, without requiring the client to perform the computations herself. By design, these protocols impose a minimal computational burden on the client. However, existing protocols require the server to perform a large amount of extra bookkeeping in order to enable a client to easily verify the results. Verifiable computation has thus remained a theoretical curiosity, and protocols for it have not been implemented in real cloud computing systems. Our goal is to leverage GPUs to reduce the server-side slowdown for verifiable computation. To this end, we identify abundant data parallelism in a state-of-the-art general-purpose...
Graph Partitioning Models for Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Hendrickson, B.; Kolda, T.G.
1999-03-02
Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility. We survey several recently proposed alternatives and discuss their relative merits.
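The claim that the standard methodology minimizes the wrong metric can be made concrete by comparing edge cut with communication volume. The sketch below uses one common definition of communication volume (each vertex sends its value once to every other part that contains one of its neighbours). Both function names are illustrative and are not taken from the survey.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def communication_volume(edges, part):
    """Count (vertex, remote part) pairs: one transfer per pair."""
    remote_parts = {}
    for u, v in edges:
        if part[u] != part[v]:
            # u's value is needed in v's part, and vice versa.
            remote_parts.setdefault(u, set()).add(part[v])
            remote_parts.setdefault(v, set()).add(part[u])
    return sum(len(parts) for parts in remote_parts.values())


# A star whose centre sits alone in part 0: three cut edges.
star_edges = [(0, 1), (0, 2), (0, 3)]
star_part = {0: 0, 1: 1, 2: 1, 3: 1}

# Three disjoint edges, each crossing the partition: also three cut edges.
pair_edges = [(0, 1), (2, 3), (4, 5)]
pair_part = {0: 0, 2: 0, 4: 0, 1: 1, 3: 1, 5: 1}
```

Both partitions have an edge cut of 3, yet under this definition the star needs 4 transfers and the disjoint pairs need 6, so equal edge cuts do not imply equal communication.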
Parallel FFT Algorithm on Computer Clusters
Institute of Scientific and Technical Information of China (English)
(no author listed)
2005-01-01
The DFT is widely applied in signal processing and other fields. Most current fast algorithms either assume parallel computers connected by particular networks such as the butterfly or hypercube, or assume instantaneous transmission, conflict-free communication, fully connected parallel processors, and an unlimited number of available processors. However, the communication delay in the transmission system cannot be ignored. This paper addresses the following aspects: instant transmission, task dispatching, and the path of information through the communication links in computer cluster systems; and the layout of the dynamic FFT algorithm under different computer cluster structures.
Internode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Internode data communications in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.
2013-09-03
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Link failure detection in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Megerian, Mark G.; Smith, Brian E.
2010-11-09
Methods, apparatus, and products are disclosed for link failure detection in a parallel computer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.
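The grouping rule in this abstract, where adjacent mesh nodes always land in different groups, is what a checkerboard colouring provides. The following is a minimal simulation of the described first pass on a 2D mesh. The parity rule, the `broken` set of simulated faulty links, and all function names are assumptions made for illustration and are not taken from the patent.

```python
def group_of(x, y):
    """Checkerboard colouring: adjacent mesh nodes get different groups."""
    return (x + y) % 2  # 0 = first group, 1 = second group


def neighbours(x, y, width, height):
    """Yield the in-bounds neighbours of (x, y) in a rectangular mesh."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield nx, ny


def detect_failed_links(width, height, broken):
    """Return directed links on which a first-group test message never arrived.

    `broken` is a hypothetical set of (sender, receiver) pairs simulating faults.
    """
    received = set()
    for x in range(width):
        for y in range(height):
            if group_of(x, y) == 0:  # first group sends test messages
                for n in neighbours(x, y, width, height):
                    if ((x, y), n) not in broken:
                        received.add(((x, y), n))
    missing = []
    for x in range(width):
        for y in range(height):
            if group_of(x, y) == 1:  # second group checks what arrived
                for n in neighbours(x, y, width, height):
                    if (n, (x, y)) not in received:
                        missing.append((n, (x, y)))
    return missing
```

Only the first-group-to-second-group direction is exercised here, which matches the steps listed in the abstract. Testing the other link of each pair would need a second pass with the roles swapped.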
Locating hardware faults in a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-04-13
Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.
Wavelet-Based DFT calculations on Massively Parallel Hybrid Architectures
Genovese, Luigi
2011-03-01
In this contribution, we present an implementation of a full DFT code that can run on massively parallel hybrid CPU-GPU clusters. Our implementation is based on modern GPU architectures which support double-precision floating-point numbers. This DFT code, named BigDFT, is delivered under the GNU GPL license either in a stand-alone version or integrated in the ABINIT software package. Hybrid BigDFT routines were initially ported with NVIDIA's CUDA language, and recently more functionality has been added with new routines written within the Khronos OpenCL standard. The formalism of this code is based on Daubechies wavelets, which form a systematic real-space basis set. As we will see in the presentation, the properties of this basis set are well suited to extension to a GPU-accelerated environment. In addition to focusing on the implementation of the operators of the BigDFT code, this presentation also covers the use of GPU resources in a complex code with different kinds of operations. The interest of the present and expected performance of hybrid-architecture computation in the framework of electronic structure calculations is also discussed.
Parallel visualization on leadership computing resources
Energy Technology Data Exchange (ETDEWEB)
Peterka, T; Ross, R B [Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439 (United States); Shen, H-W [Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210 (United States); Ma, K-L [Department of Computer Science, University of California at Davis, Davis, CA 95616 (United States); Kendall, W [Department of Electrical Engineering and Computer Science, University of Tennessee at Knoxville, Knoxville, TN 37996 (United States); Yu, H, E-mail: tpeterka@mcs.anl.go [Sandia National Laboratories, California, Livermore, CA 94551 (United States)
2009-07-01
Changes are needed in the way that visualization is performed, if we expect the analysis of scientific data to be effective at the petascale and beyond. By using techniques similar to those used to parallelize simulations, such as parallel I/O, load balancing, and effective use of interprocess communication, the supercomputers that compute these datasets can also serve as analysis and visualization engines for them. Our team is assessing the feasibility of performing parallel scientific visualization on some of the most powerful computational resources of the U.S. Department of Energy's National Laboratories in order to pave the way for analyzing the next generation of computational results. This paper highlights some of the conclusions of that research.
Primitive parallel operations for computational linear algebra
Energy Technology Data Exchange (ETDEWEB)
Panetta, J.
1985-01-01
This work is a small step in the direction of code portability over parallel and vector machines. The proposal consists of a style of programming and a set of parallel operators built over abstract data types. Objects and operators are directed to the Computational Linear Algebra area, although the principles of the proposal can be applied to any other area. A subset of the operators was implemented on a 64-processor, distributed memory MIMD machine, and the results are that computationally intensive operators achieve asymptotically optimal speed-ups, but data movement operators are inefficient, some even intrinsically sequential.
A Computing Platform for Parallel Sparse Matrix Computations
2016-01-05
This grant enabled the purchase of an Intel multiprocessor consisting of eight multicore nodes interconnected via an InfiniBand. Each node contains 24 cores. This parallel computing platform has been used by my research group in the early stages of developing large... Two classes of parallel solvers have been developed. The first is a family of parallel sparse...
Arkin, Ethem; Tekinerdogan, Bedir; Imre, Kayhan M.
2017-01-01
The need for high-performance computing together with the increasing trend from single processor to parallel computer architectures has leveraged the adoption of parallel computing. To benefit from parallel computing power, usually parallel algorithms are defined that can be mapped and executed
Arkin, Ethem; Tekinerdogan, Bedir; Imre, Kayhan M.
2016-01-01
The need for high-performance computing together with the increasing trend from single processor to parallel computer architectures has leveraged the adoption of parallel computing. To benefit from parallel computing power, usually parallel algorithms are defined that can be mapped and executed o
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2014-08-12
Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Electric Propulsion Plume Simulations Using Parallel Computer
Directory of Open Access Journals (Sweden)
Joseph Wang
2007-01-01
A parallel, three-dimensional electrostatic PIC code is developed for large-scale electric propulsion simulations using parallel supercomputers. This code uses a newly developed immersed-finite-element particle-in-cell (IFE-PIC) algorithm designed to handle complex boundary conditions accurately while maintaining the computational speed of the standard PIC code. Domain decomposition is used in both the field solve and the particle push to divide the computation among processors. Two simulation studies are presented to demonstrate the capability of the code. The first is a full particle simulation of the near-thruster plume using the real ion-to-electron mass ratio. The second is a high-resolution simulation of multiple ion thruster plume interactions for a realistic spacecraft using a domain enclosing the entire solar array panel. Performance benchmarks show that the IFE-PIC code achieves a high parallel efficiency of ≥ 90%.
Parallel Computing Properties of Tail Copolymer Chain
Directory of Open Access Journals (Sweden)
Hong Li
2013-08-01
The properties of an AB diblock copolymer chain are calculated by Monte Carlo methods. A monomer of type A in contact with the surface has an adsorption energy E = -1, and a monomer of type B has E = 0. The polymer chain is simulated as a self-avoiding walk on a simple cubic lattice. The adsorption properties and the conformation properties of the polymer chain are computed using the message passing interface (MPI). Because independent samples are computed in parallel, the speedup is close to linear.
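Independent-sample Monte Carlo of this kind parallelizes naturally because each sample needs only its own random seed. The sketch below grows a self-avoiding walk on a simple cubic lattice by plain rejection sampling and scores the adsorption energy. The assumptions are that the chain starts on the surface z = 0, that the A block comes first, and that "contact" means a monomer at z = 0. The function names and the `map_fn` hook are hypothetical, and the paper's actual sampling scheme and MPI layout are not given in the abstract.

```python
import random
from functools import partial

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def sample_energy(seed, n_a, n_b, max_tries=100_000):
    """Return the adsorption energy of one self-avoiding walk above z = 0."""
    rng = random.Random(seed)  # a private generator makes each sample independent
    n = n_a + n_b
    for _ in range(max_tries):
        pos = (0, 0, 0)  # assumed: the first monomer sits on the surface
        chain, visited = [pos], {pos}
        while len(chain) < n:
            dx, dy, dz = rng.choice(STEPS)
            nxt = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
            if nxt[2] < 0 or nxt in visited:
                break  # reject the whole walk and start again
            chain.append(nxt)
            visited.add(nxt)
            pos = nxt
        else:
            # Each A monomer on the surface contributes E = -1; B contributes 0.
            return -sum(1 for (_, _, z) in chain[:n_a] if z == 0)
    raise RuntimeError("no self-avoiding walk found within max_tries")


def mean_energy(seeds, n_a, n_b, map_fn=map):
    """Average over independent samples; map_fn may be a parallel map."""
    worker = partial(sample_energy, n_a=n_a, n_b=n_b)
    energies = list(map_fn(worker, seeds))
    return sum(energies) / len(energies)
```

Passing the `map` method of a process pool as `map_fn` would distribute the seeds across workers. Since the samples never communicate, the near-linear speedup reported in the abstract is what one would expect from this structure.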
Contributions to computational stereology and parallel programming
DEFF Research Database (Denmark)
Rasmusson, Allan
rotator, even without the need for isotropic sections. To meet the need for computational power to perform image restoration of virtual tissue sections, parallel programming on GPUs has also been part of the project. This has led to a significant change in paradigm for a previously developed surgical...
Software development strategies for parallel computer architectures
Gruber, Ralf; Cooper, W. Anthony; Beniston, Martin; Gengler, Marc; Merazzi, Silvio
1991-09-01
As pragmatic users of high performance supercomputers, we believe that present-day parallel computer architectures with distributed memories are not yet mature enough to be used by a wide range of application engineers. A big effort should be made to bring these very promising computers closer to the users. One major flaw of massively parallel machines is that the programmer has to manage the data flow himself, and that data flow often differs between parallel computers. To overcome this problem, we propose that data structures be standardized. The data base can then become an integrated part of the system, and the data flow for a given algorithm can be easily prescribed. Fixing data structures forces the computer manufacturer to adapt his machine to the user's demands, rather than, as happens now, forcing the user to adapt to the innovative computer science approach of the manufacturer. In this paper, we present the data standards chosen for our ASTRID programming platform for research scientists and engineers, as well as a plasma physics application which won the Cray Gigaflop Performance Awards in 1989 and 1990 and which was successfully ported to an INTEL iPSC/2 hypercube.
Hou, Zhen-Long; Wei, Xiao-Hui; Huang, Da-Nian; Sun, Xu
2015-09-01
We apply reweighted inversion focusing to full tensor gravity gradiometry data using message-passing interface (MPI) and compute unified device architecture (CUDA) parallel computing algorithms, and then combine MPI with CUDA to formulate a hybrid algorithm. Parallel computing performance metrics are introduced to analyze and compare the performance of the algorithms. We summarize the rules for the performance evaluation of parallel algorithms. We use model and real data from the Vinton salt dome to test the algorithms. We find a good match between model and real density data, and verify the high efficiency and feasibility of parallel computing algorithms in the inversion of full tensor gravity gradiometry data.
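The abstract does not say which performance metrics it introduces. The two most widely used ones, speedup and parallel efficiency, are sketched below under their standard textbook definitions. The function names and the timing values in the usage note are illustrative only and are not results from the paper.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_workers):
    """Parallel efficiency E = S / p, where p is the number of workers."""
    return speedup(t_serial, t_parallel) / n_workers
```

For example, a run that takes 120 s serially and 15 s on 16 workers has a speedup of 8 and an efficiency of 0.5. For a hybrid MPI and CUDA code, what counts as a "worker" (a process, a GPU, or a GPU thread block) is a modelling choice that should be stated alongside the numbers.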
MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems
Energy Technology Data Exchange (ETDEWEB)
Howison, Mark; Bethel, E. Wes; Childs, Hank
2010-03-20
This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.
Hybrid Parallelism for Volume Rendering on Large, Multi-core Systems
Howison, M.; Bethel, E. W.; Childs, H.
2011-10-01
This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering - a staple visualization algorithm - on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today, as well as processors capable of running hundreds of concurrent threads (GPUs), we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.
Implementing hybrid MPI/OpenMP parallelism in Fluidity
Gorman, Gerard; Lange, Michael; Avdis, Alexandros; Guo, Xiaohu; Mitchell, Lawrence; Weiland, Michele
2014-05-01
Parallelising finite element codes using domain decomposition methods and MPI has nearly become routine at the application code level. This has been helped in no small part by the development of an eco-system of open source libraries to provide key functionality, for example SCOTCH for graph partitioning or PETSc for sparse iterative solvers. As we move to an era where pure MPI no longer suffices, application developers cannot focus only on the application code, but must consider the full software stack. In the case of Fluidity (an open source control volume/finite element general purpose fluid dynamics code), with the decision to improve parallel efficiency by moving to a hybrid MPI/OpenMP programming model, it became necessary to get involved in extending 3rd party open source libraries, specifically PETSc, in addition to the application code itself. The effort involved in re-engineering a large application code highlights the fact that as computing platforms continue their advance towards low power many core processors, the software stack must also develop at a similar pace or application codes will suffer. In this presentation we will illustrate the steps required to re-engineer Fluidity to achieve good parallel efficiency when using MPI/OpenMP. We identify performance pitfalls when using Fortran features such as automatic arrays in a multi-threaded context, as well as poor data locality on NUMA platforms. A significant proportion of the computational cost is in the sparse iterative solvers. For this we collaborated with the development team at Argonne National Laboratory to add OpenMP support to PETSc. We will present performance results for both the application as a whole, as well as for key individual components such as matrix assembly and the solvers. We also show that while we did not explicitly target I/O for optimisation here, its performance is nonetheless greatly improved because of fewer processes accessing the file system. One of the main remaining
Parallel peak pruning for scalable SMP contour tree computation
Energy Technology Data Exchange (ETDEWEB)
Carr, Hamish A. [Univ. of Leeds (United Kingdom); Weber, Gunther H. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Davis, CA (United States); Sewell, Christopher M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Ahrens, James P. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2017-03-09
As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. In this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.
Intranode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E
2014-01-07
Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.
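The buffering scheme in the abstract above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the class name ComputeNode, its methods, and the use of in-process deques in place of a real shared-memory region are all hypothetical and are not taken from the patent.

```python
from collections import deque


class ComputeNode:
    """Toy stand-in for the shared-memory region of one compute node (hypothetical)."""

    def __init__(self, expected_processes):
        # The first process creates one message buffer per process that is
        # expected to be initialized on this node.
        self.buffers = {rank: deque() for rank in range(expected_processes)}
        self.initialized = set()

    def send(self, dest_rank, message):
        # The sender does not check whether dest_rank has been initialized;
        # it simply stores the message in the destination's buffer.
        self.buffers[dest_rank].append(message)

    def initialize(self, rank):
        # On start-up, a process obtains a reference (the "pointer") to its
        # own buffer and drains whatever was deposited before it existed.
        self.initialized.add(rank)
        buffer = self.buffers[rank]
        received = []
        while buffer:
            received.append(buffer.popleft())
        return received


node = ComputeNode(expected_processes=2)
node.initialized.add(0)          # process 0 plays the role of the first process
node.send(1, "hello from 0")     # process 1 has not been initialized yet
early_messages = node.initialize(1)
print(early_messages)
```

The point of the design is that the sender never blocks on, or even queries, the state of the receiver; ordering is preserved because each buffer is a FIFO queue.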
Parallel computing in atmospheric chemistry models
Energy Technology Data Exchange (ETDEWEB)
Rotman, D. [Lawrence Livermore National Lab., CA (United States). Atmospheric Sciences Div.
1996-02-01
Studies of atmospheric chemistry are of high scientific interest, involve computations that are complex and intense, and require enormous amounts of I/O. Current supercomputer computational capabilities are limiting the studies of stratospheric and tropospheric chemistry and will certainly not be able to handle the upcoming coupled chemistry/climate models. To enable such calculations, the authors have developed a computing framework that allows computations on a wide range of computational platforms, including massively parallel machines. Because of the fast paced changes in this field, the modeling framework and scientific modules have been developed to be highly portable and efficient. Here, the authors present the important features of the framework and focus on the atmospheric chemistry module, named IMPACT, and its capabilities. Applications of IMPACT to aircraft studies will be presented.
Advanced Hybrid Computer Systems. Software Technology.
This software technology final report evaluates advances made in Advanced Hybrid Computer System software technology. The report describes what...automatic patching software is available as well as which analog/hybrid programming languages would be most feasible for the Advanced Hybrid Computer...compiler software. The problem of how software would interface with the hybrid system is also presented.
Accelerating Climate Simulations Through Hybrid Computing
Zhou, Shujia; Sinno, Scott; Cruz, Carlos; Purcell, Mark
2009-01-01
Unconventional multi-core processors (e.g., IBM Cell B/E and NVIDIA GPU) have emerged as accelerators in climate simulation. However, climate models typically run on parallel computers with conventional processors (e.g., Intel and AMD) using MPI. Connecting accelerators to this architecture efficiently and easily becomes a critical issue. When using MPI for connection, we identified two challenges: (1) an identical MPI implementation is required in both systems, and (2) existing MPI code must be modified to accommodate the accelerators. In response, we have extended and deployed IBM Dynamic Application Virtualization (DAV) in a hybrid computing prototype system (one blade with two Intel quad-core processors, two IBM QS22 Cell blades, connected with Infiniband), allowing for seamlessly offloading compute-intensive functions to remote, heterogeneous accelerators in a scalable, load-balanced manner. Currently, a climate solar radiation model running with multiple MPI processes has been offloaded to multiple Cell blades with approximately 10% network overhead.
A hybrid parallel framework for the cellular Potts model simulations
Energy Technology Data Exchange (ETDEWEB)
Jiang, Yi [Los Alamos National Laboratory; He, Kejing [SOUTH CHINA UNIV; Dong, Shoubin [SOUTH CHINA UNIV
2009-01-01
The Cellular Potts Model (CPM) has been widely used for biological simulations. However, most current implementations are either sequential or approximated, and cannot be used for large scale complex 3D simulation. In this paper we present a hybrid parallel framework for CPM simulations. The time-consuming PDE solving, cell division, and cell reaction operations are distributed to clusters using the Message Passing Interface (MPI). The Monte Carlo lattice update is parallelized on shared-memory SMP systems using OpenMP. Because the Monte Carlo lattice update is much faster than the PDE solving and SMP systems are more and more common, this hybrid approach achieves good performance and high accuracy at the same time. Based on the parallel Cellular Potts Model, we studied avascular tumor growth using a multiscale model. The application and performance analysis show that the hybrid parallel framework is quite efficient. The hybrid parallel CPM can be used for the large scale simulation (~10^8 sites) of complex collective behavior of numerous cells (~10^6).
Efficient Parallel Engineering Computing on Linux Workstations
Lou, John Z.
2010-01-01
A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallel computing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).
HiPPI-based parallel computing
Jung, Charles C.
1993-02-01
The IBM Enhanced Clustered Fortran (ECF) advanced technology project combines parallel computing technology with a HiPPI-based LAN network. The ECF environment is a clustered, parallel computing environment which consists of IBM ES/9000 complexes and possibly other parallel machines connected by HiPPI. The ECF software, including the language processor, is independent of hardware architectures, operating systems, and the Fortran compiler and runtime library. The ECF software is highly portable because it is based on well-known, standard technology and transport protocols such as Remote Procedure Call (RPC), X/Open Transport Interface (XTI), and TCP/IP. The ECF software is transport-independent, and can accommodate other transport protocols concurrently. This paper describes the IBM ECF environment including the language extensions, the programming model, and the software layers and components. Also, this paper explains how to achieve portability and scalability. Lastly, this paper describes how effective task communication is accomplished in ECF through RPC, XTI, TCP/IP, and a customized enhancement over HiPPI. An analysis of network performance in terms of bottleneck conditions is presented, and empirical data indicating improved throughput is provided. Comparisons to alternative methodologies and technologies are also presented.
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations
Energy Technology Data Exchange (ETDEWEB)
Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu [Department of Mechanical Engineering (United States); University of California Santa Barbara, Santa Barbara, CA, 93106 (United States); Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu [Department of Mechanical Engineering (United States); University of California Santa Barbara, Santa Barbara, CA, 93106 (United States); Department of Computer Science (United States); Department of Mathematics (United States)
2016-10-01
The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
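For readers unfamiliar with the underlying method, the following is a minimal serial sketch of fast sweeping for the eikonal equation |grad u| = f on a square grid, using the standard first-order Godunov update and four alternating sweep orderings. It is not the multilevel hybrid parallel algorithm of the paper; the function name, the grid size, and the point-source setup are assumptions made for illustration.

```python
import math


def fast_sweep_eikonal(n, source, h=1.0, f=1.0, max_rounds=20, tol=1e-12):
    """Serial fast sweeping for |grad u| = f on an n x n grid with u(source) = 0."""
    inf = math.inf
    u = [[inf] * n for _ in range(n)]
    si, sj = source
    u[si][sj] = 0.0
    forward = list(range(n))
    backward = forward[::-1]
    # Four Gauss-Seidel sweep orderings, one per diagonal direction.
    orderings = [(forward, forward), (forward, backward),
                 (backward, forward), (backward, backward)]
    for _ in range(max_rounds):
        change = 0.0
        for rows, cols in orderings:
            for i in rows:
                for j in cols:
                    if (i, j) == (si, sj):
                        continue
                    # Smallest neighbour in each grid direction (upwind values).
                    a = min(u[i - 1][j] if i > 0 else inf,
                            u[i + 1][j] if i < n - 1 else inf)
                    b = min(u[i][j - 1] if j > 0 else inf,
                            u[i][j + 1] if j < n - 1 else inf)
                    lo, hi = min(a, b), max(a, b)
                    if math.isinf(lo):
                        continue  # no information has reached this point yet
                    if hi - lo >= f * h:
                        candidate = lo + f * h          # one-sided update
                    else:
                        candidate = 0.5 * (lo + hi + math.sqrt(
                            2.0 * (f * h) ** 2 - (hi - lo) ** 2))  # two-sided update
                    if candidate < u[i][j]:
                        change = max(change, u[i][j] - candidate)
                        u[i][j] = candidate
        if change <= tol:
            break
    return u


u = fast_sweep_eikonal(5, (0, 0))
print(u[0][1], u[4][4])
```

Each sweep carries a sequential dependence along its ordering, which is the property that makes parallelising the method difficult.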
COMPARISON OF PARALLEL AND SERIES HYBRID POWERTRAINS FOR TRANSIT BUS APPLICATION
Energy Technology Data Exchange (ETDEWEB)
Gao, Zhiming [ORNL; Daw, C Stuart [ORNL; Smith, David E [ORNL; Jones, Perry T [ORNL; LaClair, Tim J [ORNL; Parks, II, James E [ORNL
2016-01-01
The fuel economy and emissions of both conventional and hybrid buses equipped with emissions aftertreatment were evaluated via computational simulation for six representative city bus drive cycles. Both series and parallel configurations for the hybrid case were studied. The simulation results indicate that series hybrid buses have the greatest overall advantage in fuel economy. The series and parallel hybrid buses were predicted to produce similar CO and HC tailpipe emissions but were also predicted to have reduced NOx tailpipe emissions compared to the conventional bus in higher speed cycles. For the New York bus cycle (NYBC), which has the lowest average speed among the cycles evaluated, the series bus tailpipe emissions were somewhat higher than they were for the conventional bus, while the parallel hybrid bus had significantly lower tailpipe emissions. All three bus powertrains were found to require periodic active DPF regeneration to maintain PM control. Plug-in operation of series hybrid buses appears to offer significant fuel economy benefits and is easily employed due to the relatively large battery capacity that is typical of the series hybrid configuration.
Parallel Scientific Computing in C++ and MPI
Karniadakis, George Em; Kirby, Robert M., II
2003-06-01
This book provides a seamless approach to numerical algorithms, modern programming techniques and parallel computing. These concepts and tools are usually taught serially across different courses and different textbooks, thus obscuring the connection between them. The necessity of integrating these subjects usually comes after such courses are concluded (e.g., during a first job or a thesis project), thus forcing the student to synthesize what is perceived to be three independent subfields into one in order to produce a solution. The book includes both basic and advanced topics and places equal emphasis on the discretization of partial differential equations and on solvers. Advanced topics include wavelets, high-order methods, non-symmetric systems and parallelization of sparse systems. A CD-ROM accompanies the text.
Parallel computations in linear algebra. II
Energy Technology Data Exchange (ETDEWEB)
Faddeeva, V.N.; Faddeev, D.K.
1982-05-01
For pt.I, see Kibernetika, vol.13, no.6, p.28 (1977). Considerable effort was devoted in the surveyed period to automatic decomposition of sequential algorithms, or rather of procedures or subprograms written in the algorithmic languages ALGOL or FORTRAN. The authors do not consider this body of research; they only note that, on the one hand, the available linear algebra subprograms included in Eispack provide convenient objects for testing various approaches to automatic construction of parallel programs and, on the other, an important stage in this activity is the development of methods for fast and efficient solution of linear recurrences, which reduce to solving systems of linear equations with a band-triangular matrix (in particular, of sufficiently small width). This article reflects the penetration of the parallelism ideas into the computational methods of linear algebra in recent years. 74 references.
Electromagnetic Physics Models for Parallel Computing Architectures
Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
2016-10-01
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
Parallel Computing in Information Retrieval--An Updated Review.
Macfarlane, A.; And Others
1997-01-01
Reviews the progress of parallel computing in information retrieval (IR) and stresses the importance of the motivation in using parallel computing for text retrieval. Analyzes parallel IR systems using a classification defined by Rasmussen; describes retrieval models used in parallel information processing; and suggests areas of needed research.…
parallelnewhybrid: an R package for the parallelization of hybrid detection using newhybrids.
Wringe, Brendan F; Stanley, Ryan R E; Jeffery, Nicholas W; Anderson, Eric C; Bradbury, Ian R
2017-01-01
Hybridization among populations and species is a central theme in many areas of biology, and the study of hybridization has direct applicability to testing hypotheses about evolution, speciation and genetic recombination, as well as having conservation, legal and regulatory implications. Yet, despite being a topic of considerable interest, the identification of hybrid individuals, and quantification of the (un)certainty surrounding the identifications, remains difficult. Unlike other programs that exist to identify hybrids based on genotypic information, newhybrids is able to assign individuals to specific hybrid classes (e.g. F1, F2) because it makes use of patterns of gene inheritance within each locus, rather than just the proportions of gene inheritance within each individual. For each comparison and set of markers, multiple independent runs of each data set should be used to develop an estimate of the hybrid class assignment accuracy. The necessity of analysing multiple simulated data sets, constructed from large genomewide data sets, presents significant computational challenges. To address these challenges, we present parallelnewhybrid, an R package designed to decrease user burden when undertaking multiple newhybrids analyses. parallelnewhybrid does so by taking advantage of the parallel computational capabilities inherent in modern computers to efficiently and automatically execute separate newhybrids runs in parallel. We show that parallelization of analyses using this package affords users several-fold reductions in time over a traditional serial analysis. parallelnewhybrid consists of an example data set, a readme and three operating system-specific functions to execute parallel newhybrids analyses on each of a computer's c cores. parallelnewhybrid is freely available on the long-term software hosting site GitHub (www.github.com/bwringe/parallelnewhybrid). © 2016 John Wiley & Sons Ltd.
Parallel algorithm for computing points on a computation front hyperplane
Krasnov, M. M.
2015-01-01
A parallel algorithm for computing points on a computation front hyperplane is described. This task arises in the computation of a quantity defined on a multidimensional rectangular domain. Three-dimensional domains are usually discussed, but the material is given in the general form when the number of dimensions is at least two. When the values of a quantity at different points are internally independent (which is frequently the case), the corresponding computations are independent as well and can be performed in parallel. However, if there are internal dependences (as, for example, in the Gauss-Seidel method for systems of linear equations), then the order of scanning points of the domain is an important issue. A conventional approach in this case is to form a computation front hyperplane (a usual plane in the three-dimensional case and a line in the two-dimensional case) that moves linearly across the domain at a certain angle. At every step in the course of motion of this hyperplane, its intersection points with the domain can be treated independently and, hence, in parallel, but the steps themselves are executed sequentially. At different steps, the intersection of the hyperplane with the entire domain can have a rather complex geometry and the search for all points of the domain lying on the hyperplane at a given step is a nontrivial problem. This problem (i.e., the computation of the coordinates of points lying in the intersection of the domain with the hyperplane at a given step in the course of hyperplane motion) is addressed below. The computations over the points of the hyperplane can be executed in parallel.
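As a concrete illustration of the problem statement above, the sketch below enumerates the grid points of a rectangular domain that lie on the hyperplane where the index coordinates sum to a given step. This covers only the simplest front orientation and is a plain recursive enumeration, not the algorithm of the paper; the function name front_points and the example shape are assumptions.

```python
def front_points(shape, step):
    """Yield index tuples inside `shape` whose coordinates sum to `step`."""

    def rec(dim, remaining, prefix):
        if dim == len(shape) - 1:
            # The last coordinate is fixed by the hyperplane equation.
            if 0 <= remaining < shape[dim]:
                yield prefix + (remaining,)
            return
        # Restrict this coordinate so that the later dimensions can still
        # absorb what is left of the sum.
        max_rest = sum(n - 1 for n in shape[dim + 1:])
        lo = max(0, remaining - max_rest)
        hi = min(shape[dim] - 1, remaining)
        for i in range(lo, hi + 1):
            yield from rec(dim + 1, remaining - i, prefix + (i,))

    yield from rec(0, step, ())


shape = (3, 2, 2)
last_step = sum(n - 1 for n in shape)
# Steps run sequentially; the points within one step are mutually independent.
fronts = [list(front_points(shape, s)) for s in range(last_step + 1)]
for s, points in enumerate(fronts):
    print(s, points)
```

In a Gauss-Seidel style computation, the outer loop over steps stays sequential while the points of each front could be handed to separate workers.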
Optimized data communications in a parallel computer
Faraj, Daniel A.
2014-08-19
A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet, from a source direction, that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine can achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
Aligning multiple protein sequences by parallel hybrid genetic algorithm.
Nguyen, Hung Dinh; Yoshihara, Ikuo; Yamamori, Kunihito; Yasunaga, Moritoshi
2002-01-01
This paper presents a parallel hybrid genetic algorithm (GA) for solving the sum-of-pairs multiple protein sequence alignment. A new chromosome representation and its corresponding genetic operators are proposed. A multi-population GENITOR-type GA is combined with local search heuristics. It is then extended to run in parallel on a multiprocessor system for speeding up. Experimental results of benchmarks from the BAliBASE show that the proposed method is superior to MSA, OMA, and SAGA methods with regard to quality of solution and running time. It can be used for finding multiple sequence alignment as well as testing cost functions.
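The sum-of-pairs objective mentioned above can be illustrated with a short Python sketch. The scoring values (match, mismatch, gap) are hypothetical placeholders chosen for readability; practical work normally uses a substitution matrix and a more elaborate gap model, and nothing here reflects the specific cost function of the paper.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment (equal-length strings, '-' for gaps) column by column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        # Every unordered pair of sequences contributes once per column.
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # gap against gap scores zero here
            elif x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


alignment = ["ACGT", "AC-T", "AGGT"]
score = sum_of_pairs(alignment)
print(score)
```

A genetic algorithm for alignment would call such a function as its fitness evaluation, which is why its cost matters for running time.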
QCMPI: A parallel environment for quantum computing
Tabakin, Frank; Juliá-Díaz, Bruno
2009-06-01
QCMPI is a quantum computer (QC) simulation package written in Fortran 90 with parallel processing capabilities. It is an accessible research tool that permits rapid evaluation of quantum algorithms for a large number of qubits and for various "noise" scenarios. The prime motivation for developing QCMPI is to facilitate numerical examination not only of how QC algorithms work, but also of noise, decoherence, and attenuation effects, and to evaluate the efficacy of error correction schemes. The present work builds on an earlier Mathematica code QDENSITY, which is mainly a pedagogic tool. In that earlier work, although the density matrix formulation was featured, the description using state vectors was also provided. In QCMPI, the stress is on state vectors, in order to employ a large number of qubits. The parallel processing feature is implemented by using the Message-Passing Interface (MPI) protocol. A description of how to spread the wave function components over many processors is provided, along with how to efficiently describe the action of general one- and two-qubit operators on these state vectors. These operators include the standard Pauli, Hadamard, CNOT and CPHASE gates and also Quantum Fourier transformation. These operators make up the actions needed in QC. Codes for Grover's search and Shor's factoring algorithms are provided as examples. A major feature of this work is that concurrent versions of the algorithms can be evaluated with each version subject to alternate noise effects, which corresponds to the idea of solving a stochastic Schrödinger equation. The density matrix for the ensemble of such noise cases is constructed using parallel distribution methods to evaluate its eigenvalues and associated entropy. Potential applications of this powerful tool include studies of the stability and correction of QC processes using Hamiltonian based dynamics. Program summary. Program title: QCMPI. Catalogue identifier: AECS_v1_0. Program summary URL
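To make the action of a one-qubit operator on a state vector concrete, here is a small serial Python sketch. It is not QCMPI code (QCMPI is written in Fortran 90 with MPI); the function name, the bit-ordering convention (qubit 0 is the least significant index bit), and the example are assumptions made for illustration.

```python
import math


def apply_one_qubit_gate(state, gate, target):
    """Apply a 2x2 `gate` to qubit `target` of a state vector of length 2**n."""
    new_state = list(state)
    stride = 1 << target
    for base in range(len(state)):
        if base & stride:
            continue  # handle each amplitude pair once, from its bit-0 member
        i0, i1 = base, base | stride
        a0, a1 = state[i0], state[i1]
        new_state[i0] = gate[0][0] * a0 + gate[0][1] * a1
        new_state[i1] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state


s = 1.0 / math.sqrt(2.0)
hadamard = [[s, s], [s, -s]]

state = [1.0, 0.0, 0.0, 0.0]                    # two qubits in |00>
once = apply_one_qubit_gate(state, hadamard, target=0)
twice = apply_one_qubit_gate(once, hadamard, target=0)
print(once)
print(twice)
```

The gate only ever mixes pairs of amplitudes whose indices differ in the target bit. In a distributed setting this means that, depending on how indices are split across processes, a pair is either entirely local or requires an exchange with exactly one partner process.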
DYNAMIC TASK PARTITIONING MODEL IN PARALLEL COMPUTING
Directory of Open Access Journals (Sweden)
Javed Ali
2012-04-01
Parallel computing systems compose task partitioning strategies in a true multiprocessing manner. Such systems share the algorithm and processing unit as computing resources, which leads to high inter-process communication capabilities. The main part of the proposed algorithm is a resource management unit which performs task partitioning and co-scheduling. In this paper, we present a technique for integrated task partitioning and co-scheduling on a privately owned network. We focus on real-time and non-preemptive systems. A large variety of experiments have been conducted on the proposed algorithm using synthetic and real tasks. The goal of the computation model is to provide a realistic representation of the costs of programming. The results show the benefit of the task partitioning. The main characteristics of our method are optimal scheduling and a strong link between partitioning, scheduling and communication. Some important models for task partitioning are also discussed in the paper. We target an algorithm for task partitioning which improves the inter-process communication between the tasks and uses the resources of the system in an efficient manner. The proposed algorithm contributes to minimizing the inter-process communication cost amongst the executing processes.
Multivariable speed synchronisation for a parallel hybrid electric vehicle drivetrain
Alt, B.; Antritter, F.; Svaricek, F.; Schultalbers, M.
2013-03-01
In this article, a new drivetrain configuration of a parallel hybrid electric vehicle is considered and a novel model-based control design strategy is given. In particular, the control design covers the speed synchronisation task during a restart of the internal combustion engine. The proposed multivariable synchronisation strategy is based on feedforward and decoupled feedback controllers. The performance and the robustness properties of the closed-loop system are illustrated by nonlinear simulation results.
Integrated research of parallel computing: Status and future
Institute of Scientific and Technical Information of China (English)
CHEN GuoLiang; SUN GuangZhong; XU Yun; LONG Bai
2009-01-01
In the past twenty years, the research group at the University of Science and Technology of China has developed an integrated research method for parallel computing, which is a combination of "Architecture-Algorithm-Programming-Application". This method is also called the ecological environment of parallel computing research. In this paper, we survey the current status of the integrated research method for parallel computing and, by combining the impact of multi-core systems, cloud computing and personal high performance computers, we present our outlook on the future development of parallel computing.
CFD research, parallel computation and aerodynamic optimization
Ryan, James S.
1995-01-01
Over five years of research in Computational Fluid Dynamics and its applications are covered in this report. Using CFD as an established tool, aerodynamic optimization on parallel architectures is explored. The objective of this work is to provide better tools to vehicle designers. Submarine design requires accurate force and moment calculations in flow with thick boundary layers and large separated vortices. Low noise production is critical, so flow into the propulsor region must be predicted accurately. The High Speed Civil Transport (HSCT) has been the subject of recent work. This vehicle is to be a passenger vehicle with the capability of cutting overseas flight times by more than half. A successful design must surpass the performance of comparable planes. Fuel economy, other operational costs, environmental impact, and range must all be improved substantially. For all these reasons, improved design tools are required, and these tools must eventually integrate optimization, external aerodynamics, propulsion, structures, heat transfer and other disciplines.
Broadcasting a message in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Archer, Charles J; Faraj, Daniel A
2014-11-18
Methods, systems, and products are disclosed for broadcasting a message in a parallel computer that includes: transmitting, by the logical root to all of the nodes directly connected to the logical root, a message; and for each node except the logical root: receiving the message; if that node is the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received; if that node received the message from a parent node and if that node is not a leaf node, then transmitting the message to all of the child nodes; and if that node received the message from a child node and if that node is not the physical root, then transmitting the message to all of the child nodes except the child node from which the message was received and transmitting the message to the parent node.
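The forwarding rules in the abstract above can be traced with a small simulation. The sketch below is a hypothetical model, not the patented implementation: the tree is given as a parent map (the node with parent None is the physical root), and the function name and example topology are assumptions.

```python
from collections import deque


def broadcast(parent, logical_root):
    """Return the list of (sender, receiver) transmissions for one broadcast."""
    children = {node: [] for node in parent}
    for node, p in parent.items():
        if p is not None:
            children[p].append(node)

    transmissions = []
    queue = deque()

    def send(sender, receiver):
        transmissions.append((sender, receiver))
        queue.append((sender, receiver))

    # The logical root transmits to every directly connected node.
    for child in children[logical_root]:
        send(logical_root, child)
    if parent[logical_root] is not None:
        send(logical_root, parent[logical_root])

    while queue:
        sender, node = queue.popleft()
        if parent[node] is None:
            # Physical root: forward to all children except the sender.
            targets = [c for c in children[node] if c != sender]
        elif sender == parent[node]:
            # Message came from the parent: forward down (nothing for a leaf).
            targets = list(children[node])
        else:
            # Message came from a child: forward to the other children and up.
            targets = [c for c in children[node] if c != sender] + [parent[node]]
        for target in targets:
            send(node, target)
    return transmissions


# Physical root is node 0; the broadcast starts at logical root 1.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
transmissions = broadcast(parent, logical_root=1)
print(transmissions)
```

In this example every node other than the logical root receives the message exactly once, so no duplicate suppression is needed.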
Broadcasting collective operation contributions throughout a parallel computer
Faraj, Ahmad [Rochester, MN]
2012-02-21
Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
A hybrid algorithm for unrelated parallel machines scheduling
Directory of Open Access Journals (Sweden)
Mohsen Shafiei Nikabadi
2016-09-01
In this paper, a new hybrid algorithm based on a multi-objective genetic algorithm (MOGA) using simulated annealing (SA) is proposed for scheduling unrelated parallel machines with sequence-dependent setup times, varying due dates, ready times and precedence relations among jobs. Our objective is to minimize makespan (maximum completion time of all machines), number of tardy jobs, total tardiness and total earliness at the same time, which can be more advantageous in a real environment than considering each of the objectives separately. To obtain an optimal solution, a hybrid algorithm based on MOGA and SA has been proposed in order to gain both good global and local search abilities. Simulation results and four well-known multi-objective performance metrics indicate that the proposed hybrid algorithm outperforms the genetic algorithm (GA) and SA in terms of each objective and significantly in minimizing the total cost of the weighted function.
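The four objectives named above can be evaluated for a given schedule as in the following sketch. It only scores a fixed assignment and is not the MOGA/SA search itself. The data layout, the function name, and the timing convention (setup starts as soon as the machine is free, and processing waits for the ready time) are assumptions; precedence relations are ignored for brevity.

```python
def evaluate_schedule(sequences, proc, setup, ready, due):
    """Score a schedule on unrelated parallel machines.

    sequences[m] : job ids in processing order on machine m
    proc[m][j]   : processing time of job j on machine m
    setup[m][i][j]: setup time on machine m when job j follows job i
    ready[j], due[j]: ready time and due date of job j
    """
    makespan = 0
    tardy_jobs = 0
    total_tardiness = 0
    total_earliness = 0
    for machine, sequence in enumerate(sequences):
        clock = 0
        previous = None
        for job in sequence:
            if previous is not None:
                clock += setup[machine][previous][job]
            clock = max(clock, ready[job]) + proc[machine][job]
            lateness = clock - due[job]
            if lateness > 0:
                tardy_jobs += 1
                total_tardiness += lateness
            else:
                total_earliness += -lateness
            previous = job
        makespan = max(makespan, clock)
    return {"makespan": makespan, "tardy_jobs": tardy_jobs,
            "total_tardiness": total_tardiness,
            "total_earliness": total_earliness}


proc = [[3, 2, 4], [2, 5, 3]]                       # two machines, three jobs
setup = [[[0 if i == j else 1 for j in range(3)] for i in range(3)]
         for _ in range(2)]
ready = [0, 1, 0]
due = [4, 5, 3]
result = evaluate_schedule([[0, 1], [2]], proc, setup, ready, due)
print(result)
```

A weighted sum of these four values would give a single cost of the kind the abstract refers to; the weights themselves are not stated in the source.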
A hybrid algorithm for parallel molecular dynamics simulations
Mangiardi, Chris M
2016-01-01
This article describes an algorithm for hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-ranged forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with AVX and AVX-2 processors as well as Xeon-Phi co-processors.
A hybrid algorithm for parallel molecular dynamics simulations
Mangiardi, Chris M.; Meyer, R.
2017-10-01
This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
Parallel reservoir computing using optical amplifiers.
Vandoorne, Kristof; Dambre, Joni; Verstraeten, David; Schrauwen, Benjamin; Bienstman, Peter
2011-09-01
Reservoir computing (RC), a computational paradigm inspired by neural systems, has become increasingly popular in recent years for solving a variety of complex recognition and classification problems. Thus far, most implementations have been software-based, limiting their speed and power efficiency. Integrated photonics offers the potential for a fast, power efficient and massively parallel hardware implementation. We have previously proposed a network of coupled semiconductor optical amplifiers as an interesting test case for such a hardware implementation. In this paper, we investigate the important design parameters and the consequences of process variations through simulations. We use an isolated word recognition task with babble noise to evaluate the performance of the photonic reservoirs with respect to traditional software reservoir implementations, which are based on leaky hyperbolic tangent functions. Our results show that the use of coherent light in a well-tuned reservoir architecture offers significant performance benefits. The most important design parameters are the delay and the phase shift in the system's physical connections. With optimized values for these parameters, coherent semiconductor optical amplifier (SOA) reservoirs can achieve better results than traditional simulated reservoirs. We also show that process variations hardly degrade the performance, but amplifier noise can be detrimental. This effect must therefore be taken into account when designing SOA-based RC implementations.
Automatic Parallelization Tool: Classification of Program Code for Parallel Computing
Directory of Open Access Journals (Sweden)
Mustafa Basthikodi
2016-04-01
Performance growth of single-core processors has come to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore architectures, along with graphics processing units, have broadly enhanced parallelism. Several compilers have been updated to address the emerging challenges of synchronization and threading. Appropriate program and algorithm classifications will greatly help software engineers find opportunities for effective parallelization. In the present work we investigate current species for the classification of algorithms; related work on classification is discussed along with a comparison of the issues that challenge classification. A set of algorithms is chosen whose structure matches different issues and which perform a given task. We have tested these algorithms using existing automatic species extraction tools along with the Bones compiler. We have added functionality to the existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user-defined types, constants and mathematical functions. With this, we can retain significant data that is not captured by the original species of algorithms. We implemented the new theory in the tool, enabling automatic characterization of program code.
Comparison Of Hybrid Sorting Algorithms Implemented On Different Parallel Hardware Platforms
Directory of Open Access Journals (Sweden)
Dominik Zurek
2013-01-01
Sorting is a common problem in computer science. There are many well-known sorting algorithms created for sequential execution on a single processor. Recent hardware platforms make it possible to create widely parallel algorithms: standard processors consist of multiple cores, and hardware accelerators such as GPUs are available. Graphics cards, with their parallel architecture, offer a new possibility to speed up many algorithms. In this paper we describe the results of implementing several sorting algorithms on GPU cards and multicore processors. A hybrid algorithm is then presented that consists of parts executed on both platforms, the standard CPU and the GPU.
Hybrid soft computing approaches research and applications
Dutta, Paramartha; Chakraborty, Susanta
2016-01-01
The book provides a platform for dealing with the flaws and failings of the soft computing paradigm through different manifestations. The different chapters highlight the necessity of the hybrid soft computing methodology in general with emphasis on several application perspectives in particular. Typical examples include (a) Study of Economic Load Dispatch by Various Hybrid Optimization Techniques, (b) An Application of Color Magnetic Resonance Brain Image Segmentation by ParaOptiMUSIG activation Function, (c) Hybrid Rough-PSO Approach in Remote Sensing Imagery Analysis, (d) A Study and Analysis of Hybrid Intelligent Techniques for Breast Cancer Detection using Breast Thermograms, and (e) Hybridization of 2D-3D Images for Human Face Recognition. The elaborate findings of the chapters enhance the exhibition of the hybrid soft computing paradigm in the field of intelligent computing.
Parallel computing and networking; Heiretsu keisanki to network
Energy Technology Data Exchange (ETDEWEB)
Asakawa, E.; Tsuru, T. [Japan National Oil Corp., Tokyo (Japan); Matsuoka, T. [Japan Petroleum Exploration Co. Ltd., Tokyo (Japan)
1996-05-01
This paper describes the trend of parallel computers used in geophysical exploration. Parallel computers first came into use for geophysical exploration around 1993. At that time they were mainly classified as MIMD (multiple instruction stream, multiple data stream), SIMD (single instruction stream, multiple data stream) and the like. Parallel computers were publicized in the 1994 meeting of the Geophysical Exploration Society as a `high precision imaging technology`. Concerning the library of parallel computers, there was a shift to PVM (parallel virtual machine) in 1993 and to MPI (message passing interface) in 1995. In addition, the compiler of FORTRAN90 was released with support implemented for data parallel and vector computers. In 1993, networks used were Ethernet, FDDI, CDDI and HIPPI. In 1995, the OC-3 products under ATM began to propagate. However, ATM remains an interoffice high-speed network because ATM service has not yet spread to the public network. 1 ref.
Arkin, Ethem; Tekinerdogan, Bedir
2016-01-01
Mapping parallel algorithms to parallel computing platforms requires several activities such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform, the mapping of the algorithm to the logical configuration platform and the implementation of the sou
Parallel CFD design on network-based computer
Cheung, Samson
1995-01-01
Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advanced computational fluid dynamics codes, which can be computationally expensive on mainframe supercomputers. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computing environment utilizing a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package is applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.
Mighell, Kenneth John
2010-01-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called CRBLASTER, which does cosmic-ray rejection of CCD (charge-coupled device) images using the embarrassingly-parallel L.A.COSMIC algorithm. CRBLASTER is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analy...
Domain Decomposition Based High Performance Parallel Computing
Raju, Mandhapati P
2009-01-01
The study deals with the parallelization of finite element based Navier-Stokes codes using domain decomposition and state-of-the-art sparse direct solvers. There has been significant improvement in the performance of sparse direct solvers. Parallel sparse direct solvers are not found to exhibit good scalability. Hence, the parallelization of sparse direct solvers is done using domain decomposition techniques. A highly efficient sparse direct solver, PARDISO, is used in this study. The scalability of both Newton and modified Newton algorithms is tested.
GOTPM: A Parallel Hybrid Particle-Mesh Treecode
Dubinski, J; Park, C; Humble, R J; Dubinski, John; Kim, Juhan; Park, Changbom; Humble, Robin
2004-01-01
We describe a parallel, cosmological N-body code based on a hybrid scheme using the particle-mesh (PM) and Barnes-Hut (BH) oct-tree algorithm. We call the algorithm GOTPM for Grid-of-Oct-Trees-Particle-Mesh. The code is parallelized using the Message Passing Interface (MPI) library and is optimized to run on Beowulf clusters as well as symmetric multi-processors. The gravitational potential is determined on a mesh using a standard PM method with particle forces determined through interpolation. The softened PM force is corrected for short range interactions using a grid of localized BH trees throughout the entire simulation volume in a completely analogous way to P$^3$M methods. This method makes no assumptions about the local density for short range force corrections and so is consistent with the results of the P$^3$M method in the limit that the treecode opening angle parameter, $\theta \to 0$. (abridged)
Robust control of a parallel hybrid drivetrain with a CVT
Energy Technology Data Exchange (ETDEWEB)
Mayer, T.; Schroeder, D. [Technical Univ. of Munich (Germany)
1996-09-01
In this paper the design of a robust control system for a parallel hybrid drivetrain is presented. The drivetrain is based on a continuously variable transmission (CVT) and is therefore a highly nonlinear multiple-input-multiple-output system (MIMO-System). Input-Output-Linearization offers the possibility of linearizing and of decoupling the system. Since for example the vehicle mass varies with the load and the efficiency of the gearbox depends strongly on the actual working point, an exact linearization of the plant will mostly fail. Therefore a robust control algorithm based on sliding mode is used to control the drivetrain.
Bogdanov, P. B.; Gorobets, A. V.; Sukov, S. A.
2013-08-01
The design of efficient algorithms for large-scale gas dynamics computations with hybrid (heterogeneous) computing systems whose high performance relies on massively parallel accelerators is addressed. A high-order accurate finite volume algorithm with polynomial reconstruction on unstructured hybrid meshes is used to compute compressible gas flows in domains of complex geometry. The basic operations of the algorithm are implemented in detail for massively parallel accelerators, including AMD and NVIDIA graphics processing units (GPUs). Major optimization approaches and a computation transfer technique are covered. The underlying programming tool is the Open Computing Language (OpenCL) standard, which performs on accelerators of various architectures, both existing and emerging.
Fast Parallel Computation of Polynomials Using Few Processors
DEFF Research Database (Denmark)
Valiant, Leslie G.; Skyum, Sven; Berkowitz, S.;
1983-01-01
It is shown that any multivariate polynomial of degree $d$ that can be computed sequentially in $C$ steps can be computed in parallel in $O((\log d)(\log C + \log d))$ steps using only $(Cd)^{O(1)}$ processors.
Fast parallel computation of polynomials using few processors
DEFF Research Database (Denmark)
Valiant, Leslie; Skyum, Sven
1981-01-01
It is shown that any multivariate polynomial that can be computed sequentially in $C$ steps and has degree $d$ can be computed in parallel in $O((\log d)(\log C + \log d))$ steps using only $(Cd)^{O(1)}$ processors.
A hybrid computational grid architecture for comparative genomics.
Singh, Aarti; Chen, Chen; Liu, Weiguo; Mitchell, Wayne; Schmidt, Bertil
2008-03-01
Comparative genomics provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved among species, as well as genes that give each organism its unique characteristics. However, the huge datasets involved makes this approach impractical on traditional computer architectures leading to prohibitively long runtimes. In this paper, we present a new computational grid architecture based on a hybrid computing model to significantly accelerate comparative genomics applications. The hybrid computing model consists of two types of parallelism: coarse grained and fine grained. The coarse-grained parallelism uses a volunteer computing infrastructure for job distribution, while the fine-grained parallelism uses commodity computer graphics hardware for fast sequence alignment. We present the deployment and evaluation of this approach on our grid test bed for the all-against-all comparison of microbial genomes. The results of this comparison are then used by phenotype--genotype explorer (PheGee). PheGee is a new tool that nominates candidate genes responsible for a given phenotype.
Advances in Domain Mapping of Massively Parallel Scientific Computations
Energy Technology Data Exchange (ETDEWEB)
Leland, Robert W. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Hendrickson, Bruce A. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2015-10-01
One of the most important concerns in parallel computing is the proper distribution of workload across processors. For most scientific applications on massively parallel machines, the best approach to this distribution is to employ data parallelism; that is, to break the datastructures supporting a computation into pieces and then to assign those pieces to different processors. Collectively, these partitioning and assignment tasks comprise the domain mapping problem.
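As an illustration of the data-parallel idea described above (break a data structure into pieces, then assign the pieces to processors), the sketch below performs the simplest possible one-dimensional block partition. It is an assumption made for illustration and not one of the domain mapping methods of the report; the name block_partition is hypothetical.

```python
from __future__ import annotations


def block_partition(n_items: int, n_procs: int) -> list[range]:
    """Split indices 0..n_items-1 into n_procs contiguous blocks whose
    sizes differ by at most one, and return one index range per processor."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    base, extra = divmod(n_items, n_procs)
    parts = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks each take one additional item.
        size = base + (1 if rank < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


# Ten data items mapped onto four processors.
parts = block_partition(10, 4)
```

A contiguous block scheme like this balances the item count but ignores communication cost between pieces, which is the harder part of the mapping problem.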
Chen, M.; Wei, S.
2016-12-01
The serious damage to Mexico City caused by the 1985 Michoacan earthquake 400 km away indicates that urban areas may be affected by remote earthquakes. To assess the earthquake risk that distant earthquakes pose to urban areas, we developed a hybrid Frequency Wavenumber (FK) and Finite Difference (FD) code implemented with MPI, since computing seismic wave propagation from a distant earthquake with a single numerical method (e.g. Finite Difference, Finite Element or Spectral Element) is very expensive. In our approach, we compute the incident wave field (ud) at the boundaries of the excitation box, which surrounds the local structure, using a parallelized FK method (Zhu and Rivera, 2002), and compute the total wave field (u) within the excitation box using a parallelized 2D FD method. We apply a perfectly matched layer (PML) absorbing condition to the diffracted wave field (u-ud). Compared to previous Generalized Ray Theory and Finite Difference (Wen and Helmberger, 1998), Frequency Wavenumber and Spectral Element (Tong et al., 2014), and Direct Solution Method and Spectral Element (Monteiller et al., 2013) hybrid methods, our absorbing boundary condition dramatically suppresses the numerical noise. The MPI implementation of our method can greatly speed up the calculation. In addition, our hybrid method has potential use in high-resolution array imaging similar to Tong et al. (2014).
Predictive control strategy for power management in parallel hybrid-electric vehicle
DEFF Research Database (Denmark)
Nodeh, Mohammad Taqi; Gholizade, Hossein; Hajizadeh, Amin
2016-01-01
In this paper, a hybrid model-based nonlinear optimal control method is used to compute the optimal power distribution and power management in parallel hybrid electric vehicles. In the power management strategy, for optimal power distribution between the internal combustion engine, electrical...... system and the other subsystems, nonlinear predictive control is applied. In order to achieve this goal, a hierarchical control structure is utilized. This type of control structure consists of three levels of monitoring, coordinating and local controllers. Nonlinear modeling and performance index...... in the proposed method should be formulated at the regulatory level of the controller. Discrete dynamic mode of operation (motor-generator) in hybrid electric vehicle requires to use a dual-mode switch model and to define an alternative expression of performance index for the optimal control problem...
Checkpointing for a hybrid computing node
Energy Technology Data Exchange (ETDEWEB)
Cher, Chen-Yong
2016-03-08
According to an aspect, a method for checkpointing in a hybrid computing node includes executing a task in a processing accelerator of the hybrid computing node. A checkpoint is created in a local memory of the processing accelerator. The checkpoint includes state data to restart execution of the task in the processing accelerator upon a restart operation. Execution of the task is resumed in the processing accelerator after creating the checkpoint. The state data of the checkpoint are transferred from the processing accelerator to a main processor of the hybrid computing node while the processing accelerator is executing the task.
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. Graphics processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of a parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and an identical algorithm designed for a single CPU (MCPEMCPU) were developed using MATLAB on a single computer equipped with dual Xeon 6-Core E5690 CPUs and an NVIDIA Tesla C2070 GPU parallel computing card containing 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of the parallelized MCPEM algorithm developed in this study holds great promise as the core for the next generation of modeling software for population PK/PD analysis.
A Hybrid, Parallel Krylov Solver for MODFLOW using Schwarz Domain Decomposition
Sutanudjaja, E.; Verkaik, J.; Hughes, J. D.
2015-12-01
In order to support decision makers in solving hydrological problems, detailed high-resolution models are often needed. These models typically consist of a large number of computational cells and have large memory requirements and long run times. An efficient technique for obtaining realistic run times and memory requirements is parallel computing, where the problem is divided over multiple processor cores. The new Parallel Krylov Solver (PKS) for MODFLOW-USG is presented. It combines both distributed memory parallelization by the Message Passing Interface (MPI) and shared memory parallelization by Open Multi-Processing (OpenMP). PKS includes conjugate gradient and biconjugate gradient stabilized linear accelerators that are both preconditioned by an overlapping additive Schwarz preconditioner in a way that: a) subdomains are partitioned using the METIS library; b) each subdomain uses local memory only and communicates with other subdomains by MPI within the linear accelerator; c) the solver is fully integrated in the MODFLOW-USG code. PKS is based on the unstructured PCGU-solver, and supports OpenMP. Depending on the available hardware, PKS can run exclusively with MPI, exclusively with OpenMP, or with a hybrid MPI/OpenMP approach. Benchmarks were performed on the Cartesius Dutch supercomputer (https://userinfo.surfsara.nl/systems/cartesius) using up to 144 cores, for a synthetic test (~112 million cells) and the Indonesia groundwater model (~4 million 1km cells). The latter, which includes all islands in the Indonesian archipelago, was built using publicly available global datasets, and is an ideal test bed for evaluating the applicability of PKS parallelization techniques to a global groundwater model consisting of multiple continents and islands. Results show that run time reductions can be greatest with the hybrid parallelization approach for the problems tested.
Parallel image computation in clusters with task-distributor.
Baun, Christian
2016-01-01
Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computationally expensive and memory-consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application.
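The relation between sequential and parallel program parts and the achievable scaling is often summarized by Amdahl's law. The sketch below is a generic illustration of that law under assumed inputs, not the analysis performed in the paper; the function name and the sample values are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the runtime
    parallelizes perfectly over `n_workers` and the rest stays sequential."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Assumed example: 90% of the runtime is parallel rendering work,
# the remaining 10% is sequential work such as composing the final image.
speedup_8_nodes = amdahl_speedup(0.9, 8)
speedup_64_nodes = amdahl_speedup(0.9, 64)
```

With a fixed sequential share, adding nodes yields diminishing returns; a larger problem size typically raises the parallel fraction and therefore improves scaling.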
Reachability computation for hybrid systems with Ariadne
L. Benvenuti; D. Bresolin; A. Casagrande; P.J. Collins (Pieter); A. Ferrari; E. Mazzi; T. Villa; A. Sangiovanni-Vincentelli
2008-01-01
Ariadne is an in-progress open environment to design algorithms for computing with hybrid automata, that relies on a rigorous computable analysis theory to represent geometric objects, in order to achieve provable approximation bounds along the computations. In this paper we discuss the
Computer code for intraply hybrid composite design
Chamis, C. C.; Sinclair, J. H.
1981-01-01
A computer program has been developed and is described herein for intraply hybrid composite design (INHYD). The program includes several composite micromechanics theories, intraply hybrid composite theories and a hygrothermomechanical theory. These theories provide INHYD with considerable flexibility and capability which the user can exercise through several available options. Key features and capabilities of INHYD are illustrated through selected samples.
Directory of Open Access Journals (Sweden)
Qiang Lü
BACKGROUND: Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy is found by the searching procedure, the correct protein structure is not guaranteed to be obtained. RESULTS: A general parallel metaheuristic approach is presented to tackle the above two problems. Multiple energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed while the searching threads are running in parallel, with each thread searching for the lowest energy value determined by an individual energy function. By hybridizing the intelligence of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving the PSP problem. CONCLUSIONS: This parallel approach combines various sources of both searching intelligence and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine the different searching intelligence embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions, which are usually derived from domain expertise.
Distributing an executable job load file to compute nodes in a parallel computer
Gooding, Thomas M.
2016-08-09
Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications link over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.
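The selection rule described above (a compute node reports to its parent when it or one of its descendants participates in the job) can be illustrated with a small tree traversal. This is a hypothetical in-memory sketch assuming the node tree is available as a dictionary; it is not the message-based procedure of the patent, and class_route, children and participants are names introduced here for illustration.

```python
from __future__ import annotations


def class_route(children: dict[int, list[int]], root: int,
                participants: set[int]) -> set[int]:
    """Return every node that participates in the job or has a
    participating descendant, i.e. the nodes a broadcast must pass through."""
    route: set[int] = set()

    def visit(node: int) -> bool:
        on_route = node in participants
        for child in children.get(node, []):
            # Visit every child (no short-circuit) so each subtree is examined.
            if visit(child):
                on_route = True
        if on_route:
            route.add(node)
        return on_route

    visit(root)
    return route


# Assumed example tree: node 0 is the root, nodes 3, 4 and 5 are leaves.
children = {0: [1, 2], 1: [3, 4], 2: [5]}
route = class_route(children, root=0, participants={2, 3})
```

In this example nodes 0 and 1 do not run the job themselves but lie on the path to node 3, so they are kept on the route, while nodes 4 and 5 are left out.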
Distributing an executable job load file to compute nodes in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Gooding, Thomas M.
2016-09-13
Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications link over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.
Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che
2014-01-16
To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high
2014-01-01
Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-10-29
Data communications in a parallel active messaging interface ('PAMI') of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.
CaKernel – A Parallel Application Programming Framework for Heterogenous Computing Architectures
Directory of Open Access Journals (Sweden)
Marek Blazewicz
2011-01-01
With the recent advent of new heterogeneous computing architectures there is still a lack of parallel problem solving environments that can help scientists use hybrid supercomputers easily and efficiently. Many scientific simulations that use structured grids to solve partial differential equations in fact rely on stencil computations. Stencil computations have become crucial in solving many challenging problems in various domains, e.g., engineering or physics. Although many parallel stencil computing approaches have been proposed, in most cases they solve only particular problems. As a result, scientists struggle when it comes to implementing a new stencil-based simulation, especially on high performance hybrid supercomputers. In response to this need we extend our previous work on a parallel programming framework for CUDA – CaCUDA that now supports OpenCL. We present CaKernel – a tool that simplifies the development of parallel scientific applications on hybrid systems. CaKernel is built on the highly scalable and portable Cactus framework. In the CaKernel framework, Cactus manages the inter-process communication via MPI while CaKernel manages the code running on Graphics Processing Units (GPUs) and interactions between them. As a non-trivial test case we have developed a 3D CFD code to demonstrate the performance and scalability of the automatically generated code.
Implementation of GAMMA on a Massively Parallel Computer
Institute of Scientific and Technical Information of China (English)
黄林鹏; 童维勤; 等
1997-01-01
The GAMMA paradigm was recently proposed by Banatre and Metayer to describe the systematic construction of parallel programs without introducing artificial sequentiality. This paper presents two synchronous execution models for GAMMA and discusses how to implement them on the MasPar MP-1, a massively data parallel computer. The results show that the GAMMA paradigm can be implemented very naturally on data parallel machines, and that a very high level language such as GAMMA, in which parallelism is left implicit, is suitable for specifying massively parallel applications.
Parallel genetic algorithms with migration for the hybrid flow shop scheduling problem
Directory of Open Access Journals (Sweden)
K. Belkadi
2006-01-01
This paper addresses scheduling problems in hybrid flow shop-like systems with a migration parallel genetic algorithm (PGA_MIG). This parallel genetic algorithm model allows genetic diversity by the application of selection and reproduction mechanisms nearer to nature. The space structure of the population is modified by dividing it into disjoint subpopulations. From time to time, individuals are exchanged between the different subpopulations (migration). The influence of parameters and dedicated strategies is studied. These parameters are the number of independent subpopulations, the interconnection topology between subpopulations, the choice/replacement strategy of the migrant individuals, and the migration frequency. A comparison between the sequential and parallel versions of the genetic algorithm (GA) is provided. This comparison relates to the quality of the solution and the execution time of the two versions. The efficiency of the parallel model depends strongly on the parameters and especially on the migration frequency. Likewise, the parallel model gives a significant improvement in computational time if it is implemented on a parallel architecture that offers an acceptable number of processors (as many processors as subpopulations).
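The migration step described in this abstract can be illustrated with a minimal sketch. It assumes a ring topology, a best-replaces-worst strategy, and a fitness to be minimized (for example a makespan); the abstract does not say which topology or replacement strategy PGA_MIG uses, so the function name `migrate_ring` and all of its parameters are hypothetical.

```python
from typing import Callable, List

Individual = List[int]  # hypothetical encoding, e.g. a job permutation


def migrate_ring(
    subpops: List[List[Individual]],
    fitness: Callable[[Individual], float],
    n_migrants: int = 1,
) -> List[List[Individual]]:
    """One migration step on a ring of subpopulations (islands).

    Island i sends copies of its n_migrants best individuals to island
    (i + 1) % n, where they replace the worst individuals.
    Lower fitness is treated as better (assumption).
    """
    n = len(subpops)
    # Pick the emigrants of every island before any island is modified.
    emigrants = [sorted(pop, key=fitness)[:n_migrants] for pop in subpops]
    new_subpops = []
    for i, pop in enumerate(subpops):
        incoming = emigrants[(i - 1) % n]
        # Keep the best residents so that the island size stays constant.
        survivors = sorted(pop, key=fitness)[: len(pop) - len(incoming)]
        new_subpops.append(survivors + [list(ind) for ind in incoming])
    return new_subpops
```

In an island-model run, a step like this would be called every few generations; that interval is the migration frequency on which, according to the abstract, the efficiency of the parallel model depends most.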
Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.
2017-07-01
Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).
Dynamic traffic assignment on parallel computers
Energy Technology Data Exchange (ETDEWEB)
Nagel, K.; Frye, R.; Jakob, R.; Rickert, M.; Stretz, P.
1998-12-01
The authors describe part of the current framework of the TRANSIMS traffic research project at the Los Alamos National Laboratory. It includes parallel implementations of a route planner and a microscopic traffic simulation model. They present performance figures and results of an offline load-balancing scheme used in one of the iterative re-planning runs required for dynamic route assignment.
Resource Centered Computing delivering high parallel performance
2014-01-01
Modern parallel programming requires a combination of different paradigms, expertise and tuning, that correspond to the different levels in today's hierarchical architectures. To cope with the inherent difficulty, ORWL (ordered read-write locks) presents a new paradigm and toolbox centered around local or remote resources, such as data, processors or accelerators. ORWL programmers describe their computation in terms of access to these resources during critical sections. Exclu...
Universal blind quantum computation for hybrid system
Huang, He-Liang; Bao, Wan-Su; Li, Tan; Li, Feng-Guang; Fu, Xiang-Qun; Zhang, Shuo; Zhang, Hai-Long; Wang, Xiang
2017-08-01
As the development of quantum computers continues to advance, first-generation practical quantum computers will become available to ordinary users in a cloud style similar to IBM's Quantum Experience today. Clients can remotely access the quantum servers using simple devices. In such a situation, it is of prime importance to keep the client's information secure. Blind quantum computation protocols enable a client with limited quantum technology to delegate her quantum computation to a quantum server without leaking any private information. To date, blind quantum computation has been considered only for an individual quantum system. However, a practical universal quantum computer is likely to be a hybrid system. Here, we take the first step to construct a framework of blind quantum computation for the hybrid system, which provides a more feasible way for scalable blind quantum computation.
Parallel computing in genomic research: advances and applications
Directory of Open Access Journals (Sweden)
Ocaña K
2015-11-01
Kary Ocaña (National Laboratory of Scientific Computing, Petrópolis, Rio de Janeiro) and Daniel de Oliveira (Institute of Computing, Fluminense Federal University, Niterói, Brazil). Abstract: Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied to reduce the total processing time and to ease the management, treatment, and analysis of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing units requires expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists to process their genomic experiments using HPC capabilities and parallelism techniques. This article presents a systematic review of the literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. Keywords: high-performance computing, genomic research, cloud computing, grid computing, cluster computing, parallel computing
Computing Optimal Cycle Mean in Parallel on CUDA
Directory of Open Access Journals (Sweden)
Jiří Barnat
2011-10-01
Computation of optimal cycle mean in a directed weighted graph has many applications in program analysis, performance verification in particular. In this paper we propose a data-parallel algorithmic solution to the problem and show how the computation of optimal cycle mean can be efficiently accelerated by means of CUDA technology. We show how the problem of computation of optimal cycle mean is decomposed into a sequence of data-parallel graph computation primitives and show how these primitives can be implemented and optimized for CUDA computation. Finally, we report a fivefold experimental speed up on graphs representing models of distributed systems when compared to best sequential algorithms.
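For orientation, the quantity being computed can be shown with a small sequential sketch. The code below is Karp's classical algorithm for the minimum cycle mean, not the data-parallel CUDA method of the paper, which the abstract does not detail. The function name `min_cycle_mean` and the edge-list input format are assumptions made for this illustration.

```python
from math import inf
from typing import List, Optional, Tuple


def min_cycle_mean(n: int, edges: List[Tuple[int, int, float]]) -> Optional[float]:
    """Karp's algorithm: minimum mean weight over all directed cycles.

    n      number of vertices, labelled 0..n-1
    edges  list of (source, target, weight)
    Returns None when the graph has no cycle.
    """
    # d[k][v] = minimum weight of a walk with exactly k edges ending at v.
    # Starting every vertex at 0 acts like a virtual source with zero-weight
    # edges to all vertices, so the graph need not be strongly connected.
    d = [[inf] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:  # one relaxation round over all edges
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = inf
    for v in range(n):
        if d[n][v] == inf:
            continue  # no walk of length n ends here
        worst = max(
            (d[n][v] - d[k][v]) / (n - k) for k in range(n) if d[k][v] < inf
        )
        best = min(best, worst)
    return None if best == inf else best
```

Each relaxation round touches every edge independently of the others, which is the kind of step a data-parallel formulation could map onto GPU threads; how the paper actually decomposes the problem is not specified here.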
Basic design of parallel computational program for probabilistic structural analysis
Energy Technology Data Exchange (ETDEWEB)
Kaji, Yoshiyuki; Arai, Taketoshi [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment; Gu, Wenwei; Nakamura, Hitoshi
1999-06-01
In our laboratory, for the 'development of damage evaluation method of structural brittle materials by microscopic fracture mechanics and probabilistic theory' (nuclear computational science cross-over research), we examine computational methods for a super parallel computation system that couples material strength theory, based on microscopic fracture mechanics for latent cracks, with a continuum structural model, in order to develop new structural reliability evaluation methods for ceramic structures. This technical report reviews probabilistic structural mechanics theory, the basic terms of the formulation, and the parallel computation programming methods related to the principal items in the basic design of the computational mechanics program. (author)
Stochastic Optimal Control of Parallel Hybrid Electric Vehicles
Directory of Open Access Journals (Sweden)
Feiyan Qin
2017-02-01
Energy management strategies (EMSs) in hybrid electric vehicles (HEVs) are highly related to fuel economy and emission performance. However, EMS constitutes a challenging problem due to the complex structure of an HEV and the unknown or partially known driving cycles. To address this problem, this paper adopts a stochastic dynamic programming (SDP) method for the EMS of a specially designed vehicle, a pre-transmission single-shaft torque-coupling parallel HEV. In this parallel HEV, the auto clutch output is connected to the transmission input through an electric motor, which benefits an efficient motor assist operation. In this EMS, the driver's demanded torque is modeled as a one-state Markov process to represent the uncertainty of future driving situations. The obtained EMS has been evaluated with ADVISOR2002 over two standard government drive cycles and a self-defined one, and compared with a dynamic programming (DP) one and a rule-based one. Simulation results have shown the real-time performance of the proposed approach, and potential vehicle performance improvement relative to the rule-based one.
Mathematical model partitioning and packing for parallel computer calculation
Arpasi, Dale J.; Milner, Edward J.
1986-01-01
This paper deals with the development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system. The identification of computational parallelism within the model equations is discussed. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. Next, an algorithm that packs the equations into a minimum number of processors is described. The results of applying the packing algorithm to a turboshaft engine model are presented.
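The packing step can be pictured as a bin-packing problem. The sketch below uses the generic first-fit-decreasing heuristic as a stand-in, since the abstract does not describe the paper's own packing algorithm; it also ignores data dependencies between equations. The per-equation costs, the per-processor time `budget`, and the function name are hypothetical.

```python
from typing import Dict, List


def pack_first_fit_decreasing(costs: Dict[str, float], budget: float) -> List[List[str]]:
    """Assign equations to processors so that no processor exceeds `budget`.

    costs   estimated compute time of each equation (hypothetical units)
    budget  time available on each processor per simulation step
    Returns one list of equation names per processor used.
    First-fit decreasing is a heuristic: it gives a small, but not always
    the minimum, number of processors.
    """
    processors: List[List[str]] = []
    loads: List[float] = []
    # Place the most expensive equations first.
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        if cost > budget:
            raise ValueError(f"equation {name!r} alone exceeds the budget")
        for i, load in enumerate(loads):
            if load + cost <= budget:
                processors[i].append(name)
                loads[i] += cost
                break
        else:
            # No existing processor has room: open a new one.
            processors.append([name])
            loads.append(cost)
    return processors
```

A real partitioner for a simulation would also have to respect the precedence between equations identified in the parallelism analysis, which this sketch leaves out.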
Performance of Air Pollution Models on Massively Parallel Computers
DEFF Research Database (Denmark)
Brown, John; Hansen, Per Christian; Wasniewski, Jerzy
1996-01-01
To compare the performance and use of three massively parallel SIMD computers, we implemented a large air pollution model on the computers. Using a realistic large-scale model, we gain detailed insight about the performance of the three computers when used to solve large-scale scientific problems that involve several types of numerical computations. The computers considered in our study are the Connection Machines CM-200 and CM-5, and the MasPar MP-2216...
Hybrid Parallel Programming Models for AMR Neutron Monte-Carlo Transport
Dureau, David; Poëtte, Gaël
2014-06-01
This paper deals with High Performance Computing (HPC) applied to neutron transport theory on complex geometries, thanks to both an Adaptive Mesh Refinement (AMR) algorithm and a Monte-Carlo (MC) solver. Several Parallelism models are presented and analyzed in this context, among them shared memory and distributed memory ones such as Domain Replication and Domain Decomposition, together with Hybrid strategies. The study is illustrated by weak and strong scalability tests on complex benchmarks on several thousands of cores thanks to the petaflopic supercomputer Tera100.
Software metrics for green parallel computing of big data systems
Gurbuz, Havva Gulay; Tekinerdogan, Bedir
2016-01-01
Big Data is typically organized around a distributed file system on top of which the parallel algorithms can be executed for realizing the Big Data analytics. In general, the parallel algorithms can be mapped in different alternative ways to the computing platform. Hereby each alternative will
Combined Scheduling and Mapping for Scalable Computing with Parallel Tasks
Directory of Open Access Journals (Sweden)
Jörg Dümmler
2012-01-01
Recent and future parallel clusters and supercomputers use symmetric multiprocessors (SMPs) and multi-core processors as basic nodes, providing a huge amount of parallel resources. These systems often have hierarchically structured interconnection networks combining computing resources at different levels, starting with the interconnect within multi-core processors up to the interconnection network combining nodes of the cluster or supercomputer. The challenge for the programmer is that these computing resources should be utilized efficiently by exploiting the available degree of parallelism of the application program and by structuring the application in a way which is sensitive to the heterogeneous interconnect. In this article, we pursue a parallel programming method using parallel tasks to structure parallel implementations. A parallel task can be executed by multiple processors or cores and, for each activation of a parallel task, the actual number of executing cores can be adapted to the specific execution situation. In particular, we propose a new combined scheduling and mapping technique for parallel tasks with dependencies that takes the hierarchical structure of modern multi-core clusters into account. An experimental evaluation shows that the presented programming approach can lead to a significantly higher performance compared to standard data parallel implementations.
Research in Parallel Algorithms and Software for Computational Aerosciences
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics (CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Parallel hybrid algorithm for solution in electrical impedance equation
Ponomaryov, Volodymyr; Robles-Gonzalez, Marco; Bucio-Ramirez, Ariana; Ramirez-Tachiquin, Marco; Ramos-Diaz, Eduardo
2015-02-01
This work is dedicated to the analysis of the forward and the inverse problems, with the aim of obtaining a better approximation to the Electrical Impedance Tomography equation. For the forward problem we employ a numerical method based on Taylor series in formal powers, and for the inverse problem the Finite Element Method. For the forward problem we propose a novel algorithm that employs a regularization technique for stability; in addition, parallel computing is used to obtain the solution faster. This modification yields an efficient solution of the forward problem. The solution found is then used in the inverse problem, which is approximated with the Finite Element Method. The algorithms employed in this work are developed in the structured programming paradigm in C++, including parallel processing; the run-time analysis is performed only for the forward problem because the Finite Element Method, owing to its highly recursive nature, does not admit parallelism. Several examples are analyzed, in which different conductivity functions are employed for two cases: for the analytical cases, exponential and sinusoidal functions are used, and for the geometrical cases, the circle at center and the five-disk structure are used as conductivity functions. The Lebesgue measure is used as the metric for error estimation in the forward problem, while in the inverse problem the PSNR, SSIM, and MSE criteria are applied to determine the convergence of both methods.
Parallel computing, failure recovery, and extreme values
DEFF Research Database (Denmark)
Andersen, Lars Nørvang; Asmussen, Søren
A task of random size $T$ is split into $M$ subtasks of lengths $T_1, \ldots, T_M$, each of which is sent to one out of $M$ parallel processors. Each processor may fail at a random time before completing its allocated task, and then has to restart it from the beginning. If $X_1, \ldots, X_M$ are the total task times at the $M$ processors, the overall total task time is then $Z_M = \max_{i=1,\ldots,M} X_i$. Limit theorems as $M \to \infty$ are given for $Z_M$, allowing the distribution of $T$ to depend on $M$. In some cases the limits are classical extreme value distributions, in others they are of a different type.
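The restart model in this abstract can be made concrete with a short sketch. It takes the failure times of successive attempts as given inputs rather than drawing them from a distribution, so the result is deterministic; the function names and the convention that a failure arriving exactly at completion does not interrupt the task are assumptions of this illustration.

```python
from typing import Iterable, List, Sequence


def time_with_restarts(task_len: float, failure_times: Iterable[float]) -> float:
    """Total time one processor needs to finish a subtask of length task_len.

    failure_times[j] is the time at which attempt j would fail. An attempt
    that fails before task_len loses its work and the subtask restarts from
    the beginning; the first attempt that survives completes the subtask.
    """
    total = 0.0
    for f in failure_times:
        if f >= task_len:
            return total + task_len  # this attempt runs to completion
        total += f  # time wasted on the failed attempt
    raise ValueError("ran out of failure times before the subtask completed")


def overall_time(subtasks: Sequence[float], failures: Sequence[List[float]]) -> float:
    """Overall completion time: the maximum of the per-processor totals."""
    return max(time_with_restarts(t, f) for t, f in zip(subtasks, failures))
```

For example, a subtask of length 3 whose first two attempts fail at times 1 and 2.5 takes 1 + 2.5 + 3 = 6.5 time units in total, and the overall time is governed by the slowest processor.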
Parallel Computing Methods For Particle Accelerator Design
Popescu, Diana Andreea; Hersch, Roger
We present methods for parallelizing the transport map construction for multi-core processors and for Graphics Processing Units (GPUs). We provide an efficient implementation of the transport map construction. We describe a method for multi-core processors using the OpenMP framework which brings performance improvement over the serial version of the map construction. We developed a novel and efficient algorithm for multivariate polynomial multiplication for GPUs and we implemented it using the CUDA framework. We show the benefits of using the multivariate polynomial multiplication algorithm for GPUs in the map composition operation for high orders. Finally, we present an algorithm for map composition for GPUs.
History Matching in Parallel Computational Environments
Energy Technology Data Exchange (ETDEWEB)
Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav
2004-08-31
In the probabilistic approach for history matching, the information from the dynamic data is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, fluid properties, well configuration, flow constraints on wells, etc. This implies that the probabilistic approach should update different regions of the reservoir in different ways, which necessitates delineation of multiple reservoir domains in order to increase the accuracy of the approach. The research focuses on a probabilistic approach to integrate dynamic data that ensures consistency between reservoir models developed from one stage to the next. The algorithm relies on efficient parameterization of the dynamic data integration problem and permits rapid assessment of the updated reservoir model at each stage. The report also outlines various domain decomposition schemes from the perspective of increasing the accuracy of the probabilistic approach to history matching. Research progress in three important areas of the project is discussed: (1) validation and testing of the probabilistic approach to incorporating production data in reservoir models; (2) development of a robust scheme for identifying reservoir regions that will result in a more robust parameterization of the history matching process; (3) testing of commercial simulators for parallel capability and development of a parallel algorithm for history matching.
Katouda, Michio; Nakajima, Takahito
2013-12-10
A new algorithm for massively parallel calculations of electron correlation energy of large molecules based on the resolution of identity second-order Møller-Plesset perturbation (RI-MP2) technique is developed and implemented into the quantum chemistry software NTChem. In this algorithm, a Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) hybrid parallel programming model is applied to attain efficient parallel performance on massively parallel supercomputers. An in-core storage scheme of intermediate data of three-center electron repulsion integrals utilizing the distributed memory is developed to eliminate input/output (I/O) overhead. The parallel performance of the algorithm is tested on massively parallel supercomputers such as the K computer (using up to 45 992 central processing unit (CPU) cores) and a commodity Intel Xeon cluster (using up to 8192 CPU cores). The parallel RI-MP2/cc-pVTZ calculation of two-layer nanographene sheets (C150H30)2 (number of atomic orbitals is 9640) is performed using 8991 nodes and 71 288 CPU cores of the K computer.
A Case Study of a Hybrid Parallel 3D Surface Rendering Graphics Architecture
DEFF Research Database (Denmark)
Holten-Lund, Hans Erik; Madsen, Jan; Pedersen, Steen
1997-01-01
This paper presents a case study in the design strategy used in building a graphics computer for drawing very complex 3D geometric surfaces. The goal is to build a PC based computer system capable of handling surfaces built from about 2 million triangles, and to be able to render a perspective view of these on a computer display at interactive frame rates, i.e. processing around 50 million triangles per second. The paper presents a hardware/software architecture called HPGA (Hybrid Parallel Graphics Architecture) which is likely to be able to carry out this task. The case study focuses on techniques to increase the clock frequency as well as the parallelism of the system. This paper focuses on the back-end graphics pipeline, which is responsible for rasterizing triangles. A pure software implementation of the proposed architecture is currently able to process 300...
Accelerating Climate and Weather Simulations through Hybrid Computing
Zhou, Shujia; Cruz, Carlos; Duffy, Daniel; Tucker, Robert; Purcell, Mark
2011-01-01
Unconventional multi- and many-core processors (e.g. IBM (R) Cell B.E.(TM) and NVIDIA (R) GPU) have emerged as effective accelerators in trial climate and weather simulations. Yet these climate and weather models typically run on parallel computers with conventional processors (e.g. Intel, AMD, and IBM) using Message Passing Interface. To address challenges involved in efficiently and easily connecting accelerators to parallel computers, we investigated using IBM's Dynamic Application Virtualization (TM) (IBM DAV) software in a prototype hybrid computing system with representative climate and weather model components. The hybrid system comprises two Intel blades and two IBM QS22 Cell B.E. blades, connected with both InfiniBand(R) (IB) and 1-Gigabit Ethernet. The system significantly accelerates a solar radiation model component by offloading compute-intensive calculations to the Cell blades. Systematic tests show that IBM DAV can seamlessly offload compute-intensive calculations from Intel blades to Cell B.E. blades in a scalable, load-balanced manner. However, noticeable communication overhead was observed, mainly due to IP over the IB protocol. Full utilization of IB Sockets Direct Protocol and the lower latency production version of IBM DAV will reduce this overhead.
Partitioning problems in parallel, pipelined, and distributed computing
Bokhari, Shahid H.
1988-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple-computer system is addressed. A sum-bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple-satellite system: partitioning multiple chain-structured parallel programs, multiple arbitrarily structured serial programs, and single-tree structured parallel programs. In addition, the problem of partitioning chain-structured parallel programs across chain-connected systems is solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple-computer architectures for a wide range of problems of practical interest.
Fitting equation of state parameters in parallel computers
Directory of Open Access Journals (Sweden)
M. Castier
2014-12-01
This work compares two strategies to fit parameters of equations of state in parallel computers, emphasizing solutions that require few changes to existing sequential programs. One strategy uses the conventional Nelder-Mead algorithm coupled with parallel objective function evaluation (SSPO). The other strategy uses a parallel Nelder-Mead algorithm coupled with sequential objective function evaluation (PSSO). The PSSO strategy, which executes parallel one-dimensional searches during each iteration, is simpler to implement and converged to parameter sets with objective functions smaller than those obtained by the SSPO strategy. The SSPO strategy produced speedups consistent with the number of processes used and is more suitable when many processors are available. Both strategies are potentially useful and choosing between them is a matter of convenience, depending on the problem at hand. With parallel computers increasingly available, the easy implementation and convenience of these two strategies should appeal to developers and users of thermodynamic models.
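The idea behind the first strategy, a sequential search that calls an objective function evaluated in parallel, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the objective is taken to be a sum of squared residuals over data points, the data are split into chunks, and a thread pool stands in for the parallel processes. The names `parallel_objective`, `chunk_sse` and the `model(x, params)` call signature are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

DataPoint = Tuple[float, float]  # (x, observed y)
Model = Callable[[float, Sequence[float]], float]


def chunk_sse(model: Model, params: Sequence[float], chunk: Sequence[DataPoint]) -> float:
    """Sum of squared residuals over one chunk of the data."""
    return sum((model(x, params) - y) ** 2 for x, y in chunk)


def parallel_objective(
    model: Model,
    params: Sequence[float],
    data: Sequence[DataPoint],
    n_workers: int = 4,
) -> float:
    """Objective function whose partial sums are computed by a worker pool."""
    size = max(1, -(-len(data) // n_workers))  # ceiling division
    chunks: List[Sequence[DataPoint]] = [
        data[i : i + size] for i in range(0, len(data), size)
    ]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = pool.map(lambda c: chunk_sse(model, params, c), chunks)
        return sum(partial_sums)
```

An unmodified sequential Nelder-Mead routine could then call `parallel_objective` at every simplex vertex. In CPython, threads give little speedup for CPU-bound residuals, so a process pool or MPI ranks would be the natural replacement in practice.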
DEVELOPMENT OF THE ENERGY MANAGEMENT STRATEGY FOR PARALLEL HYBRID ELECTRIC URBAN BUSES
Institute of Scientific and Technical Information of China (English)
HUANG Yuanjun; YIN Chengliang; ZHANG Jianwu
2008-01-01
A novel parallel hybrid electric urban bus (PHEUB) configuration consisting of an extra one-way clutch and an automatic mechanical transmission (AMT) is taken as the study subject. An energy management strategy combining a logic threshold approach and an instantaneous optimization algorithm is proposed for the investigated PHEUB. The objective of the energy management strategy is to achieve acceptable vehicle performance and drivability requirements while simultaneously minimizing the engine fuel consumption and maintaining the battery state of charge in its operation range at all times. In the Matlab/Simulink environment, a computer simulation model for the PHEUB is constructed using a model building method that combines theoretical analysis and bench test data. Simulation and experiment results for the China Typical Bus Driving Schedule at Urban District (CTBDS_UD) are obtained, and the results indicate that the proposed control strategy not only controls the hybrid system efficiently but also improves the fuel economy significantly.
Hybrid Systems: Computation and Control.
2007-11-02
elbow) and a pinned first joint (shoulder) (see Figure 2); it is termed an underactuated system since it is a mechanical system with fewer...Montreal, PQ, Canada, 1998. [10] M. W. Spong. Partial feedback linearization of underactuated mechanical systems. In Proceedings, IROS, pages 314-321...control mechanism and search for optimal combinations of control variables. Besides the nonlinear and hybrid nature of powertrain systems, hardware
Parallel algorithms and architectures for computational structural mechanics
Patrick, Merrell; Ma, Shing; Mahajan, Umesh
1989-01-01
The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallel computation.
Misleading Performance Claims in Parallel Computations
Energy Technology Data Exchange (ETDEWEB)
Bailey, David H.
2009-05-29
In a previous humorous note entitled 'Twelve Ways to Fool the Masses,' I outlined twelve common ways in which performance figures for technical computer systems can be distorted. In this paper and accompanying conference talk, I give a reprise of these twelve 'methods' and give some actual examples that have appeared in peer-reviewed literature in years past. I then propose guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion, not only in the world of device simulation but also in the larger arena of technical computing.
Global seismic tomography and modern parallel computers
Directory of Open Access Journals (Sweden)
A. Piersanti
2006-06-01
Rapid technological progress is providing seismic tomographers with computers of rapidly increasing speed and RAM, which are not always properly taken advantage of. Large computers with both shared-memory and distributed-memory architectures have made it possible to approach the tomographic inverse problem more accurately. For example, resolution can be quantified from the resolution matrix rather than checkerboard tests; the covariance matrix can be calculated to evaluate the propagation of errors from data to model parameters; the L-curve method can be applied to determine a range of acceptable regularization schemes. We show how these exercises can be implemented efficiently on different hardware architectures.
Visual animation of parallel algorithms for matrix computations
Energy Technology Data Exchange (ETDEWEB)
Heath, M.T.
1990-01-01
In this talk we show how graphical animation of the behavior of parallel algorithms can facilitate the design and performance enhancement of algorithms for matrix computations on parallel computer architectures. Using a portable instrumented communication library and a graphical animation package developed at Oak Ridge National Laboratory, we illustrate the effects of various strategies in parallel algorithm design, including interconnection topologies, global communication patterns, data mapping schemes, load balancing, and pipelining techniques for overlapping communication with computation. In this talk we focus on distributed-memory parallel architectures in which the processors communicate by passing messages. The linear algebra problems we consider include matrix factorization and the solution of triangular systems. 4 refs., 12 figs.
Dynamic Distribution Model with Prime Granularity for Parallel Computing
Institute of Scientific and Technical Information of China (English)
(none)
2005-01-01
The dynamic distribution model is one of the best schemes for parallel volume rendering. However, in a homogeneous cluster system, since the granularity is traditionally identical, all processors communicate almost simultaneously and the computation load may lose balance. To address these problems, a dynamic distribution model with prime granularity for parallel computing is presented. The granularities assigned to the processors are relatively prime, and the related theories are introduced. High parallel performance can be achieved by minimizing network competition and using a load balancing strategy that ensures all processors finish almost simultaneously. Based on the Master-Slave-Gleaner (MSG) scheme, the parallel Splatting Algorithm for volume rendering is used to test the model on an IBM Cluster 1350 system. The experimental results show that the model can bring a considerable improvement in performance, including computation efficiency, total execution time, speed, and load balancing.
Identifying failure in a tree network of a parallel computer
Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.
2010-08-24
Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.
Models of parallel computation: a survey and classification
Institute of Scientific and Technical Information of China (English)
ZHANG Yunquan; CHEN Guoliang; SUN Guangzhong; MIAO Qiankun
2007-01-01
In this paper, the state-of-the-art parallel computational model research is reviewed. We will introduce various models that were developed during the past decades. According to their targeting architecture features, especially memory organization, we classify these parallel computational models into three generations. These models and their characteristics are discussed based on the three-generation classification. We believe that with the ever increasing speed gap between the CPU and memory systems, incorporating non-uniform memory hierarchy into computational models will become unavoidable. With the emergence of multi-core CPUs, the parallelism hierarchy of current computing platforms becomes more and more complicated. Describing this complicated parallelism hierarchy in future computational models becomes more and more important. A semi-automatic toolkit that can extract model parameters and their values on real computers can reduce the model analysis complexity, thus allowing more complicated models with more parameters to be adopted. Hierarchical memory and hierarchical parallelism will be two very important features that should be considered in future model design and research.
A microeconomic scheduler for parallel computers
Stoica, Ion; Abdel-Wahab, Hussein; Pothen, Alex
1995-01-01
We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.
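The account mechanism described in this abstract can be illustrated with a minimal sketch. The class and function names are hypothetical, and the proportional-share rule in `allocate` is an assumption made for illustration; the abstract does not state the exact allocation rule used by the authors.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    income_rate: float                         # money received per tick (assumed unit)
    savings: float = 0.0                       # the user's savings account
    jobs: dict = field(default_factory=dict)   # job name -> expense account balance

    def tick(self):
        # Each user receives money at a constant rate.
        self.savings += self.income_rate

    def fund(self, job, amount):
        # Transfer money from savings into the job's expense account,
        # never more than the savings account holds.
        amount = min(amount, self.savings)
        self.savings -= amount
        self.jobs[job] = self.jobs.get(job, 0.0) + amount
        return amount


def allocate(users, capacity):
    # Assumed rule: split `capacity` processors among jobs in proportion
    # to the funds currently held in each job's expense account.
    bids = {(name, job): balance
            for name, user in users.items()
            for job, balance in user.jobs.items() if balance > 0}
    total = sum(bids.values())
    if total == 0:
        return {}
    return {key: capacity * bid / total for key, bid in bids.items()}


users = {"a": User(income_rate=1.0), "b": User(income_rate=3.0)}
for u in users.values():
    u.tick()
    u.fund("j", u.savings)
shares = allocate(users, capacity=8)
```

Under this assumed rule, a user with three times the income rate obtains three times the share, which mirrors the abstract's statement that the allocated share is directly related to the rate at which the user receives money.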
History Matching in Parallel Computational Environments
Energy Technology Data Exchange (ETDEWEB)
Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav
2005-10-01
A novel methodology for delineating multiple reservoir domains for the purpose of history matching in a distributed computing environment has been proposed. A fully probabilistic approach to perturb permeability within the delineated zones is implemented. The combination of robust schemes for identifying reservoir zones and distributed computing significantly increases the accuracy and efficiency of the probabilistic approach. The information pertaining to the permeability variations in the reservoir that is contained in dynamic data is calibrated in terms of a deformation parameter rD. This information is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, well configuration, flow constraints, etc. The probabilistic approach then has to account for multiple rD values in different regions of the reservoir. In order to delineate reservoir domains that can be characterized with different rD parameters, principal component analysis (PCA) of the Hessian matrix has been done. The Hessian matrix summarizes the sensitivity of the objective function at a given step of the history matching to model parameters. It also measures the interaction of the parameters in affecting the objective function. The basic premise of PC analysis is to isolate the most sensitive and least correlated regions. The eigenvectors obtained during the PCA are suitably scaled and appropriate grid block volume cut-offs are defined such that the resultant domains are neither too large (which increases interactions between domains) nor too small (implying ineffective history matching). The delineation of domains requires calculation of the Hessian, which could be computationally costly as well as restricts the current approach to
Exploiting Maximum Parallelism in Loop Using Heterogeneous Computing
Institute of Scientific and Technical Information of China (English)
ZENG Guosun
2001-01-01
In this paper, we present the definition of maximum loop speedup, which is the metric of parallelism hidden in a loop body. We also study the classes of Do-loop and their dependence as well as the parallelism they contain. How can such parallelism be exploited in a heterogeneous computing environment? The paper proposes several approaches: eliminating the serial bottleneck by means of heterogeneous computing, heterogeneous Do-all-loop scheduling, and heterogeneous Do-a-cross scheduling. We find that, in both theoretical analysis and experimental results, these schemes achieve better performance than homogeneous computing.
Parallel Computational Fluid Dynamics: Current Status and Future Requirements
Simon, Horst D.; VanDalsem, William R.; Dagum, Leonardo; Kutler, Paul (Technical Monitor)
1994-01-01
One of the key objectives of the Applied Research Branch in the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center is the accelerated introduction of highly parallel machines into a full operational environment. In this report we discuss the performance results obtained from the implementation of some computational fluid dynamics (CFD) applications on the Connection Machine CM-2 and the Intel iPSC/860. We summarize some of the experiences made so far with the parallel testbed machines at the NAS Applied Research Branch. Then we discuss the long term computational requirements for accomplishing some of the grand challenge problems in computational aerosciences. We argue that only massively parallel machines will be able to meet these grand challenge requirements, and we outline the computer science and algorithm research challenges ahead.
Directory of Open Access Journals (Sweden)
Zulkifli Saiful A.
2016-01-01
Parallel hybrid electric vehicles (HEV) can be classified according to the location of the electric motor with respect to the transmission unit for the internal combustion engine (ICE): they can be pre-transmission or post-transmission parallel hybrids. A split-axle parallel HEV – in which the ICE and electric motor provide propulsion power to different axles – is a sub-type of the post-transmission hybrid, since addition of torque and power from the two power sources occurs after the vehicle’s transmission. The term ‘through-the-road’ (TTR) hybrid is also used for the split-parallel HEV, since power coupling between the ICE and electric motor is not through some mechanical device but through the vehicle itself, its wheels and the road on which it moves. The present work presents the torque-speed relationship of the split-parallel hybrid and analyses simulation results of torque profiles and acceleration performance of pre-transmission and post-transmission hybrid configurations, using three different sizes of electric motor. Different operating regions of the pre-trans and post-trans motors are observed, leading to different speed and torque profiles. Although ICE average efficiency in the post-trans hybrid is slightly lower than in the pre-trans hybrid, the post-trans hybrid vehicle has better fuel economy and acceleration performance than the pre-trans hybrid vehicle.
3D magnetospheric parallel hybrid multi-grid method applied to planet–plasma interactions
Energy Technology Data Exchange (ETDEWEB)
Leclercq, L., E-mail: ludivine.leclercq@latmos.ipsl.fr [LATMOS/IPSL, UVSQ Université Paris-Saclay, UPMC Univ. Paris 06, CNRS, Guyancourt (France); Modolo, R., E-mail: ronan.modolo@latmos.ipsl.fr [LATMOS/IPSL, UVSQ Université Paris-Saclay, UPMC Univ. Paris 06, CNRS, Guyancourt (France); Leblanc, F. [LATMOS/IPSL, UPMC Univ. Paris 06 Sorbonne Universités, UVSQ, CNRS, Paris (France); Hess, S. [ONERA, Toulouse (France); Mancini, M. [LUTH, Observatoire Paris-Meudon (France)
2016-03-15
We present a new method to exploit multiple refinement levels within a 3D parallel hybrid model, developed to study planet–plasma interactions. This model is based on the hybrid formalism: ions are kinetically treated whereas electrons are considered as an inertia-less fluid. Generally, ions are represented by numerical particles whose size equals the volume of the cells. Particles that leave a coarse grid subsequently entering a refined region are split into particles whose volume corresponds to the volume of the refined cells. The number of refined particles created from a coarse particle depends on the grid refinement rate. In order to conserve velocity distribution functions and to avoid calculations of average velocities, particles are not coalesced. Moreover, to ensure the constancy of particles' shape function sizes, the hybrid method is adapted to allow refined particles to move within a coarse region. Another innovation of this approach is the method developed to compute grid moments at interfaces between two refinement levels. Indeed, the hybrid method is adapted to accurately account for the special grid structure at the interfaces, avoiding any overlapping grid considerations. Some fundamental test runs were performed to validate our approach (e.g. quiet plasma flow, Alfven wave propagation). Lastly, we also show a planetary application of the model, simulating the interaction between Jupiter's moon Ganymede and the Jovian plasma.
Simulation of Parallel Logical Operations with Biomolecular Computing
Directory of Open Access Journals (Sweden)
Mahnaz Kadkhoda
2008-01-01
Biomolecular computing is the computational method that uses the potential of DNA as a parallel computing device. DNA computing can be used to solve NP-complete problems. An appropriate application of DNA computation is large-scale evaluation of parallel computation models such as Boolean circuits. In this study, we present a molecular-based algorithm for evaluation of NAND-based Boolean circuits. The contribution of this paper is that the proposed algorithm has been implemented using only three molecular operations, and the number of passes in each level is decreased to less than half of that previously reported in the literature. Thus, the proposed algorithm is much easier to implement in the laboratory.
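For readers unfamiliar with level-by-level circuit evaluation, the following is a minimal software sketch of evaluating a NAND-based Boolean circuit one level at a time. It is a conventional simulation, not the molecular algorithm of the paper; the function name and the circuit encoding (a list of levels, each mapping a gate name to its two input names) are assumptions made for illustration.

```python
def eval_nand_circuit(inputs, levels):
    # `inputs` maps input names to booleans.
    # `levels` is a list of dicts: gate name -> (left input name, right input name).
    values = dict(inputs)
    for level in levels:
        # Gates in one level depend only on earlier levels, so they are
        # independent of each other and could be evaluated in parallel.
        outputs = {gate: not (values[a] and values[b])
                   for gate, (a, b) in level.items()}
        values.update(outputs)
    return values


# XOR built from four NAND gates, arranged in three levels.
xor_levels = [
    {"n1": ("x", "y")},
    {"n2": ("x", "n1"), "n3": ("y", "n1")},
    {"out": ("n2", "n3")},
]
result = eval_nand_circuit({"x": True, "y": False}, xor_levels)["out"]
```

The independence of gates within a level is what makes this kind of evaluation a natural fit for massively parallel substrates.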
Estimating the overlap between dependent computations for automatic parallelization
Bone, Paul; Schachte, Peter
2011-01-01
Researchers working on the automatic parallelization of programs have long known that too much parallelism can be even worse for performance than too little, because spawning a task to be run on another CPU incurs overheads. Autoparallelizing compilers have therefore long tried to use granularity analysis to ensure that they only spawn off computations whose cost will probably exceed the spawn-off cost by a comfortable margin. However, this is not enough to yield good results, because data dependencies may also limit the usefulness of running computations in parallel. If one computation blocks almost immediately and can resume only after another has completed its work, then the cost of parallelization again exceeds the benefit. We present a set of algorithms for recognizing places in a program where it is worthwhile to execute two or more computations in parallel that pay attention to the second of these issues as well as the first. Our system uses profiling information to compute the times at which a ...
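The trade-off described in this abstract can be made concrete with a toy cost model. This is a simplified sketch under stated assumptions, not the authors' algorithm: one producer and one consumer share a single value, the producer makes it available `produce_at` time units into its run, the consumer first needs it `need_at` time units into its own run, and all names are hypothetical.

```python
def parallel_time(t_prod, produce_at, t_cons, need_at, spawn_cost):
    # The consumer runs until it needs the shared value, then waits
    # (if necessary) until the producer has produced it.
    resume = max(need_at, produce_at)
    consumer_finish = resume + (t_cons - need_at)
    # Both computations must finish; spawning adds a fixed overhead.
    return max(t_prod, consumer_finish) + spawn_cost


def worth_parallelizing(t_prod, produce_at, t_cons, need_at, spawn_cost):
    sequential_time = t_prod + t_cons
    return parallel_time(t_prod, produce_at, t_cons, need_at, spawn_cost) < sequential_time


# Consumer blocks immediately and the value arrives only at the very end:
# the two runs do not overlap, so spawning only adds overhead.
no_overlap = worth_parallelizing(10, 10, 10, 0, 1)
# Value is produced early and needed late: the runs overlap almost entirely.
good_overlap = worth_parallelizing(10, 1, 10, 9, 1)
```

In the first case the estimated parallel time (21) exceeds the sequential time (20); in the second it drops to 11, which is the situation where parallel execution pays off.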
Simulating the Immune Response on a Distributed Parallel Computer
Castiglione, F.; Bernaschi, M.; Succi, S.
The application of ideas and methods of statistical mechanics to problems of biological relevance is one of the most promising frontiers of theoretical and computational mathematical physics [1,2]. Among others, the computer simulation of the immune system dynamics stands out as one of the prominent candidates for this type of investigation. In recent years immunological research has been drawing increasing benefits from the resort to advanced mathematical modeling on modern computers [3,4]. Among others, Cellular Automata (CA), i.e., fully discrete dynamical systems evolving according to boolean laws, appear to be extremely well suited to computer simulation of biological systems [5]. A prominent example of immunological CA is represented by the Celada-Seiden automaton, which has proven capable of providing several new insights into the dynamics of the immune system response. To date, the Celada-Seiden automaton was not in a position to exploit the impressive advances of computer technology, and notably parallel processing, simply because no parallel version of this automaton had been developed yet. In this paper we fill this gap and describe a parallel version of the Celada-Seiden cellular automaton aimed at simulating the dynamic response of the immune system. Details on the parallel implementation as well as performance data on the IBM SP2 parallel platform are presented and commented on.
A Prototype Embedded Microprocessor Interconnect for Distributed and Parallel Computing
Directory of Open Access Journals (Sweden)
Bryan Hughes
2008-08-01
Parallel computing is currently undergoing a transition from a niche use to widespread acceptance due to new, computationally intensive applications and multi-core processors. While parallel processing is an invaluable tool for increasing performance, more time and expertise are required to develop a parallel system than are required for sequential systems. This paper discusses a toolkit currently in development that will simplify both the hardware and software development of embedded distributed and parallel systems. The hardware interconnection mechanism uses the Serial Peripheral Interface as a physical medium and provides routing and management services for the system. The topics in this paper are primarily limited to the interconnection aspect of the toolkit.
Modeling groundwater flow on massively parallel computers
Energy Technology Data Exchange (ETDEWEB)
Ashby, S.F.; Falgout, R.D.; Fogwell, T.W.; Tompson, A.F.B.
1994-12-31
The authors will explore the numerical simulation of groundwater flow in three-dimensional heterogeneous porous media. An interdisciplinary team of mathematicians, computer scientists, hydrologists, and environmental engineers is developing a sophisticated simulation code for use on workstation clusters and MPPs. To date, they have concentrated on modeling flow in the saturated zone (single phase), which requires the solution of a large linear system. They will discuss their implementation of preconditioned conjugate gradient solvers. The preconditioners under consideration include simple diagonal scaling, s-step Jacobi, adaptive Chebyshev polynomial preconditioning, and multigrid. They will present some preliminary numerical results, including simulations of groundwater flow at the LLNL site. They also will demonstrate the code's scalability.
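As an illustration of the simplest preconditioner named above, here is a minimal sketch of the conjugate gradient method with diagonal scaling for a symmetric positive definite system. It is a textbook formulation on a small dense matrix stored as nested lists, not the authors' code; the function name and default tolerances are assumptions.

```python
def pcg_diagonal(A, b, tol=1e-10, max_iter=1000):
    # Solve A x = b for symmetric positive definite A using conjugate
    # gradients preconditioned by the inverse of the diagonal of A.
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x


solution = pcg_diagonal([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In a distributed setting the matrix-vector product and the dot products are the operations that require communication, while the diagonal scaling step is purely local, which is one reason it is a common baseline preconditioner.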
Accelerated Matrix Element Method with Parallel Computing
Schouten, Doug; Stelzer, Bernd
2014-01-01
The matrix element method utilizes ab initio calculations of probability densities as powerful discriminants for processes of interest in experimental particle physics. The method has already been used successfully at previous and current collider experiments. However, the computational complexity of this method for final states with many particles and degrees of freedom sets it at a disadvantage compared to supervised classification methods such as decision trees, k nearest-neighbour, or neural networks. This note presents a concrete implementation of the matrix element technique using graphics processing units. Due to the intrinsic parallelizability of multidimensional integration, dramatic speedups can be readily achieved, which makes the matrix element technique viable for general usage at collider experiments.
Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment
Directory of Open Access Journals (Sweden)
Wenbo Wang
2012-09-01
As an important task of data mining, classification has received considerable attention in many applications, such as information retrieval, web searching, etc. The growing volume of information produced by the progress of technology, together with the growing individual needs of data mining, makes classifying very large scale data a challenging task. In order to deal with the problem, many researchers try to design efficient parallel classification algorithms. This paper briefly introduces classification algorithms and cloud computing, analyses on that basis the shortcomings of present parallel classification algorithms, and then presents a new model of parallel classification algorithms. It mainly introduces a parallel Naïve Bayes classification algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm improves on the performance of the original algorithm, and that it can process large datasets efficiently on commodity hardware.
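To show how Naïve Bayes training maps onto the MapReduce pattern, the sketch below simulates the map and reduce phases in a single process. It is an illustrative assumption of how such a decomposition could look, not the algorithm from the paper; the key layout, function names, and the Laplace smoothing parameter `n_values` are all hypothetical.

```python
from collections import defaultdict
from math import log


def map_record(features, label):
    # Map phase: emit one count for the class and one per (class, feature, value).
    yield ("class", label), 1
    for index, value in enumerate(features):
        yield ("feat", label, index, value), 1


def reduce_counts(pairs):
    # Reduce phase: sum the counts emitted for each key.
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return counts


def train(records):
    pairs = (kv for feats, label in records for kv in map_record(feats, label))
    return reduce_counts(pairs)


def predict(counts, features, n_values=3):
    # n_values: assumed number of distinct values per feature (for smoothing).
    labels = [key[1] for key in counts if key[0] == "class"]
    total = sum(counts[("class", c)] for c in labels)

    def score(c):
        s = log(counts[("class", c)] / total)
        for index, value in enumerate(features):
            seen = counts.get(("feat", c, index, value), 0)
            s += log((seen + 1) / (counts[("class", c)] + n_values))
        return s

    return max(labels, key=score)


records = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
]
model = train(records)
guess = predict(model, ("rainy", "cool"))
```

Because the map step looks at one record at a time and the reduce step only sums counts, the training data can be split across machines without changing the resulting model.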
Toward an automated parallel computing environment for geosciences
Zhang, Huai; Liu, Mian; Shi, Yaolin; Yuen, David A.; Yan, Zhenzhen; Liang, Guoping
2007-08-01
Software for geodynamic modeling has not kept up with the fast growing computing hardware and network resources. In the past decade supercomputing power has become available to most researchers in the form of affordable Beowulf clusters and other parallel computer platforms. However, to take full advantage of such computing power requires developing parallel algorithms and associated software, a task that is often too daunting for geoscience modelers whose main expertise is in geosciences. We introduce here an automated parallel computing environment built on open-source algorithms and libraries. Users interact with this computing environment by specifying the partial differential equations, solvers, and model-specific properties using an English-like modeling language in the input files. The system then automatically generates the finite element codes that can be run on distributed or shared memory parallel machines. This system is dynamic and flexible, allowing users to address different problems in geosciences. It is capable of providing web-based services, enabling users to generate source codes online. This unique feature will facilitate high-performance computing to be integrated with distributed data grids in the emerging cyber-infrastructures for geosciences. In this paper we discuss the principles of this automated modeling environment and provide examples to demonstrate its versatility.
Parallel computation of seismic analysis of high arch dam
Institute of Scientific and Technical Information of China (English)
Chen Houqun; Ma Huaifa; Tu Jin; Cheng Guangqing; Tang Juzhen
2008-01-01
Parallel computation programs are developed for three-dimensional meso-mechanics analysis of fully-graded dam concrete and seismic response analysis of high arch dams (ADs), based on the Parallel Finite Element Program Generator (PFEPG). The computational algorithms of the numerical simulation of the meso-structure of concrete specimens were studied. Taking into account damage evolution, static preload, strain rate effect, and the heterogeneity of the meso-structure of dam concrete, the fracture processes of damage evolution and configuration of the cracks can be directly simulated. In the seismic response analysis of ADs, all the following factors are involved, such as the nonlinear contact due to the opening and slipping of the contraction joints, energy dispersion of the far-field foundation, dynamic interactions of the dam-foundation-reservoir system, and the combining effects of seismic action with all static loads. The correctness, reliability and efficiency of the two parallel computational programs are verified with practical illustrations.
A High-Performance Communication Service for Parallel Servo Computing
Directory of Open Access Journals (Sweden)
Cheng Xin
2010-11-01
The complexity of servo control algorithms in multi-dimensional, ultra-precise stage applications has made multi-processor parallel computing technology necessary. Considering the specific communication requirements of parallel servo computing, we propose a communication service scheme based on the VME bus, which provides high-performance data transmission and precise synchronization trigger support for the processors involved. The communication service is implemented on both the standard VME bus and a user-defined Internal Bus (IB), and can be redefined online. This paper introduces the parallel servo computing architecture and communication service, describes the structure and implementation details of each module in the service, and finally provides a data transmission model and analysis. Experimental results show that the communication service can provide high-speed data transmission with sub-nanosecond-level error of transmission latency, and synchronous trigger with nanosecond-level synchronization error. Moreover, the performance of the communication service is not affected by an increasing number of processors.
CFD Analysis and Design Optimization Using Parallel Computers
Martinelli, Luigi; Alonso, Juan Jose; Jameson, Antony; Reuther, James
1997-01-01
A versatile and efficient multi-block method is presented for the simulation of both steady and unsteady flow, as well as aerodynamic design optimization of complete aircraft configurations. The compressible Euler and Reynolds Averaged Navier-Stokes (RANS) equations are discretized using a high resolution scheme on body-fitted structured meshes. An efficient multigrid implicit scheme is implemented for time-accurate flow calculations. Optimum aerodynamic shape design is achieved at very low cost using an adjoint formulation. The method is implemented on parallel computing systems using the MPI message passing interface standard to ensure portability. The results demonstrate that, by combining highly efficient algorithms with parallel computing, it is possible to perform detailed steady and unsteady analysis as well as automatic design for complex configurations using the present generation of parallel computers.
Exact Computation of Parallel Robot's Generalized Inertia Matrix
Institute of Scientific and Technical Information of China (English)
ZHAO Yongjie; YANG Zhiyong; MEI Jiangping; HUANG Tian
2005-01-01
According to the definition of the new hypothetical states, which have obvious physical significance and are termed no-gravity static and accelerated states, a method for exact computation of the parallel robot's generalized inertia matrix is presented. Based on matrix theory, the generalized inertia matrix of the parallel robot can be computed on the assumption that the robot is in these new hypothetical states respectively. The approach is demonstrated with the Delta robot as an example. Based on the principle of virtual work, the inverse dynamics model of the robot is formulated after the kinematics analysis. Finally, a numerical example is given and the element distribution of the Delta robot's inertia matrix in the workspace is studied. The method has the computational advantage of numerical accuracy for the Delta robot and can be parallelized easily.
PARASURG hybrid parallel robot for minimally invasive surgery.
Pisla, D; Gherman, B; Plitea, N; Gyurka, B; Vaida, C; Vlad, L; Graur, F; Radu, C; Suciu, M; Szilaghi, A; Stoica, A
2011-01-01
This paper presents the parallel hybrid robot, PARASURG 9M, for robotically assisted surgery, a robot which was entirely designed and produced in Romania. It is a versatile robot, being composed of a positioning and orientation module, PARASURG 5M with five degrees of freedom, having the possibility of attaching at its end either a laparoscope or an active surgical instrument for cutting/grasping, PARASIM, with four degrees of freedom. Based on its mathematical modelling, the first low-cost experimental model of the surgical robot has been built. The robot is part of the surgical robotic system, PARAMIS, with three arms, one used as a laparoscope holder, and other two for manipulating active instruments. When it is used as a manipulator of the camera, the user has the possibility to give commands in a large area for the positioning of the laparoscope using different interfaces: joystick, microphone, keyboard & mouse and haptic device. If the active surgical instrument, PARASIM, is attached, the robot commands are given through a haptic device. The main features that make the PARASURG 9M surgical robot suited for minimally invasive surgery are: precision, the elimination of the natural tremor of the surgeon, direct control over a smooth, precise, stable view of the internal surgical field for the surgeon. It also eliminates the need of a second surgeon to be present for the entire procedure (in the case of using the robot as a camera holder). In addition, there is improvement of surgeon dexterity in the case of using the PARASIM active instrument and better ergonomics in using the robot (in the case of the classic laparoscopy, the surgeon must adopt a difficult position for a long period of time, while the robot never gets tired). Having a relatively easy to understand, intuitive commanding system, the surgeons can rapidly adapt to the use of the PARASURG 9M robot in surgical procedures.
IPython: components for interactive and parallel computing across disciplines. (Invited)
Perez, F.; Bussonnier, M.; Frederic, J. D.; Froehle, B. M.; Granger, B. E.; Ivanov, P.; Kluyver, T.; Patterson, E.; Ragan-Kelley, B.; Sailer, Z.
2013-12-01
Scientific computing is an inherently exploratory activity that requires constantly cycling between code, data and results, each time adjusting the computations as new insights and questions arise. To support such a workflow, good interactive environments are critical. The IPython project (http://ipython.org) provides a rich architecture for interactive computing with: 1. Terminal-based and graphical interactive consoles. 2. A web-based Notebook system with support for code, text, mathematical expressions, inline plots and other rich media. 3. Easy to use, high performance tools for parallel computing. Despite its roots in Python, the IPython architecture is designed in a language-agnostic way to facilitate interactive computing in any language. This allows users to mix Python with Julia, R, Octave, Ruby, Perl, Bash and more, as well as to develop native clients in other languages that reuse the IPython clients. In this talk, I will show how IPython supports all stages in the lifecycle of a scientific idea: 1. Individual exploration. 2. Collaborative development. 3. Production runs with parallel resources. 4. Publication. 5. Education. In particular, the IPython Notebook provides an environment for "literate computing" with a tight integration of narrative and computation (including parallel computing). These Notebooks are stored in a JSON-based document format that provides an "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching.
State of the Art in Parallel Computing with R
Directory of Open Access Journals (Sweden)
Markus Schmidberger
2009-06-01
R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly suited to general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems five different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.
Small file aggregation in a parallel computing system
Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Zhang, Jingwang
2014-09-02
Techniques are provided for small file aggregation in a parallel computing system. An exemplary method for storing a plurality of files generated by a plurality of processes in a parallel computing system comprises aggregating the plurality of files into a single aggregated file; and generating metadata for the single aggregated file. The metadata comprises an offset and a length of each of the plurality of files in the single aggregated file. The metadata can be used to unpack one or more of the files from the single aggregated file.
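The aggregation scheme described in this abstract can be illustrated with a minimal in-memory sketch. This is not the patented method: the function names `aggregate` and `unpack` and the dictionary-based metadata layout are assumptions made only for illustration. It shows the one idea the abstract states, namely that an (offset, length) pair per file is enough to recover any file from the single aggregated file.

```python
def aggregate(files):
    """Concatenate small files into one blob.

    `files` maps a file name to its bytes (hypothetical input format).
    Returns the blob and metadata mapping each name to (offset, length).
    """
    blob = bytearray()
    metadata = {}
    for name, data in files.items():
        metadata[name] = (len(blob), len(data))  # offset is the current blob size
        blob += data
    return bytes(blob), metadata


def unpack(blob, metadata, name):
    """Recover one file from the aggregated blob using its metadata entry."""
    offset, length = metadata[name]
    return blob[offset:offset + length]


files = {"rank0.out": b"hello", "rank1.out": b"", "rank2.out": b"world!"}
blob, metadata = aggregate(files)
restored = unpack(blob, metadata, "rank2.out")
```

Keeping the metadata separate from the payload means a reader can extract one file without scanning the rest of the aggregate.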
Genetic Algorithm Modeling with GPU Parallel Computing Technology
Cavuoti, Stefano; Brescia, Massimo; Pescapé, Antonio; Longo, Giuseppe; Ventre, Giorgio
2012-01-01
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from a multi-core CPU serial implementation, named GAME, already scientifically successfully tested and validated on astrophysical massive data classification problems, through a web application resource (DAMEWARE), specialized in data mining based on Machine Learning paradigms. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm has provided an exploit of the internal training features of the model, permitting a strong optimization in terms of processing performances and scalability.
Energy Technology Data Exchange (ETDEWEB)
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2016-03-15
Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
Applications of parallel supercomputers: Scientific results and computer science lessons
Energy Technology Data Exchange (ETDEWEB)
Fox, G.C.
1989-07-12
Parallel computing has come of age with several commercial and in-house systems that deliver supercomputer performance. We illustrate this with several major computations completed or underway at Caltech on hypercubes, transputer arrays and the SIMD Connection Machine CM-2 and AMT DAP. Applications covered are lattice gauge theory, computational fluid dynamics, subatomic string dynamics, statistical and condensed matter physics, theoretical and experimental astronomy, quantum chemistry, plasma physics, grain dynamics, computer chess, graphics ray tracing, and Kalman filters. We use these applications to compare the performance of several advanced architecture computers including the conventional CRAY and ETA-10 supercomputers. We describe which problems are suitable for which computers in terms of a matching between problem and computer architecture. This is part of a set of lessons we draw for hardware, software, and performance. We speculate on the emergence of new academic disciplines motivated by the growing importance of computers. 138 refs., 23 figs., 10 tabs.
Requirements for supercomputing in energy research: The transition to massively parallel computing
Energy Technology Data Exchange (ETDEWEB)
1993-02-01
This report discusses: The emergence of a practical path to TeraFlop computing and beyond; requirements of energy research programs at DOE; implementation: supercomputer production computing environment on massively parallel computers; and implementation: user transition to massively parallel computing.
A design methodology for portable software on parallel computers
Nicol, David M.; Miller, Keith W.; Chrisman, Dan A.
1993-01-01
This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. Trying to keep one's software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks is specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured
Parallel Computing for the Computed-Tomography Imaging Spectrometer
Lee, Seungwon
2008-01-01
This software computes the tomographic reconstruction of spatial-spectral data from raw detector images of the Computed-Tomography Imaging Spectrometer (CTIS), which enables transient-level, multi-spectral imaging by capturing spatial and spectral information in a single snapshot.
A UNIVERSAL ALGORITHM FOR PARALLEL CRC COMPUTATION AND ITS IMPLEMENTATION
Institute of Scientific and Technical Information of China (English)
Xu Zhanqi; Yi Kechu; Liu Zengji
2006-01-01
Derived from a proposed universal mathematical expression, this paper investigates a novel algorithm for parallel Cyclic Redundancy Check (CRC) computation: an iterative algorithm that updates the check-bit sequence step by step and accommodates various argument selections of CRC computation. The proposed algorithm is well suited to hardware implementation. Simulation and performance analysis suggest that it could efficiently speed up the computation compared with conventional algorithms. The algorithm is implemented in hardware at rates as high as 21 Gbps, which implies its usefulness in high-speed CRC computations such as those in Asynchronous Transfer Mode (ATM) networks and 10G Ethernet.
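As a software analogue of updating the check bits several input bits at a time, the sketch below compares a bit-serial CRC-32 with a table-driven version that consumes eight bits per step. This is the conventional lookup-table technique, not the universal algorithm proposed in the paper; the choice of the reflected IEEE 802.3 CRC-32 polynomial and all function names are assumptions made for illustration.

```python
POLY = 0xEDB88320  # reflected CRC-32 polynomial (IEEE 802.3), chosen for illustration


def crc32_bitwise(data):
    """Bit-serial CRC-32: one shift and conditional XOR per input bit."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Precompute the effect of eight consecutive bit steps for every byte value.
TABLE = []
for value in range(256):
    reg = value
    for _ in range(8):
        reg = (reg >> 1) ^ (POLY if reg & 1 else 0)
    TABLE.append(reg)


def crc32_bytewise(data):
    """Table-driven CRC-32: the check bits advance eight input bits per step."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


message = b"123456789"
serial = crc32_bitwise(message)
parallel = crc32_bytewise(message)
```

Both functions return the same checksum; the second trades a 256-entry table for an eightfold reduction in update steps, which is the same trade a wider hardware datapath makes.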
Lattice gauge theory on the Intel parallel scientific computer
Energy Technology Data Exchange (ETDEWEB)
Gottlieb, S. (Department of Physics, Indiana University, Bloomington, IN (USA))
1990-08-01
Intel Scientific Computers (ISC) has just started producing its third generation of parallel computer, the iPSC/860. Based on the i860 chip that has a peak performance of 80 Mflops and with a current maximum of 128 nodes, this computer should achieve speeds in excess of those obtainable on conventional vector supercomputers. The hardware, software and computing techniques appropriate for lattice gauge theory calculations are described. The differences between a staggered fermion conjugate gradient program written under CANOPY and for the iPSC are detailed.
Parallelism in computations in quantum and statistical mechanics
Clementi, E.; Corongiu, G.; Detrich, J. H.
1985-07-01
Often very fundamental biochemical and biophysical problems defy simulations because of limitations in today's computers. We present and discuss a distributed system composed of two IBM 4341s and/or an IBM 4381 as front-end processors and ten FPS-164 attached array processors. This parallel system - called LCAP - has presently a peak performance of about 110 Mflops; extensions to higher performance are discussed. Presently, the system applications use a modified version of VM/SP as the operating system: description of the modifications is given. Three applications programs have been migrated from sequential to parallel: a molecular quantum mechanical, a Metropolis-Monte Carlo and a molecular dynamics program. Descriptions of the parallel codes are briefly outlined. Use of these parallel codes has already opened up new capabilities for our research. The very positive performance comparisons with today's supercomputers allow us to conclude that parallel computers and programming, of the type we have considered, represent a pragmatic answer to many computationally intensive problems.
The science of computing - The evolution of parallel processing
Denning, P. J.
1985-01-01
The present paper is concerned with the approaches to be employed to overcome the set of limitations in software technology which impedes currently an effective use of parallel hardware technology. The process required to solve the arising problems is found to involve four different stages. At the present time, Stage One is nearly finished, while Stage Two is under way. Tentative explorations are beginning on Stage Three, and Stage Four is more distant. In Stage One, parallelism is introduced into the hardware of a single computer, which consists of one or more processors, a main storage system, a secondary storage system, and various peripheral devices. In Stage Two, parallel execution of cooperating programs on different machines becomes explicit, while in Stage Three, new languages will make parallelism implicit. In Stage Four, there will be very high level user interfaces capable of interacting with scientists at the same level of abstraction as scientists do with each other.
Mighell, Kenneth John
2010-10-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing
Ozguner, Fusun
1996-01-01
Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time of the application, T_par, is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
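The effect of sequential segments on speedup is commonly quantified with Amdahl's law. The sketch below uses that standard formula, which the abstract does not name; the function name and the example fractions are assumptions for illustration and are not measurements from CSTEM or METCAN.

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup when `serial_fraction` of the run time cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Hypothetical fractions, shown only to illustrate the trend.
half_serial_on_128 = amdahl_speedup(0.5, 128)
fully_parallel_on_8 = amdahl_speedup(0.0, 8)
```

With half of the run time sequential, even 128 processors yield a speedup just under 2, which matches the qualitative finding reported above.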
Solving the Stokes problem on a massively parallel computer
DEFF Research Database (Denmark)
Axelsson, Owe; Barker, Vincent A.; Neytcheva, Maya
2001-01-01
We describe a numerical procedure for solving the stationary two‐dimensional Stokes problem based on piecewise linear finite element approximations for both velocity and pressure, a regularization technique for stability, and a defect‐correction technique for improving accuracy. Eliminating...... boundary value problem for each velocity component, are solved by the conjugate gradient method with a preconditioning based on the algebraic multi‐level iteration (AMLI) technique. The velocity is found from the computed pressure. The method is optimal in the sense that the computational work...... is proportional to the number of unknowns. Further, it is designed to exploit a massively parallel computer with distributed memory architecture. Numerical experiments on a Cray T3E computer illustrate the parallel performance of the method....
Variable-Complexity Multidisciplinary Optimization on Parallel Computers
Grossman, Bernard; Mason, William H.; Watson, Layne T.; Haftka, Raphael T.
1998-01-01
This report covers work conducted under grant NAG1-1562 for the NASA High Performance Computing and Communications Program (HPCCP) from December 7, 1993, to December 31, 1997. The objective of the research was to develop new multidisciplinary design optimization (MDO) techniques which exploit parallel computing to reduce the computational burden of aircraft MDO. The design of the High-Speed Civil Transport (HSCT) aircraft was selected as a test case to demonstrate the utility of our MDO methods. The three major tasks of this research grant included: (1) development of parallel multipoint approximation methods for the aerodynamic design of the HSCT, (2) use of parallel multipoint approximation methods for structural optimization of the HSCT, and (3) mathematical and algorithmic development including support in the integration of parallel computation for items (1) and (2). These tasks have been accomplished with the development of a response surface methodology that incorporates multi-fidelity models. For the aerodynamic design we were able to optimize with up to 20 design variables using hundreds of expensive Euler analyses together with thousands of inexpensive linear theory simulations. We have thereby demonstrated the application of CFD to a large aerodynamic design problem. For predicting structural weight we were able to combine hundreds of structural optimizations of refined finite element models with thousands of optimizations based on coarse models. Computations have been carried out on the Intel Paragon with up to 128 nodes. The parallel computation allowed us to perform combined aerodynamic-structural optimization using state-of-the-art models of complex aircraft configurations.
Parallel MMF: a Multiresolution Approach to Matrix Computation
Kondor, Risi; Teneva, Nedelina; Mudrakarta, Pramod K.
2015-01-01
Multiresolution Matrix Factorization (MMF) was recently introduced as a method for finding multiscale structure and defining wavelets on graphs/matrices. In this paper we derive pMMF, a parallel algorithm for computing the MMF factorization. Empirically, the running time of pMMF scales linearly in the dimension for sparse matrices. We argue that this makes pMMF a valuable new computational primitive in its own right, and present experiments on using pMMF for two distinct purposes: compressing...
Element-topology-independent preconditioners for parallel finite element computations
Park, K. C.; Alexander, Scott
1992-01-01
A family of preconditioners for the solution of finite element equations is presented. The preconditioners are element-topology independent and are thus applicable to element order-free parallel computations. A key feature of the present preconditioners is the repeated use of element connectivity matrices and their left and right inverses. The properties and performance of the present preconditioners are demonstrated via beam and two-dimensional finite element matrices for implicit time integration computations.
Match and Move, an Approach to Data Parallel Computing
1992-10-01
Blelloch, Siddhartha Chatterjee, Jay Sipelstein, and Marco Zagha. CVL: a C Vector Library. School of Computer Science, Carnegie Mellon University...[CBZ90] Siddhartha Chatterjee, Guy E. Blelloch, and Marco Zagha. Scan primitives for vector computers. In Proceedings Supercomputing, November 1990...[Cha91] Siddhartha Chatterjee. Compiling data-parallel programs for efficient execution on shared-memory multiprocessors. PhD thesis, Carnegie Mellon
WEKA-G: Parallel data mining on computational grids
Directory of Open Access Journals (Sweden)
PIMENTA, A.
2009-12-01
Data mining is a technology that can extract useful information from large amounts of data. However, mining a database often requires high computational power. To address this problem, this paper presents a tool (Weka-G) that runs in parallel the algorithms used in the data mining process. As the environment for doing so, we use a computational grid by adding several features within a WAN.
Hardware packet pacing using a DMA in a parallel computer
Chen, Dong; Heidelberger, Phillip; Vranas, Pavlos
2013-08-13
Method and system for hardware packet pacing using a direct memory access controller in a parallel computer which, in one aspect, keeps track of a total number of bytes put on the network as a result of a remote get operation, using a hardware token counter.
PCCM2: A GCM adapted for scalable parallel computers
Energy Technology Data Exchange (ETDEWEB)
Drake, J.; Semeraro, B.D.; Worley, P. [Oak Ridge National Lab., TN (United States); Foster, I.; Michalakes, J.; Toonen, B. [Argonne National Lab., IL (United States); Hack, J.J.; Williamson, D.L. [National Center for Atmospheric Research, Boulder, CO (United States)
1994-01-01
The Computer Hardware, Advanced Mathematics and Model Physics (CHAMMP) program seeks to provide climate researchers with an advanced modeling capability for the study of global change issues. One of the more ambitious projects being undertaken in the CHAMMP program is the development of PCCM2, an adaptation of the Community Climate Model (CCM2) for scalable parallel computers. PCCM2 uses a message-passing, domain-decomposition approach, in which each processor is allocated responsibility for computation on one part of the computational grid, and messages are generated to communicate data between processors. Much of the research effort associated with development of a parallel code of this sort is concerned with identifying efficient decomposition and communication strategies. In PCCM2, this task is complicated by the need to support both semi-Lagrangian transport and spectral transport. Load balancing and parallel I/O techniques are also required. In this paper, the authors review the various parallel algorithms used in PCCM2 and the work done to arrive at a validated model.
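The domain-decomposition idea in this abstract, where each processor owns one part of the computational grid and exchanges messages with the owners of adjacent parts, can be sketched as follows. This is a simplified one-dimensional block decomposition written for illustration, not the decomposition actually used in PCCM2; the function names and the 64-row grid are assumptions.

```python
def block_range(n_rows, n_procs, rank):
    """Half-open range [start, stop) of grid rows owned by `rank`.

    Rows are split as evenly as possible; the first `n_rows % n_procs`
    ranks receive one extra row.
    """
    base, extra = divmod(n_rows, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def neighbours(n_procs, rank):
    """Ranks owning the adjacent blocks, or None at the edge of the grid."""
    lower = rank - 1 if rank > 0 else None
    upper = rank + 1 if rank < n_procs - 1 else None
    return lower, upper


# Hypothetical grid of 64 rows spread over 6 processors.
blocks = [block_range(64, 6, r) for r in range(6)]
```

Each rank would exchange boundary rows only with the ranks returned by `neighbours`, so communication volume depends on the block boundary rather than on the block size.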
Adaptive Methods and Parallel Computation for Partial Differential Equations
1992-05-01
E. Batcher, W. C. Meilander, and J. L. Potter, Eds., Proceedings of the International Conference on Parallel Processing, Computer Society Press...11. P. L. Baehmann, S. L. Wittchen, M. S. Shephard, K. R. Grice, and M. A. Yerry, Robust, geometrically based, automatic two-dimensional mesh
Advances in parallel computer technology for desktop atmospheric dispersion models
Energy Technology Data Exchange (ETDEWEB)
Bian, X.; Ionescu-Niscov, S.; Fast, J.D. [Pacific Northwest National Lab., Richland, WA (United States); Allwine, K.J. [Allwine Environmental Serv., Richland, WA (United States)
1996-12-31
Desktop models are those models used by analysts with varied backgrounds, for performing, for example, air quality assessment and emergency response activities. These models must be robust, well documented, have minimal and well controlled user inputs, and have clear outputs. Existing coarse-grained parallel computers can provide significant increases in computation speed in desktop atmospheric dispersion modeling without considerable increases in hardware cost. This increased speed will allow for significant improvements to be made in the scientific foundations of these applied models, in the form of more advanced diffusion schemes and better representation of the wind and turbulence fields. This is especially attractive for emergency response applications where speed and accuracy are of utmost importance. This paper describes one particular application of coarse-grained parallel computer technology to a desktop complex terrain atmospheric dispersion modeling system. By comparing performance characteristics of the coarse-grained parallel version of the model with the single-processor version, we will demonstrate that applying coarse-grained parallel computer technology to desktop atmospheric dispersion modeling systems will allow us to address critical issues facing future requirements of this class of dispersion models.
The Computational Complexity of the Parallel Knock-Out Problem
Broersma, H.J.; Johnson, M.; Paulusma, D.; Stewart, I.A.; Correa, J.R.; Hevia, A.; Kiwi, M.
2006-01-01
We consider computational complexity questions related to parallel knock-out schemes for graphs. In such schemes, in each round, each remaining vertex of a given graph eliminates exactly one of its neighbours. We show that the problem of whether, for a given graph, such a scheme can be found that el
Cluster based parallel database management system for data intensive computing
Institute of Scientific and Technical Information of China (English)
Jianzhong LI; Wei ZHANG
2009-01-01
This paper describes a computer-cluster based parallel database management system (DBMS), InfiniteDB, developed by the authors. InfiniteDB aims to efficiently support data-intensive computing in response to the rapid growth in database size and the need for high-performance analysis of massive databases. It can be executed efficiently on computing systems composed of thousands of computers, such as cloud computing systems. It supports intra-query, inter-query, intra-operation, inter-operation and pipelining parallelism. It provides effective strategies for managing massive databases, including multiple data declustering methods, declustering-aware algorithms for relational operations and other database operations, and an adaptive query optimization method. It also provides parallel data warehousing and data mining functions, a coordinator-wrapper mechanism to support the integration of heterogeneous information resources on the Internet, and fault-tolerant and resilient infrastructures. It has been used in many applications and has proved quite effective for data-intensive computing.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A.; Mamidala, Amith R.
2015-06-09
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A.; Mamidala, Amith R.
2015-06-02
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A.; Mamidala, Amith R.
2015-06-30
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A.; Mamidala, Amith R.
2015-08-11
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Directory of Open Access Journals (Sweden)
Tarık Çakar
2012-12-01
This paper considers the problem of scheduling a given number of jobs on a specified number of identical parallel robots with unequal release dates and precedence constraints in order to minimize mean tardiness. This problem is strongly NP-hard. The author proposes a hybrid intelligent solution system, which uses Genetic Algorithms and Simulated Annealing (GA+SA). A genetic algorithm, as is well known, is an efficient tool for the solution of combinatorial optimization problems. Solutions for problems of different scales are found using genetic algorithms, simulated annealing and a Hybrid Intelligent Solution System (HISS). Computational results of empirical experiments show that the Hybrid Intelligent Solution System (HISS) is successful with regard to solution quality and computational time.
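The objective in this abstract, mean tardiness on identical parallel machines with release dates, can be made concrete with a small evaluator of the kind a genetic algorithm or simulated annealing search would call on each candidate schedule. This is a sketch under stated assumptions, not the paper's HISS: the data layout and function name are hypothetical, and precedence constraints are deliberately left out to keep the example short.

```python
def mean_tardiness(jobs, schedule):
    """Mean tardiness of a schedule on identical parallel machines.

    `jobs` maps a job name to (release_date, processing_time, due_date).
    `schedule` holds one list of job names per machine, in processing order.
    Precedence constraints are NOT checked in this simplified sketch.
    """
    total = 0.0
    count = 0
    for sequence in schedule:
        clock = 0.0
        for name in sequence:
            release, processing, due = jobs[name]
            clock = max(clock, release) + processing  # wait for release, then run
            total += max(0.0, clock - due)            # tardiness is never negative
            count += 1
    return total / count if count else 0.0


# Hypothetical three-job instance on two machines.
jobs = {"A": (0, 3, 4), "B": (1, 2, 4), "C": (0, 4, 5)}
value = mean_tardiness(jobs, [["A", "B"], ["C"]])
```

In this instance only job B finishes late (at time 5 against a due date of 4), so the mean tardiness over three jobs is 1/3.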
Directory of Open Access Journals (Sweden)
Florica Novăcescu
2011-10-01
HPC (High Performance Computing) has become essential for accelerating innovation and for helping companies create new inventions, better models and more reliable products, as well as processes and services at low cost. This paper focuses on describing the fields of high performance scientific computing, parallel computing, scientific computing and parallel computers, and on trends in the HPC field that reveal important new directions toward the realization of a high performance computational society. The practical part of the work is an example of using an HPC tool to accelerate the solution of an electrostatic optimization problem with the Parallel Computing Toolbox, which allows computational and data-intensive problems to be solved using MATLAB and Simulink on multicore and multiprocessor computers.
Adaptation and hybridization in computational intelligence
Jr, Iztok
2015-01-01
This carefully edited book takes a walk through recent advances in adaptation and hybridization in the Computational Intelligence (CI) domain. It consists of ten chapters that are divided into three parts. The first part provides background information and some theoretical foundations of the CI domain, the second part deals with adaptation in CI algorithms, and the third part focuses on hybridization in CI. This book can serve as an ideal reference for researchers and students of computer science, electrical and civil engineering, economy, and natural sciences who are confronted with solving optimization, modeling and simulation problems. It covers recent advances in CI that encompass nature-inspired algorithms, such as Artificial Neural Networks, Evolutionary Algorithms and Swarm Intelligence-based algorithms.
Lyapunov exponents computation for hybrid neurons.
Bizzarri, Federico; Brambilla, Angelo; Gajani, Giancarlo Storti
2013-10-01
Lyapunov exponents are a basic and powerful tool to characterise the long-term behaviour of dynamical systems. The computation of Lyapunov exponents for continuous time dynamical systems is straightforward whenever they are ruled by vector fields that are sufficiently smooth to admit a variational model. Hybrid neurons do not belong to this wide class of systems since they are intrinsically non-smooth owing to the impact and sometimes switching model used to describe the integrate-and-fire (I&F) mechanism. In this paper we show how a variational model can be defined also for this class of neurons by resorting to saltation matrices. This extension allows the computation of Lyapunov exponent spectrum of hybrid neurons and of networks made up of them through a standard numerical approach even in the case of neurons firing synchronously.
Hybrid Nanoelectronics: Future of Computer Technology
Institute of Scientific and Technical Information of China (English)
Wei Wang; Ming Liu; Andrew Hsu
2006-01-01
Nanotechnology may well prove to be the 21st century's new wave of scientific knowledge that transforms people's lives. Nanotechnology research activities are booming around the globe. This article reviews recent progress in nanoelectronics research in the US and China, and introduces several novel hybrid solutions specifically useful for future computer technology. These exciting new directions will lead to many future inventions, and will have a huge impact on research communities and industries.
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application
Energy Technology Data Exchange (ETDEWEB)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Mundy, Michael B.
2015-07-21
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.
Domain decomposition parallel computing for transient two-phase flow of nuclear reactors
Energy Technology Data Exchange (ETDEWEB)
Lee, Jae Ryong; Yoon, Han Young [KAERI, Daejeon (Korea, Republic of); Choi, Hyoung Gwon [Seoul National University, Seoul (Korea, Republic of)
2016-05-15
KAERI (Korea Atomic Energy Research Institute) has been developing a multi-dimensional two-phase flow code named CUPID for multi-physics and multi-scale thermal hydraulics analysis of light water reactors (LWRs). The CUPID code has been validated against a set of conceptual problems and experimental data. In this work, the CUPID code has been parallelized based on the domain decomposition method with the Message Passing Interface (MPI) library. For domain decomposition, the CUPID code provides both manual and automatic methods with the METIS library. For effective memory management, the compressed sparse row (CSR) format is adopted, which is one of the methods for representing a sparse asymmetric matrix. The CSR format stores only the non-zero values and their positions (row and column). The parallelization of CUPID has been verified successfully against the fundamental problem set. Since the scalability of a parallel simulation is generally known to be better for a fine mesh system, three mesh scales are considered: 40000 meshes for the coarse mesh system, 320000 meshes for the mid-size mesh system, and 2560000 meshes for the fine mesh system. In the given geometry, both single- and two-phase calculations were conducted. In addition, two types of preconditioners for the matrix solver were compared: diagonal and incomplete LU. To enhance the parallel performance, OpenMP and MPI hybrid parallel computing for the pressure solver was examined. The scalability of the hybrid calculation was found to be enhanced for multi-core parallel computation.
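The CSR storage mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from CUPID; the function names `dense_to_csr` and `csr_matvec` and the small test matrix are assumptions made for illustration. In the usual CSR layout the column index is stored per non-zero, while the row position is kept implicitly through row-pointer offsets.

```python
def dense_to_csr(matrix):
    """Convert a dense matrix (list of rows) into CSR arrays.

    Returns (values, col_idx, row_ptr): the non-zero values, their column
    indices, and offsets marking where each row starts in `values`.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row = start of the next
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x, touching only non-zeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


# Hypothetical asymmetric example matrix (not from the paper).
A = [[4.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 2.0]]
values, col_idx, row_ptr = dense_to_csr(A)
```

Only four of the nine entries are stored here; for the large, sparse systems arising from fine meshes the saving grows accordingly.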
A Highly Efficient Parallel Algorithm for Computing the Fiedler Vector
Manguoglu, Murat
2010-01-01
The eigenvector corresponding to the second smallest eigenvalue of the Laplacian of a graph, known as the Fiedler vector, has a number of applications in areas that include matrix reordering, graph partitioning, protein analysis, data mining, machine learning, and web search. The computation of the Fiedler vector has been regarded as an expensive process as it involves solving a large eigenvalue problem. We present a novel and efficient parallel algorithm for computing the Fiedler vector of large graphs based on the Trace Minimization algorithm (Sameh et al.). We compare the parallel performance of our method with a multilevel scheme, designed specifically for computing the Fiedler vector, which is implemented in routine MC73_Fiedler of the Harwell Subroutine Library (HSL). In addition, we compare the quality of the Fiedler vector for the application of weighted matrix reordering and provide a metric for measuring the quality of reordering.
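To make the definition of the Fiedler vector concrete, the sketch below computes it for a tiny graph. It does not implement the Trace Minimization algorithm of the paper; it swaps in a plain shifted power iteration, and the function name `fiedler_vector`, the iteration count and the example graph are assumptions for illustration only.

```python
import math


def fiedler_vector(adjacency, iterations=500):
    """Approximate the Fiedler vector of a connected, unweighted graph.

    Uses power iteration on (shift*I - L), where L = D - A is the graph
    Laplacian, while projecting out the constant vector (the eigenvector
    of L for eigenvalue 0). This is an illustrative stand-in, not the
    Trace Minimization method.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    shift = 2.0 * max(degrees)  # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]  # non-constant starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the constant component
        # y = (shift*I - D + A) x
        y = [(shift - degrees[i]) * x[i]
             + sum(adjacency[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Path graph 0 - 1 - 2 - 3 (hypothetical example).
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
v = fiedler_vector(path)
# The signs of the entries split the path into the halves {0, 1} and {2, 3}.
```

The sign pattern of the vector is what graph partitioning and reordering applications exploit. A starting vector that happens to be orthogonal to the Fiedler vector would defeat this simple iteration, which is one reason production methods are more elaborate.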
Parallel ProXimal Algorithm for Image Restoration Using Hybrid Regularization
Pustelnik, Nelly; Pesquet, Jean-Christophe
2009-01-01
Regularization approaches have demonstrated their effectiveness for solving ill-posed problems. However, in the context of variational restoration methods, a challenging question remains, which is how to find a good regularizer. While total variation introduces staircase effects, wavelet domain regularization brings other artefacts, e.g. ringing. However, a compromise can be found by introducing a hybrid regularization including several terms not necessarily acting in the same domain (e.g. spatial and wavelet transform domains). We adopt a convex optimization framework where the criterion to be minimized is split into a sum of more than two terms. For spatial domain regularization, isotropic or anisotropic total variation definitions using various gradient filters are considered. An accelerated version of the Parallel ProXimal Algorithm is proposed to perform the minimization. Some difficulties in the computation of the proximity operators involved in this algorithm are also addressed in this paper. Numerical...
Directory of Open Access Journals (Sweden)
Chunfeng Liu
2013-01-01
The paper presents a novel hybrid genetic algorithm (HGA) for a deterministic scheduling problem where multiple jobs with arbitrary precedence constraints are processed on multiple unrelated parallel machines. The objective is to minimize total tardiness, since delays of the jobs may lead to penalty costs or cancellation of orders by the clients in many situations. A priority rule-based heuristic algorithm, which schedules a prior job on a prior machine according to the priority rule at each iteration, is suggested and embedded in the HGA to produce initial feasible schedules that can be improved in further stages. Computational experiments show that the proposed HGA performs well with respect to accuracy and efficiency of solution for small-sized problems and gets better results than the conventional genetic algorithm within the same runtime for large-sized problems.
Computing NLTE Opacities -- Node Level Parallel
Energy Technology Data Exchange (ETDEWEB)
Holladay, Daniel [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2015-09-11
Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
Parallel distance matrix computation for Matlab data mining
Skurowski, Przemysław; Staniszewski, Michał
2016-06-01
The paper presents utility functions for computing a distance matrix, which plays a crucial role in data mining. The design goal was to enable operation on relatively large datasets by overcoming the basic shortcoming, computing time, with an easy-to-use interface. The presented solution is a set of functions created with emphasis on practical applicability in real life. The proposed solution is presented along with the theoretical background for the performance scaling. Furthermore, different approaches to parallel computing are analyzed, including shared memory, which is uncommon in the Matlab environment.
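A minimal sketch of the row-wise decomposition behind a parallel distance matrix follows. It is written in Python rather than Matlab and is not the authors' code; `distance_row`, `distance_matrix` and the `workers` parameter are hypothetical names. Threads are used only to show how rows can be computed independently over shared input data; in CPython they are not expected to speed up this CPU-bound loop.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def distance_row(points, i):
    """Euclidean distances from point i to every point in the dataset."""
    return [math.dist(points[i], q) for q in points]


def distance_matrix(points, workers=4):
    """Build the full distance matrix, one independent task per row.

    All tasks read the same shared `points` list, which mirrors the
    shared-memory style of decomposition.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: distance_row(points, i),
                             range(len(points))))


# Hypothetical three-point dataset.
D = distance_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

Because each row depends only on the input points, the same decomposition carries over to process pools or worker pools in other environments.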
Position Analysis of a Hybrid Serial-Parallel Manipulator in Immersion Lithography
Directory of Open Access Journals (Sweden)
Jie-jie Shao
2015-01-01
This paper proposes a novel hybrid serial-parallel mechanism with 6 degrees of freedom. The new mechanism combines two different parallel modules in a serial form. The 3-P̲(PH) parallel module is an architecture with 3 degrees of freedom based on higher joints and specializes in describing the relative pose of two planes. The 3-P̲SP parallel module is a typical architecture that has been widely investigated in recent research. In this paper, the direct and inverse position problems of the 3-P̲SP parallel module in the couple mixed-type mode are analyzed in detail, and the solutions are obtained in analytical form. Furthermore, the solutions for the direct and inverse position problems of the novel hybrid serial-parallel mechanism are also derived in analytical form. The proposed hybrid serial-parallel mechanism is applied to regulate the immersion hood's pose in an immersion lithography system. By measuring and regulating the pose of the immersion hood with respect to the wafer surface simultaneously, the immersion hood can track the wafer surface's pose in real time and the gap status is stabilized. This work is a further exploration of applications of hybrid serial-parallel mechanisms.
Conceptual design of a hybrid parallel mechanism for mask exchanging of TMT
Wang, Jianping; Zhou, Hongfei; Li, Kexuan; Zhou, Zengxiang; Zhai, Chao
2015-10-01
The mask exchange system is an important part of the Multi-Object Broadband Imaging Echellette (MOBIE) on the Thirty Meter Telescope (TMT). To solve the problem of the stiffness of the mask exchange system in the MOBIE changing with the gravity vector, the hybrid parallel mechanism design method was introduced. By using the high stiffness and precision of a parallel structure, combined with the large moving range of a serial structure, a conceptual design of a hybrid parallel mask exchange system based on a 3-RPS parallel mechanism was presented. According to the position requirements of the MOBIE, a SolidWorks structure model of the hybrid parallel mask exchange robot was established, and an appropriate installation position that does not interfere with the related components and light path in the MOBIE of TMT was analyzed. Simulation results in SolidWorks suggested that the 3-RPS parallel platform had good stiffness in different gravity vector directions. Furthermore, through study of the mechanism theory, the inverse kinematics solution of the 3-RPS parallel platform was calculated and the mathematical relationship between the attitude angle of the moving platform and the angle of the ball hinges on the moving platform was established, in order to analyze the attitude adjustment ability of the hybrid parallel mask exchange robot. The proposed conceptual design offers guidance for the design of the mask exchange system of the MOBIE on TMT.
New Parallel computing framework for radiation transport codes
Energy Technology Data Exchange (ETDEWEB)
Kostin, M.A.; /Michigan State U., NSCL; Mokhov, N.V.; /Fermilab; Niita, K.; /JAERI, Tokai
2010-09-01
A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
New Parallel computing framework for radiation transport codes
Kostin, M A; Niita, K
2012-01-01
A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be...
Parallel computing in genomic research: advances and applications.
Ocaña, Kary; de Oliveira, Daniel
2015-01-01
Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied to reduce the total processing time and to ease the management, treatment, and analysis of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing units requires expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists to process their genomic experiments using HPC capabilities and parallelism techniques. This article presents a systematic review of the literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that scientists can consider when running their genomic experiments to benefit from parallelism techniques and HPC capabilities.
Energy Technology Data Exchange (ETDEWEB)
Pan, V.Y. [Lehman College, Bronx, NY (United States)
1996-12-31
We devise effective randomized parallel algorithms for the solution (over any field and some integer rings of constants) of several fundamental problems of computations with polynomials and structured matrices, well known for their resistance to effective parallel solution. These include computing the gcd of two polynomials, as well as any selected entry of the extended Euclidean scheme for these polynomials and of the Padé approximation table, the solution of the Berlekamp-Massey problem of recovering the coefficients of a linear recurrence from its terms, the solution of a nonsingular Toeplitz linear system of equations, computing the ranks of Toeplitz matrices, and other related computations with Toeplitz, Hankel, Vandermonde, Cauchy (generalized Hilbert) matrices and with matrices having similar structures. Our algorithms enable us to reach new record estimates for the randomized parallel arithmetic complexity of these computations, that is, O((log n)^3) time and O((n^2 log log n) / (log n)^2) arithmetic processors, n being the input size. The results have further applications to numerous related computations over abstract fields.
Point to point processing of digital images using parallel computing
Directory of Open Access Journals (Sweden)
Eric Olmedo
2012-05-01
This paper presents an approach to the point-to-point processing of digital images using parallel computing, particularly for grayscale conversion, brightening, darkening, thresholding and contrast change. The point-to-point technique applies a transformation to each pixel of the image concurrently rather than sequentially. The approach uses CUDA as the parallel programming tool on a GPU in order to take advantage of all available cores. Preliminary results show that CUDA obtains better results for most of the filters used. The exception is the negative filter on lower-resolution images, where OpenCV obtained better results; on high-resolution images, CUDA performance is better.
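The point-to-point idea can be sketched as follows: each output pixel depends only on the corresponding input pixel, so every pixel could be handled by its own GPU thread. The sketch is serial Python, not the paper's CUDA code, and the function names, the 8-bit grayscale assumption and the default parameters are illustrative assumptions.

```python
def apply_pointwise(image, transform):
    """Apply `transform` to every pixel independently.

    No pixel depends on its neighbours, so the loop body is what a
    per-pixel GPU kernel would execute concurrently.
    """
    return [[transform(p) for p in row] for row in image]


def negative(p):
    """Invert an 8-bit grayscale value."""
    return 255 - p


def brighten(p, delta=40):
    """Add a constant, clamped to the 8-bit range (delta is assumed)."""
    return min(255, p + delta)


def threshold(p, cutoff=128):
    """Map a pixel to white or black around an assumed cutoff."""
    return 255 if p >= cutoff else 0


# Hypothetical 2x2 grayscale image.
image = [[0, 100],
         [200, 255]]
inverted = apply_pointwise(image, negative)
binary = apply_pointwise(image, threshold)
brighter = apply_pointwise(image, brighten)
```

Since the work per pixel is tiny, transfer and launch overheads can dominate on small images, which would be consistent with the reported advantage of the CPU library at low resolutions.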
Institute of Scientific and Technical Information of China (English)
明东; 肖晓琳; 汤佳贝; 许敏鹏
2015-01-01
A hybrid SSVEP-P300 brain-computer interface (SP-BCI) system, which combines the steady-state visual evoked potential (SSVEP) with the P300 component of the event-related potential (ERP), can induce both signals at the same time. It takes advantage of the high signal-to-noise ratio and asynchronous compatibility of SSVEP and of the ability of P300 to present a large number of commands. It also has the potential to improve the information transfer rate (ITR) of the system, but existing evoked paradigms do not make full use of these characteristics. This paper proposes a new strategy in which SSVEP is evoked and blocked (SSVEP-B) asynchronously according to the respective frequency, P300 is evoked at the same time, and SSVEP-B is combined with P300 for classification. Ten healthy subjects participated in the study. The results of the offline tests show that the system reached an average accuracy of 84.5% and a highest theoretical information transfer rate of 89.5 bit/min. The results indicate that the new strategy is conducive to improving the accuracy and information transfer rate of the BCI system, and that the related research ideas and technologies can be used as a reference for designing and generalizing hybrid BCI systems.
Exploring Strategies for Parallel Computing of RS Data Assimilation with SWAP-GA
Directory of Open Access Journals (Sweden)
Shamim Akhter
2007-01-01
An agro-hydrological simulation model is useful for agriculture monitoring. One issue in running such a model is parameter identification, especially when the target area is large, such as at the provincial or country level. Remote Sensing (RS) provides useful information over large areas, but RS cannot observe the input parameters of agro-hydrological models directly. However, a method to estimate the input parameters of such a model from RS using data assimilation has been proposed by Ines [1] using the SWAP (Soil, Water, Atmosphere and Plant) model. A Genetic Algorithm (GA) was used in this optimization process. The combined model of SWAP and GA is called the SWAP-GA model. When dealing with sufficiently large and complex processing of RS data, the processing time on a single computer extends to unacceptable limits. It becomes necessary to introduce methods for using higher processing power, such as distributed computing. Cluster-based computing supports both high-performance and load-balancing parallel or distributed applications. Implementing SWAP-GA on cluster computers is expected to remove the computational time constraint; under this hypothesis, three parallel SWAP-GA approaches are proposed in this study: distributed population (the GA works in a distributed manner), distributed pixel (pixels are processed in parallel), and a mix of the distributed population and pixel models, called the hybrid model. The technical considerations of implementing these methodologies are discussed here.
Parallelization of the NASA Goddard Cumulus Ensemble Model for Massively Parallel Computing
Directory of Open Access Journals (Sweden)
Hann-Ming Henry Juang
2007-01-01
Massively parallel computing, using a message passing interface (MPI), has been implemented in a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The implementation uses the domain-resemble concept to design a code structure for both the whole domain and the sub-domains after decomposition. Instead of inserting a group of MPI-related statements into the model routine, these statements are packed into a single routine. In other words, only a single call statement is placed in the model code at each such point, so the impact on the original code is minimal. Therefore, the model is easily modified and/or managed by model developers and/or users who have little knowledge of massively parallel computing.
Hoekstra, A.G.; Sloot, P.M.A.; Haan, M.J.; Hertzberger, L.O.; van Leeuwen, J.
1991-01-01
New developments in Computer Science, both hardware and software, offer researchers, such as physicists, unprecedented possibilities to solve their computationally intensive problems. However, full exploitation of e.g. new massively parallel computers, parallel languages or runtime environments require
Distributed parallel computing in stochastic modeling of groundwater systems.
Dong, Yanhui; Li, Guomin; Xu, Haizhen
2013-03-01
Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling. © 2012, The Author(s). Groundwater © 2012, National Ground Water Association.
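The reported timing can be turned into a speedup and a parallel efficiency with a short worked calculation. The sketch assumes that "reduced to 3%" means the parallel run takes 0.03 times the serial run time; the function name `speedup_and_efficiency` is hypothetical.

```python
def speedup_and_efficiency(time_fraction, workers):
    """Return (speedup, efficiency) for a parallel run.

    time_fraction: parallel time divided by serial time (assumed reading
    of the abstract's "3%").
    workers: number of processing threads used.
    """
    speedup = 1.0 / time_fraction
    efficiency = speedup / workers
    return speedup, efficiency


# 500 realizations on 50 threads, run time reduced to 3% of serial.
speedup, efficiency = speedup_and_efficiency(0.03, 50)
```

Under that reading, the speedup is roughly 33 on 50 threads, that is, a parallel efficiency of about two thirds.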
DEFF Research Database (Denmark)
The following topics are dealt with: parallel scientific computing; numerical algorithms; parallel nonnumerical algorithms; cloud computing; evolutionary computing; metaheuristics; applied mathematics; GPU computing; multicore systems; hybrid architectures; hierarchical parallelism; HPC systems; ...
Institute of Scientific and Technical Information of China (English)
Zhang Laiping; Zhao Zhong; Chang Xinghua; He Xin
2013-01-01
A hybrid grid generation technique and a multigrid/parallel algorithm are presented in this paper for turbulent flow simulations over three-dimensional (3D) complex geometries. The hybrid grid generation technique is based on an agglomeration method for anisotropic tetrahedrons. Firstly, the complex computational domain is covered by pure tetrahedral grids, in which anisotropic tetrahedrons are adopted to discretize the boundary layer and isotropic tetrahedrons the outer field. Then, the anisotropic tetrahedrons in the boundary layer are agglomerated to generate prismatic grids. The agglomeration method can improve the grid quality in the boundary layer and reduce the number of grid cells, enhancing numerical accuracy and efficiency. In order to accelerate convergence, a multigrid/parallel algorithm is also developed based on the anisotropic agglomeration approach. The numerical results demonstrate the excellent accelerating capability of this multigrid method.
Executing a gather operation on a parallel computer
Archer, Charles J [Rochester, MN; Ratterman, Joseph D [Rochester, MN
2012-03-20
Methods, apparatus, and computer program products are disclosed for executing a gather operation on a parallel computer according to embodiments of the present invention. Embodiments include configuring, by the logical root, a result buffer of the logical root, the result buffer having positions, each position corresponding to a ranked node in the operational group and for storing contribution data gathered from that ranked node. Embodiments also include repeatedly for each position in the result buffer: determining, by each compute node of an operational group, whether the current position in the result buffer corresponds with the rank of the compute node; if the current position in the result buffer corresponds with the rank of the compute node, contributing, by that compute node, the compute node's contribution data; if the current position in the result buffer does not correspond with the rank of the compute node, contributing, by that compute node, a value of zero for the contribution data; and storing, by the logical root in the current position in the result buffer, results of a bitwise OR operation of all the contribution data by all compute nodes of the operational group for the current position, the results received through the global combining network.
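The gather-by-OR scheme described above can be simulated in a few lines. This is a single-process sketch, not the patented implementation: the global combining network is replaced by a plain reduction, contribution data is assumed to be a non-negative integer per node, and `gather_by_or` is a hypothetical name.

```python
from functools import reduce
from operator import or_


def gather_by_or(contributions):
    """Simulate a gather built on a bitwise-OR combine.

    contributions[r] is the data held by the node with rank r. For each
    buffer position, only the node whose rank matches contributes its
    data; every other node contributes zero, so the OR of all
    contributions equals the matching node's data.
    """
    n = len(contributions)
    result = [0] * n  # result buffer at the logical root
    for position in range(n):
        offered = [contributions[rank] if rank == position else 0
                   for rank in range(n)]
        result[position] = reduce(or_, offered, 0)  # stand-in for the network
    return result


# Four hypothetical nodes with ranks 0..3.
gathered = gather_by_or([0b1010, 7, 0, 255])
```

The scheme works because zero is the identity element of bitwise OR, so a combine-only network can deliver one node's data unchanged.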
Establishing a group of endpoints in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.; Xue, Hanhong
2016-02-02
A parallel computer executes a number of tasks, each task includes a number of endpoints, and the endpoints are configured to support collective operations. In such a parallel computer, establishing a group of endpoints includes: receiving a user specification of a set of endpoints included in a global collection of endpoints, where the user specification defines the set in accordance with a predefined virtual representation of the endpoints, the predefined virtual representation is a data structure setting forth an organization of tasks and endpoints included in the global collection of endpoints, and the user specification defines the set of endpoints without a user specification of a particular endpoint; and defining a group of endpoints in dependence upon the predefined virtual representation of the endpoints and the user specification.
Parallel computing helps 3D depth imaging, processing
Energy Technology Data Exchange (ETDEWEB)
Nestvold, E. O. [IBM, Houston, TX (United States); Su, C. B. [IBM, Dallas, TX (United States); Black, J. L. [Landmark Graphics, Denver, CO (United States); Jack, I. G. [BP Exploration, London (United Kingdom)
1996-10-28
The significance of 3D seismic data in the petroleum industry during the past decade cannot be overstated. Having started as a technology too expensive to be utilized except by major oil companies, 3D technology is now routinely used by independent operators in the US and Canada. As with all emerging technologies, documentation of successes has been limited. There are some successes, however, that have been summarized in the literature in the recent past. Key technological developments contributing to this success have been major advances in RISC workstation technology, 3D depth imaging, and parallel computing. This article presents the basic concepts of parallel seismic computing, showing how it impacts both 3D depth imaging and more-conventional 3D seismic processing.
Final Report: Center for Programming Models for Scalable Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Mellor-Crummey, John [William Marsh Rice University
2011-09-13
As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.
Development of unstructured mesh generator on parallel computers
Energy Technology Data Exchange (ETDEWEB)
Muramatsu, Kazuhiro [Japan Atomic Energy Research Inst., Tokyo (Japan)]; Shimada, Akio; Murakami, Hiroyuki; Higashida, Akihiro; Wakatsuki, Shigeto [Fuji Research Institute Corporation, Computational Engineering II, Tokyo (Japan)]
2000-09-01
A general-purpose unstructured mesh generator, 'GRID3D/UNST', has been developed on parallel computers. The high-speed operation and large memory capacity of parallel computers enable the system to generate a large-scale mesh at high speed. In practice, the system generates a large-scale mesh composed of 2,400,000 nodes and 14,000,000 elements in about 1.5 hours on a HITACHI SR2201 with 64 PEs (Processing Elements), after a 2.5-hour pre-process on a SUN. The system is also built on standard FORTRAN, C and Motif, and is therefore highly portable. The system enables us to solve large-scale problems that were previously impossible to solve, and to break new ground in the field of science and engineering. (author)
Performing a local reduction operation on a parallel computer
Blocksome, Michael A.; Faraj, Daniel A.
2012-12-11
A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.
Oloso, Amidu Olawale
A hybrid automatic differentiation/incremental iterative method was implemented in the general purpose advanced computational fluid dynamics code (CFL3D Version 4.1) to yield a new code (CFL3D.ADII) that is capable of computing consistently discrete first order sensitivity derivatives for complex geometries. With the exception of unsteady problems, the new code retains all the useful features and capabilities of the original CFL3D flow analysis code. The superiority of the new code over a carefully applied method of finite-differences is demonstrated. A coarse grain, scalable, distributed-memory, parallel version of CFL3D.ADII was developed based on "derivative stripmining". In this data-parallel approach, an identical copy of CFL3D.ADII is executed on each processor with different derivative input files. The effect of communication overhead on the overall parallel computational efficiency is negligible. However, the fraction of CFL3D.ADII duplicated on all processors has significant impact on the computational efficiency. To reduce the large execution time associated with the sequential 1-D line search in gradient-based aerodynamic optimization, an alternative parallel approach was developed. The execution time of the new approach was reduced effectively to that of one flow analysis, regardless of the number of function evaluations in the 1-D search. The new approach was found to yield design results that are essentially identical to those obtained from the traditional sequential approach but at much smaller execution time. The parallel CFL3D.ADII and the parallel 1-D line search are demonstrated in shape improvement studies of a realistic High Speed Civil Transport (HSCT) wing/body configuration represented by over 100 design variables and 200,000 grid points in inviscid supersonic flow on the 16 node IBM SP2 parallel computer at the Numerical Aerospace Simulation (NAS) facility, NASA Ames Research Center. In addition to making the handling of such a large
Vectorial Preisach-type model designed for parallel computing
Energy Technology Data Exchange (ETDEWEB)
Stancu, Alexandru [Department of Solid State and Theoretical Physics, Al. I. Cuza University, Blvd. Carol I, 11, 700506 Iasi (Romania)]. E-mail: alstancu@uaic.ro; Stoleriu, Laurentiu [Department of Solid State and Theoretical Physics, Al. I. Cuza University, Blvd. Carol I, 11, 700506 Iasi (Romania); Andrei, Petru [Electrical and Computer Engineering, Florida State University, Tallahassee, FL (United States); Electrical and Computer Engineering, Florida A and M University, Tallahassee, FL (United States)
2007-09-15
Most phenomenological hysteresis models are scalar, while all magnetization processes are vectorial. The vector models, whether phenomenological or micromagnetic (physical), are time consuming and sometimes difficult to implement. In this paper, we introduce a new vector Preisach-type model that uses micromagnetic results to simulate the magnetic response of a system of several tens of thousands of pseudo-particles. The model has a modular structure that allows easy implementation for parallel computing.
Analysis of coupled heat and moisture transport on parallel computers
Koudelka, Tomáš; Krejčí, Tomáš
2017-07-01
Coupled analysis of heat and moisture transport in complicated structural elements or in whole structures deserves special attention because, after space discretization, a large number of degrees of freedom is needed. This paper describes a possible solution of such problems based on domain decomposition methods executed on parallel computers. The Schur complement method is used with respect to nonsymmetric systems of algebraic equations. The method described is an alternative to other methods, e.g. homogenization on two or more scales.
A scalable PC-based parallel computer for lattice QCD
Fodor, Z; Papp, G
2002-01-01
A PC-based parallel computer for medium/large scale lattice QCD simulations is suggested. The Eotvos Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes. Gigabit Ethernet cards are used for nearest neighbor communication in a two-dimensional mesh. The sustained performance for dynamical staggered (Wilson) quarks on large lattices is around 70 (110) GFlops. The exceptional price/performance ratio is below $1/Mflop.
Performing an allreduce operation on a plurality of compute nodes of a parallel computer
Faraj, Ahmad [Rochester, MN]
2012-04-17
Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer. Each compute node includes at least two processing cores. Each processing core has contribution data for the allreduce operation. Performing an allreduce operation on a plurality of compute nodes of a parallel computer includes: establishing one or more logical rings among the compute nodes, each logical ring including at least one processing core from each compute node; performing, for each logical ring, a global allreduce operation using the contribution data for the processing cores included in that logical ring, yielding a global allreduce result for each processing core included in that logical ring; and performing, for each compute node, a local allreduce operation using the global allreduce results for each processing core on that compute node.
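The three steps named in the abstract above (form logical rings, reduce globally per ring, reduce locally per node) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the reduction is taken to be a sum, ring k is assumed to link core k of every node, and the function name hierarchical_allreduce is hypothetical. It does not model the actual network traffic or the patented implementation.

```python
from typing import List


def hierarchical_allreduce(contrib: List[List[float]]) -> List[List[float]]:
    """Simulate a two-level allreduce (sum) in one process.

    contrib[n][c] is the contribution of core c on node n.
    Assumption: every node has the same number of cores.
    """
    num_nodes = len(contrib)
    num_cores = len(contrib[0])

    # Step 1: establish logical rings. Ring k holds core k of every node,
    # so each ring includes exactly one core from each compute node.
    # Step 2: global allreduce per ring. Every core in ring k ends up
    # holding the sum of the contributions of the cores in that ring.
    ring_result = [
        sum(contrib[n][k] for n in range(num_nodes)) for k in range(num_cores)
    ]
    after_global = [list(ring_result) for _ in range(num_nodes)]

    # Step 3: local allreduce on each node combines the ring results held
    # by that node's cores, so every core holds the overall total.
    result = []
    for n in range(num_nodes):
        node_total = sum(after_global[n])
        result.append([node_total] * num_cores)
    return result


# Three nodes with two cores each (illustrative values).
contrib = [[1, 2], [3, 4], [5, 6]]
out = hierarchical_allreduce(contrib)
print(out)
```

With these inputs, ring 0 sums to 9 and ring 1 sums to 12, and the local step combines them so that every core holds 21.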
Parallel Computation of the Regional Ocean Modeling System (ROMS)
Energy Technology Data Exchange (ETDEWEB)
Wang, P; Song, Y T; Chao, Y; Zhang, H
2005-04-05
The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds of processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.
Universal algorithms, mathematics of semirings and parallel computations
Litvinov, G L; Rodionov, A Ya; Sobolevski, A N
2010-01-01
This is a survey paper on applications of mathematics of semirings to numerical analysis and computing. Concepts of universal algorithm and generic program are discussed. Relations between these concepts and mathematics of semirings are examined. A very brief introduction to mathematics of semirings (including idempotent and tropical mathematics) is presented. Concrete applications to optimization problems, idempotent linear algebra and interval analysis are indicated. It is known that some nonlinear problems (and especially optimization problems) become linear over appropriate semirings with idempotent addition (the so-called idempotent superposition principle). This linearity over semirings is convenient for parallel computations.
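The idea of a generic program over a semiring can be sketched as a matrix product whose addition and multiplication are passed in as parameters. The sketch below is illustrative only: the function name semiring_matmul and the example matrices are assumptions, not taken from the survey. Run with ordinary (+, *) it is the usual matrix product; run with the min-plus (tropical) semiring, where addition is min and multiplication is +, the same code computes shortest path lengths using at most two edges.

```python
INF = float("inf")


def semiring_matmul(A, B, add, mul, zero):
    """Matrix product over an arbitrary semiring (add, mul, zero)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = zero  # additive identity of the chosen semiring
            for k in range(m):
                acc = add(acc, mul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C


# Ordinary arithmetic: the familiar matrix product.
standard = semiring_matmul(
    [[1, 2], [3, 4]], [[5, 6], [7, 8]],
    add=lambda x, y: x + y, mul=lambda x, y: x * y, zero=0,
)

# Min-plus semiring: W[i][j] is the weight of edge i -> j (INF = no edge).
# "Addition" is min (idempotent: min(a, a) == a), "multiplication" is +.
W = [[0, 4, INF],
     [INF, 0, 1],
     [2, INF, 0]]
two_hop = semiring_matmul(W, W, add=min, mul=lambda x, y: x + y, zero=INF)

print(standard)
print(two_hop)
```

Each entry C[i][j] is computed independently of the others, which is one reason this linear structure is convenient for parallel computation.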
Local rollback for fault-tolerance in parallel computing systems
Blumrich, Matthias A [Yorktown Heights, NY; Chen, Dong [Yorktown Heights, NY; Gara, Alan [Yorktown Heights, NY; Giampapa, Mark E [Yorktown Heights, NY; Heidelberger, Philip [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Steinmacher-Burow, Burkhard [Boeblingen, DE; Sugavanam, Krishnan [Yorktown Heights, NY
2012-01-24
A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.
Digital image processing using parallel computing based on CUDA technology
Skirnevskiy, I. P.; Pustovit, A. V.; Abdrashitova, M. O.
2017-01-01
This article describes the expediency of using a graphics processing unit (GPU) for big data processing in the context of digital image processing. It provides a short description of parallel computing technology and its use in different areas, a definition of image noise, and a brief overview of some noise removal algorithms. It also describes some basic requirements that a noise removal algorithm should meet when applied to computer tomography. It compares performance with and without the GPU, as well as with different proportions of CPU and GPU use.
Runtime optimization of an application executing on a parallel computer
Faraj, Daniel A.; Smith, Brian E.
2013-01-29
Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.
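The branching described in the abstract above reduces to a small decision function. The sketch below is a hedged illustration: the name should_establish_tuning_session and the representation of call sites as one identifier per compute node are assumptions made for the example, not part of the source.

```python
from typing import Hashable, Sequence


def should_establish_tuning_session(
    is_root_based: bool, call_sites: Sequence[Hashable]
) -> bool:
    """Decide whether a collective operation runs in a tuning session.

    call_sites holds, for each compute node, the call site at which that
    node identified the collective operation (hypothetical representation).
    """
    if not is_root_based:
        # Collectives that are not root-based are always tuned.
        return True
    # Root-based collectives are tuned only if all compute nodes
    # identified the operation at the same call site.
    return len(set(call_sites)) == 1


print(should_establish_tuning_session(False, ["a.c:10", "b.c:22"]))
print(should_establish_tuning_session(True, ["a.c:10", "a.c:10"]))
print(should_establish_tuning_session(True, ["a.c:10", "b.c:22"]))
```

In the last case the operation would simply be executed without a tuning session.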
Peredo, Oscar; Ortiz, Julián M.; Herrero, José R.
2015-12-01
The Geostatistical Software Library (GSLIB) has been used in the geostatistical community for more than thirty years. It was designed as a bundle of sequential Fortran codes, and today it is still in use by many practitioners and researchers. Despite its widespread use, few attempts have been reported to bring this package to the multi-core era. Using all CPU resources, GSLIB algorithms can handle large datasets and grids, where tasks are compute- and memory-intensive applications. In this work, a methodology is presented to accelerate GSLIB applications using code optimization and hybrid parallel processing, specifically for compute-intensive applications. Minimal code modifications are made, decreasing as much as possible the elapsed execution time of the studied routines. If multi-core processing is available, the user can activate OpenMP directives to speed up the execution using all resources of the CPU. If multi-node processing is available, the execution is enhanced using MPI messages between the compute nodes. Four case studies are presented: experimental variogram calculation, kriging estimation, sequential Gaussian simulation and indicator simulation. For each application, three scenarios (small, large and extra large) are tested using a desktop environment with 4 CPU-cores and a multi-node server with 128 CPU-nodes. Elapsed times, speedup and efficiency results are shown.
Nguyen, Tuan-Anh; Nakib, Amir; Nguyen, Huy-Nam
2016-06-01
The non-local means denoising filter has been established as the gold standard for image denoising in general, and in medical imaging in particular, due to its efficiency. However, its computation time has limited its use in real-world applications, especially in medical imaging. In this paper, a distributed version on a parallel hybrid architecture is proposed to solve the computation time problem. A new method to compute the filter coefficients is also proposed, in which we focus on the implementation and on enhancing the filter parameters by taking the neighborhood of the current voxel into account more accurately. In terms of implementation, our key contribution consists in reducing the number of shared memory accesses. The proposed method was tested on the brain-web database for different levels of noise. Performance and sensitivity were quantified in terms of speedup, peak signal-to-noise ratio, execution time, and the number of floating point operations. The obtained results demonstrate the efficiency of the proposed method. Moreover, the implementation is compared to that of other techniques recently published in the literature.
Control Strategy Optimization for Parallel Hybrid Electric Vehicles Using a Memetic Algorithm
Directory of Open Access Journals (Sweden)
Yu-Huei Cheng
2017-03-01
Hybrid electric vehicle (HEV) control strategy is a management approach for generating, using, and saving energy. Therefore, the optimal control strategy is the key to effectively managing hybrid electric vehicles. In order to realize the optimal control strategy, we use a robust evolutionary computation method called a “memetic algorithm” (MA) to optimize the control parameters in parallel HEVs. The “local search” mechanism implemented in the MA greatly enhances its search capabilities. In the implementation of the method, the fitness function combines with the ADvanced VehIcle SimulatOR (ADVISOR) and is set up according to an electric assist control strategy (EACS) to minimize the fuel consumption (FC) and emissions (HC, CO, and NOx) of the vehicle engine. At the same time, driving performance requirements are also considered in the method. Four different driving cycles, the new European driving cycle (NEDC), Federal Test Procedure (FTP), Economic Commission for Europe + Extra-Urban driving cycle (ECE + EUDC), and urban dynamometer driving schedule (UDDS), are carried out using the proposed method to find their respective optimal control parameters. The results show that the proposed method effectively helps to reduce fuel consumption and emissions, as well as guarantee vehicle performance.
Hybrid parallel execution model for logic-based specification languages
Tsai, Jeffrey J P
2001-01-01
Parallel processing is a very important technique for improving the performance of various software development and maintenance activities. The purpose of this book is to introduce important techniques for parallel execution of high-level specifications of software systems. These techniques are very useful for the construction, analysis, and transformation of reliable large-scale and complex software systems. Contents: Current Approaches; Overview of the New Approach; FRORL Requirements Specification Language and Its Decomposition; Rewriting and Data Dependency, Control Flow Analysis of a Lo
Blocksome, Michael A.; Mamidala, Amith R.
2013-09-03
Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
Energy Technology Data Exchange (ETDEWEB)
Faraj, Daniel A.
2015-11-19
Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Faraj, Daniel A
2013-07-16
Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations
Darmawan, J. B. B.; Mungkasi, S.
2017-01-01
In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations are commonly applied in engineering science, so solving them quickly and efficiently is desirable. Therefore, we propose the use of parallel computation. Our parallel computation uses NVIDIA's CUDA. Our results show that parallel computation using CUDA offers a great advantage and is powerful when the computation is of large scale.
Semi-coarsening multigrid methods for parallel computing
Energy Technology Data Exchange (ETDEWEB)
Jones, J.E.
1996-12-31
Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.
Parallel matrix transpose algorithms on distributed memory concurrent computers
Energy Technology Data Exchange (ETDEWEB)
Choi, J.; Walker, D.W. [Oak Ridge National Lab., TN (United States)]; Dongarra, J.J. [Oak Ridge National Lab., TN (United States); Univ. of Tennessee, Knoxville, TN (United States). Dept. of Computer Science]
1993-10-01
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A·B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T·B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
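The way the processor template shape selects the communication scheme can be worked through with a little arithmetic. The sketch below is an illustration, not the PUMMA implementation: the function name transpose_schedule and the returned field names are hypothetical, and it only evaluates the quantities named in the abstract (GCD, LCM, and the LCM/GCD step count). Applying the step formula when P and Q are relatively prime is an assumption of this sketch.

```python
from math import gcd


def transpose_schedule(P: int, Q: int) -> dict:
    """Summarize the communication pattern for a P x Q processor template."""
    g = gcd(P, Q)
    lcm = P * Q // g
    return {
        "gcd": g,                      # number of processor groups
        "lcm": lcm,
        "steps": lcm // g,             # LCM/GCD steps, as stated in the abstract
        "complete_exchange": g == 1,   # relatively prime P and Q
    }


print(transpose_schedule(4, 6))
print(transpose_schedule(3, 4))
```

For a 4 x 6 template the GCD is 2 and the LCM is 12, giving 6 steps with two overlapped groups; for a 3 x 4 template P and Q are relatively prime, which corresponds to the complete exchange case.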
Guo, Peng; Cheng, Wenming; Wang, Yi
2015-11-01
This article considers the parallel machine scheduling problem with step-deteriorating jobs and sequence-dependent setup times. The objective is to minimize the total tardiness by determining the allocation and sequence of jobs on identical parallel machines. In this problem, the processing time of each job is a step function dependent upon its starting time. An individual extended time is penalized when the starting time of a job is later than a specific deterioration date. The possibility of deterioration of a job makes the parallel machine scheduling problem more challenging than ordinary ones. A mixed integer programming model for the optimal solution is derived. Due to its NP-hard nature, a hybrid discrete cuckoo search algorithm is proposed to solve this problem. In order to generate a good initial swarm, a modified Biskup-Hermann-Gupta (BHG) heuristic called MBHG is incorporated into the population initialization. Several discrete operators are proposed in the random walk of Lévy flights and the crossover search. Moreover, a local search procedure based on variable neighbourhood descent is integrated into the algorithm as a hybrid strategy in order to improve the quality of elite solutions. Computational experiments are executed on two sets of randomly generated test instances. The results show that the proposed hybrid algorithm can yield better solutions in comparison with the commercial solver CPLEX® with a one hour time limit, the discrete cuckoo search algorithm and the existing variable neighbourhood search algorithm.
Fast structural design and analysis via hybrid domain decomposition on massively parallel processors
Farhat, Charbel
1993-01-01
A hybrid domain decomposition framework for static, transient and eigen finite element analyses of structural mechanics problems is presented. Its basic ingredients include physical substructuring and/or automatic mesh partitioning, mapping algorithms, 'gluing' approximations for fast design modifications and evaluations, and fast direct and preconditioned iterative solvers for local and interface subproblems. The overall methodology is illustrated with the structural design of a solar viewing payload that is scheduled to fly in March 1993. This payload has been entirely designed and validated by a group of undergraduate students at the University of Colorado using the proposed hybrid domain decomposition approach on a massively parallel processor. Performance results are reported on the CRAY Y-MP/8 and the iPSC-860/64 Touchstone systems, which represent the two extremes of parallel architectures. The hybrid domain decomposition methodology is shown to outperform leading solution algorithms and to exhibit excellent parallel scalability.
Overview of Parallel Platforms for Common High Performance Computing
Directory of Open Access Journals (Sweden)
T. Fryza
2012-04-01
The paper deals with various parallel platforms used for high performance computing in the signal processing domain. More precisely, methods that exploit multicore central processing units, such as the message passing interface and OpenMP, are taken into account. The properties of the programming methods are experimentally verified in the application of a fast Fourier transform and a discrete cosine transform, and they are compared with the capabilities of MATLAB's built-in functions and Texas Instruments digital signal processors with very long instruction word architectures. New FFT and DCT implementations were proposed and tested. The implementation phase was compared with CPU-based computing methods and with the capabilities of the Texas Instruments digital signal processing library on C6747 floating-point DSPs. The optimal combination of computing methods in the signal processing domain and the implementation of new, fast routines is proposed as well.
DMA shared byte counters in a parallel computer
Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos
2010-04-06
A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
Determining collective barrier operation skew in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Faraj, Daniel A.
2015-11-24
Determining collective barrier operation skew in a parallel computer that includes a number of compute nodes organized into an operational group includes: for each of the nodes until each node has been selected as a delayed node: selecting one of the nodes as a delayed node; entering, by each node other than the delayed node, a collective barrier operation; entering, after a delay by the delayed node, the collective barrier operation; receiving an exit signal from a root of the collective barrier operation; and measuring, for the delayed node, a barrier completion time. The barrier operation skew is calculated by: identifying, from the compute nodes' barrier completion times, a maximum barrier completion time and a minimum barrier completion time and calculating the barrier operation skew as the difference of the maximum and the minimum barrier completion time.
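The final calculation in the abstract above, skew as the difference between the largest and smallest barrier completion times, can be sketched as follows. The function name barrier_skew, the dictionary layout, and the timing values are illustrative assumptions; the measurement loop itself (delaying each node in turn and timing the barrier) is not modelled here.

```python
from typing import Dict


def barrier_skew(completion_times: Dict[int, float]) -> float:
    """Return max minus min of per-node barrier completion times.

    completion_times maps a node id to the completion time measured when
    that node was the delayed node (hypothetical data layout).
    """
    if not completion_times:
        raise ValueError("at least one barrier completion time is required")
    times = completion_times.values()
    return max(times) - min(times)


# Illustrative measurements in microseconds, one per delayed node.
measured = {0: 12.5, 1: 10.0, 2: 11.25}
skew = barrier_skew(measured)
print(skew)
```

Here the slowest completion is 12.5 and the fastest is 10.0, so the computed skew is 2.5 in the same units as the inputs.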
Final Report: Super Instruction Architecture for Scalable Parallel Computations
Energy Technology Data Exchange (ETDEWEB)
Sanders, Beverly Ann [University of Florida; Bartlett, Rodney [University of Florida; Deumens, Erik [University of Florida
2013-12-23
The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACES III software package and used it to study interesting scientific problems.
Parallelized Vlasov-Fokker-Planck solver for desktop personal computers
Schönfeldt, Patrik; Brosi, Miriam; Schwarz, Markus; Steinmann, Johannes L.; Müller, Anke-Susanne
2017-03-01
The numerical solution of the Vlasov-Fokker-Planck equation is a well established method to simulate the dynamics, including the self-interaction with its own wake field, of an electron bunch in a storage ring. In this paper we present Inovesa, a modularly extensible program that uses OpenCL to massively parallelize the computation. It allows a standard desktop PC to work with appropriate accuracy and yield reliable results within minutes. We provide numerical stability studies over a wide parameter range and compare our numerical findings to known results. Simulation results for the case of coherent synchrotron radiation will be compared to measurements that probe the effects of the microbunching instability occurring in the short bunch operation at ANKA. It will be shown that the impedance model based on the shielding effect of two parallel plates can not only describe the instability threshold, but also the presence of multiple regimes that show differences in the emission of coherent synchrotron radiation.
Center for Programming Models for Scalable Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
John Mellor-Crummey
2008-02-29
Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implementation of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.
Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts
Energy Technology Data Exchange (ETDEWEB)
NONE
1997-12-31
This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.
Hybrid cloud and cluster computing paradigms for life science applications.
Qiu, Judy; Ekanayake, Jaliya; Gunarathne, Thilina; Choi, Jong Youl; Bae, Seung-Hee; Li, Hui; Zhang, Bingjing; Wu, Tak-Lon; Ruan, Yang; Ekanayake, Saliya; Hughes, Adam; Fox, Geoffrey
2010-12-21
Clouds and MapReduce have shown themselves to be a broadly useful approach to scientific computing especially for parallel data intensive applications. However they have limited applicability to some areas such as data mining because MapReduce has poor performance on problems with an iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI leading to a hybrid cloud and cluster environment. This motivates the design and implementation of an open source Iterative MapReduce system Twister. Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability comparisons in several important non iterative cases. These are linked to MPI applications for final stages of the data analysis. Further we have released the open source Twister Iterative MapReduce and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications. The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment while Twister promises a uniform programming environment for many Life Sciences applications. We used commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data intensive computing. Several applications were developed in MPI, MapReduce and Twister in these different environments.
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm
Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian
2017-03-01
DNA is one of the carriers of the genetic information of living organisms. Encoding, sequencing, and clustering DNA sequences have become key routine tasks in molecular biology, in particular in bioinformatics applications. There are two types of clustering: hierarchical clustering and partitioning clustering. In this paper, we combine the two types, K-Means (partitioning clustering) and DIANA (hierarchical clustering), into what is called hybrid clustering. Hybrid clustering using the Parallel K-Means algorithm and the DIANA algorithm is applied to cluster DNA sequences of Human Papillomavirus (HPV). The clustering process starts with collecting DNA sequences of HPV from NCBI (National Center for Biotechnology Information), followed by feature extraction from the DNA sequences. The extracted features are stored in matrix form; the matrix is then normalized using Min-Max normalization, and genetic distances are calculated using the Euclidean distance. The hybrid clustering is then applied using implementations of the Parallel K-Means and DIANA algorithms. The aim of using hybrid clustering is to obtain better clusters. To validate the resulting clusters and find the optimum number of clusters, we use the Davies-Bouldin Index (DBI). In this study, Parallel K-Means clustering groups the data into 5 clusters with a minimal DBI value of 0.8741, and hybrid clustering groups the data into 13 sub-clusters with minimal DBI values of 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The DBI values of hybrid clustering are lower than the DBI value of Parallel K-Means clustering, which is performed only at the first stage. This means that hybrid clustering gives better results for clustering DNA sequences of HPV than Parallel K-Means clustering alone.
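The preprocessing and validation steps named in this abstract (Min-Max normalization, Euclidean distance, Davies-Bouldin Index) can be sketched as follows. This is a minimal pure-Python illustration under assumed conventions, not the authors' parallel implementation; the function names and the list-of-lists data layout are hypothetical.

```python
import math

def min_max_normalize(rows):
    """Scale each column of a feature matrix to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def davies_bouldin(clusters):
    """Davies-Bouldin Index of a clustering; lower values mean better clusters.

    clusters: list of at least two non-empty clusters with distinct centroids,
    each cluster being a list of points.
    """
    cents = [centroid(c) for c in clusters]
    # Average distance of each cluster's members to its own centroid.
    scatter = [sum(euclidean(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in range(k) if j != i
        )
    return total / k
```

A lower index indicates tighter, better separated clusters, which is the sense in which the abstract compares the hybrid result against Parallel K-Means alone.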
CX: A Scalable, Robust Network for Parallel Computing
Directory of Open Access Journals (Sweden)
Peter Cappello
2002-01-01
CX, a network-based computational exchange, is presented. The system's design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, eager scheduling, and space-based coordination. The object-oriented API is simple, compact, and cleanly separates application logic from the logic that supports interprocess communication and fault tolerance. Computations, of course, run to completion in the presence of computational hosts that join and leave the ongoing computation. Such hosts, or producers, use task caching and prefetching to overlap computation with interprocessor communication. To break a potential task server bottleneck, a network of task servers is presented. Even though task servers are envisioned as reliable, the self-organizing, scalable network of n servers, described as a sibling-connected height-balanced fat tree, tolerates a sequence of n-1 server failures. Tasks are distributed throughout the server network via a simple "diffusion" process. CX is intended as a test bed for research on automated silent auctions, reputation services, authentication services, and bonding services. CX also provides a test bed for algorithm research into network-based parallel computation.
SPINET: A Parallel Computing Approach to Spine Simulations
Directory of Open Access Journals (Sweden)
Peter G. Kropf
1996-01-01
Research in scientific programming enables us to realize more and more complex applications, and on the other hand, application-driven demands on computing methods and power are continuously growing. Therefore, interdisciplinary approaches become more widely used. The interdisciplinary SPINET project presented in this article applies modern scientific computing tools to biomechanical simulations: parallel computing and symbolic and modern functional programming. The target application is the human spine. Simulations of the spine help us to investigate and better understand the mechanisms of back pain and spinal injury. Two approaches have been used: the first uses the finite element method for high-performance simulations of static biomechanical models, and the second generates a simulation development tool for experimenting with different dynamic models. A finite element program for static analysis has been parallelized for the MUSIC machine. To solve the sparse system of linear equations, a conjugate gradient solver (iterative method) and a frontal solver (direct method) have been implemented. The preprocessor required for the frontal solver is written in the modern functional programming language SML, the solver itself in C, thus exploiting the characteristic advantages of both functional and imperative programming. The speedup analysis of both solvers shows very satisfactory results for this irregular problem. A mixed symbolic-numeric environment for rigid body system simulations is presented. It automatically generates C code from a problem specification expressed by the Lagrange formalism using Maple.
Hyper-parallel photonic quantum computation with coupled quantum dots
Ren, Bao-Cang; Deng, Fu-Guo
2014-01-01
It is well known that a parallel quantum computer is more powerful than a classical one. So far, there are some important works about the construction of universal quantum logic gates, the key elements in quantum computation. However, they are focused on operating on one degree of freedom (DOF) of quantum systems. Here, we investigate the possibility of achieving scalable hyper-parallel quantum computation based on two DOFs of photon systems. We construct a deterministic hyper-controlled-not (hyper-CNOT) gate operating on both the spatial-mode and the polarization DOFs of a two-photon system simultaneously, by exploiting the giant optical circular birefringence induced by quantum-dot spins in double-sided optical microcavities as a result of cavity quantum electrodynamics (QED). This hyper-CNOT gate is implemented by manipulating the four qubits in the two DOFs of a two-photon system without auxiliary spatial modes or polarization modes. It reduces the operation time and the resources consumed in quantum information processing, and it is more robust against the photonic dissipation noise, compared with the integration of several cascaded CNOT gates in one DOF. PMID:24721781
Large-scale parallel genome assembler over cloud computing environment.
Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong
2017-06-01
The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.
Parallel Hybrid Gas-Electric Geared Turbofan Engine Conceptual Design and Benefits Analysis
Lents, Charles; Hardin, Larry; Rheaume, Jonathan; Kohlman, Lee
2016-01-01
The conceptual design of a parallel gas-electric hybrid propulsion system for a conventional single aisle twin engine tube and wing vehicle has been developed. The study baseline vehicle and engine technology are discussed, followed by results of the hybrid propulsion system sizing and performance analysis. The weights analysis for the electric energy storage & conversion system and thermal management system is described. Finally, the potential system benefits are assessed.
Implementing a Commercial-Strength Parallel Hybrid Movie Recommendation Engine
DEFF Research Database (Denmark)
Amolochitis, Emmanouil; Christou, Ioannis T.; Tan, Zheng-Hua
2014-01-01
AMORE is a hybrid recommendation system that provides movie recommendations for a major triple-play services provider in Greece. Combined with our own implementations of several user-, item-, and content-based recommendation algorithms, AMORE significantly outperforms other state-of-the-art implementations both in solution quality and response time. AMORE currently serves daily recommendation requests for all active subscribers of the provider's video-on-demand services and has contributed to an increase of rental profits and customer retention.
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
... offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses ... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-efficiency and respectable throughput, rather than...
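For readers unfamiliar with the skyline operator, the point-to-point comparison that this abstract tries to avoid is the dominance test. The sketch below is the naive sequential baseline with a quadratic number of comparisons, not SkyAlign and not a GPU algorithm; it assumes that smaller values are better in every dimension, and the function names are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one.

    Assumes smaller values are preferred.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Return the points that no other point dominates (naive all-pairs scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Partitioning schemes such as the one described in the abstract aim to skip most of these pairwise `dominates` calls.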
Routing performance analysis and optimization within a massively parallel computer
Archer, Charles Jens; Peters, Amanda; Pinnow, Kurt Walter; Swartz, Brent Allen
2013-04-16
An apparatus, program product and method optimize the operation of a massively parallel computer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.
Parallel Computation of Persistent Homology using the Blowup Complex
Energy Technology Data Exchange (ETDEWEB)
Lewis, Ryan [Stanford Univ., CA (United States); Morozov, Dmitriy [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
2015-04-27
We describe a parallel algorithm that computes persistent homology, an algebraic descriptor of a filtered topological space. Our algorithm is distinguished by operating on a spatial decomposition of the domain, as opposed to a decomposition with respect to the filtration. We rely on a classical construction, called the Mayer--Vietoris blowup complex, to glue global topological information about a space from its disjoint subsets. We introduce an efficient algorithm to perform this gluing operation, which may be of independent interest, and describe how to process the domain hierarchically. We report on a set of experiments that help assess the strengths and identify the limitations of our method.
Hybrid fluid/kinetic model for parallel heat conduction
Energy Technology Data Exchange (ETDEWEB)
Callen, J.D.; Hegna, C.C.; Held, E.D. [Univ. of Wisconsin, Madison, WI (United States)
1998-12-31
It is argued that in order to use fluid-like equations to model low frequency ({omega} < {nu}) phenomena such as neoclassical tearing modes in low collisionality ({nu} < {omega}{sub b}) tokamak plasmas, a Chapman-Enskog-like approach is most appropriate for developing an equation for the kinetic distortion (F) of the distribution function whose velocity-space moments lead to the needed fluid moment closure relations. Further, parallel heat conduction in a long collision mean free path regime can be described through a combination of a reduced phase space Chapman-Enskog-like approach for the kinetics and a multiple-time-scale analysis for the fluid and kinetic equations.
Representing and computing regular languages on massively parallel networks
Energy Technology Data Exchange (ETDEWEB)
Miller, M.I.; O' Sullivan, J.A. (Electronic Systems and Research Lab., of Electrical Engineering, Washington Univ., St. Louis, MO (US)); Boysam, B. (Dept. of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Inst., Troy, NY (US)); Smith, K.R. (Dept. of Electrical Engineering, Southern Illinois Univ., Edwardsville, IL (US))
1991-01-01
This paper proposes a general method for incorporating rule-based constraints corresponding to regular languages into stochastic inference problems, thereby allowing for a unified representation of stochastic and syntactic pattern constraints. The authors' approach first establishes the formal connection of rules to Chomsky grammars, and generalizes the original work of Shannon on the encoding of rule-based channel sequences to Markov chains of maximum entropy. This maximum entropy probabilistic view leads to Gibbs representations with potentials which have their number of minima growing at precisely the exponential rate that the language of deterministically constrained sequences grows. These representations are coupled to stochastic diffusion algorithms, which sample the language-constrained sequences by visiting the energy minima according to the underlying Gibbs probability law. The coupling to stochastic search methods yields the all-important practical result that fully parallel stochastic cellular automata may be derived to generate samples from the rule-based constraint sets. The production rules and neighborhood state structure of the language of sequences directly determine the necessary connection structures of the required parallel computing surface. Representations of this type have been mapped to the DAP-510 massively-parallel processor consisting of 1024 mesh-connected bit-serial processing elements for performing automated segmentation of electron-micrograph images.
Institute of Scientific and Technical Information of China (English)
ZHANG Yanting; WANG Qingfeng; XIAO Qing; FU Qiang
2006-01-01
Limitations of various accumulators in hybrid hydraulic excavator are analyzed. A program using capacitor as the accumulator based on constant work-point control is put forward. A simulating experimental system of hybrid construction machinery is established, and experimental study on constant work-point control for parallel hybrid system with capacitor accumulator is carried out using the pressure and flow rate derived from boom cylinder of hydraulic excavator in actual work as the simulating loads. A program of double work-point control is proposed and proved by further experiments.
QPACE -- a QCD parallel computer based on Cell processors
Baier, H; Drochner, M; Eicker, N; Fischer, U; Fodor, Z; Frommer, A; Gomez, C; Goldrian, G; Heybrock, S; Hierl, D; Hüsken, M; Huth, T; Krill, B; Lauritsen, J; Lippert, T; Maurer, T; Meyer, N; Nobile, A; Ouda, I; Pivanti, M; Pleiter, D; Schäfer, A; Schick, H; Schifano, F; Simma, H; Solbrig, S; Streuer, T; Sulanke, K -H; Tripiccione, R; Vogt, J -S; Wettig, T; Winter, F
2009-01-01
QPACE is a novel parallel computer which has been developed to be primarily used for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor that is used in the Playstation 3. The QPACE nodes are interconnected by a custom, application optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack a new water cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.
Parallel computing-based sclera recognition for human identification
Lin, Yong; Du, Eliza Y.; Zhou, Zhi
2012-06-01
Compared to iris recognition, sclera recognition using a line descriptor can achieve comparable recognition accuracy in visible wavelengths. However, this method is too time-consuming to be implemented in a real-time system. In this paper, we propose a GPU-based parallel computing approach to reduce the sclera recognition time. We define a new descriptor to which information about the KD tree structure and the sclera edge is added. The registration and matching task is divided into subtasks of various sizes according to their computational complexity. The affine transform parameters are generated by searching the KD tree. Texture memory, constant memory, and shared memory are used to store templates and transform matrices. The experimental results show that the proposed method executed on a GPU can improve the sclera matching speed by hundreds of times without decreasing accuracy.
Encryption as a Service using Parallel Computing Frameworks
Directory of Open Access Journals (Sweden)
Alexandru Costin Stanimir
2013-12-01
In this article I present a study of an implementation named Encryption as a Service, a web service that can be deployed on a variety of devices and that can take advantage of parallelism to provide the basic functionality of a cryptographic system: encrypting, decrypting and storing data. This goal was achieved by implementing the symmetric-key Advanced Encryption Standard (AES) algorithm using the Open Computing Language (OpenCL) and exposing this functionality through a REST web service. The performance results were obtained by deploying this solution on the Windows Azure platform, in order to take advantage of 20x CPU computing power, on an Amazon Web Services platform equipped with 2x Nvidia Tesla K20 GPUs, and on regular home user hardware. This study represents a first step in a broader project whose final goal is to provide full support for all encryption algorithms.
Application Specific Performance Technology for Productive Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Malony, Allen D. [Univ. of Oregon, Eugene, OR (United States); Shende, Sameer [Univ. of Oregon, Eugene, OR (United States)
2008-09-30
Our accomplishments over the last three years of the DOE project Application-Specific Performance Technology for Productive Parallel Computing (DOE Agreement: DE-FG02-05ER25680) are described below. The project will have met all of its objectives by the time of its completion at the end of September, 2008. Two extensive yearly progress reports were produced in March 2006 and 2007 and were previously submitted to the DOE Office of Advanced Scientific Computing Research (OASCR). Following an overview of the objectives of the project, we summarize for each of the project areas the achievements in the first two years, and then describe in some more detail the project accomplishments this past year. At the end, we discuss the relationship of the proposed renewal application to the work done on the current project.
Progress in Parallel Schur Complement Preconditioning for Computational Fluid Dynamics
Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
We consider preconditioning methods for nonself-adjoint advective-diffusive systems based on a non-overlapping Schur complement procedure for arbitrary triangulated domains. The ultimate goal of this research is to develop scalable preconditioning algorithms for fluid flow discretizations on parallel computing architectures. In our implementation of the Schur complement preconditioning technique, the triangulation is first partitioned into a number of subdomains using the METIS multi-level k-way partitioning code. This partitioning induces a natural 2X2 partitioning of the p.d.e. discretization matrix. By considering various inverse approximations of the 2X2 system, we have developed a family of robust preconditioning techniques. A computer code based on these ideas has been developed and tested on the IBM SP2 and the SGI Power Challenge array using MPI message passing protocol. A number of example CFD calculations will be presented to illustrate and assess various Schur complement approximations.
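The 2X2 partitioning mentioned in this abstract can be written out explicitly. The block form below is the standard textbook factorization for a non-overlapping decomposition, not a formula taken from the paper; the subscripts $I$ (subdomain interior unknowns) and $\Gamma$ (interface unknowns) and the symbol $\mathrm{Id}$ for the identity are notation assumed here.

```latex
A =
\begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
=
\begin{pmatrix} \mathrm{Id} & 0 \\ A_{\Gamma I} A_{II}^{-1} & \mathrm{Id} \end{pmatrix}
\begin{pmatrix} A_{II} & A_{I\Gamma} \\ 0 & S \end{pmatrix},
\qquad
S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}.
```

Because the subdomains do not overlap, $A_{II}$ is block diagonal with one block per subdomain, so its action can be computed independently on each processor. Replacing $A_{II}^{-1}$ or $S$ by cheaper approximations is one plausible reading of the "various inverse approximations" the abstract refers to.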
A novel energy recovery system for parallel hybrid hydraulic excavator.
Li, Wei; Cao, Baoyu; Zhu, Zhencai; Chen, Guoan
2014-01-01
Hydraulic excavator energy saving is important to relieve source shortage and protect environment. This paper mainly discusses the energy saving for the hybrid hydraulic excavator. By analyzing the excess energy of three hydraulic cylinders in the conventional hydraulic excavator, a new boom potential energy recovery system is proposed. The mathematical models of the main components including boom cylinder, hydraulic motor, and hydraulic accumulator are built. The natural frequency of the proposed energy recovery system is calculated based on the mathematical models. Meanwhile, the simulation models of the proposed system and a conventional energy recovery system are built by AMESim software. The results show that the proposed system is more effective than the conventional energy saving system. At last, the main components of the proposed energy recovery system including accumulator and hydraulic motor are analyzed for improving the energy recovery efficiency. The measures to improve the energy recovery efficiency of the proposed system are presented.
Energy control strategy for parallel hydrostatic transmission hybrid vehicles
Institute of Scientific and Technical Information of China (English)
SUN Hui; JIANG Ji-hai; WANG Xin
2009-01-01
Aimed at the relatively lower energy density and complicated coordinating operation between two power sources, a special energy control strategy is required to maximize the fuel saving potential. Then a new type of configuration for hydrostatic transmission hybrid vehicles (PHHV) and the selection criterion for important components are proposed. Based on the optimization of planet gear transmission ratio and the analysis of optimal energy distribution for the proposed PHHV on a representative urban driving cycle, a fuzzy torque control strategy and a braking energy regeneration strategy are designed and developed to realize the real-time control of energy for the proposed PHHV. Simulation results demonstrate that the energy control strategy effectively improves the fuel economy of PHHV.
Harmon, Frederick G; Frank, Andrew A; Joshi, Sanjay S
2005-01-01
A Simulink model, a propulsion energy optimization algorithm, and a CMAC controller were developed for a small parallel hybrid-electric unmanned aerial vehicle (UAV). The hybrid-electric UAV is intended for military, homeland security, and disaster-monitoring missions involving intelligence, surveillance, and reconnaissance (ISR). The Simulink model is a forward-facing simulation program used to test different control strategies. The flexible energy optimization algorithm for the propulsion system allows relative importance to be assigned between the use of gasoline, electricity, and recharging. A cerebellar model arithmetic computer (CMAC) neural network approximates the energy optimization results and is used to control the parallel hybrid-electric propulsion system. The hybrid-electric UAV with the CMAC controller uses 67.3% less energy than a two-stroke gasoline-powered UAV during a 1-h ISR mission and 37.8% less energy during a longer 3-h ISR mission.
Energy Proportionality and Performance in Data Parallel Computing Clusters
Energy Technology Data Exchange (ETDEWEB)
Kim, Jinoh; Chou, Jerry; Rotem, Doron
2011-02-14
Energy consumption in datacenters has recently become a major concern due to the rising operational costs and scalability issues. Recent solutions to this problem propose the principle of energy proportionality, i.e., the amount of energy consumed by the server nodes must be proportional to the amount of work performed. For data parallelism and fault tolerance purposes, most common file systems used in MapReduce-type clusters maintain a set of replicas for each data block. A covering set is a group of nodes that together contain at least one replica of the data blocks needed for performing computing tasks. In this work, we develop and analyze algorithms to maintain energy proportionality by discovering a covering set that minimizes energy consumption while placing the remaining nodes in low-power standby mode. Our algorithms can also discover covering sets in heterogeneous computing environments. In order to allow more data parallelism, we generalize our algorithms so that they can discover k-covering sets, i.e., a set of nodes that contain at least k replicas of the data blocks. Our experimental results show that we can achieve substantial energy saving without significant performance loss in diverse cluster configurations and working environments.
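The covering-set idea can be illustrated with the classic greedy set-cover heuristic. This is a sketch of the concept only, assuming every node costs the same energy; it is not the authors' algorithm, and the node names and the `replicas` mapping are hypothetical.

```python
def greedy_covering_set(replicas):
    """Pick nodes until every data block has at least one replica on a chosen node.

    replicas: dict mapping a node name to the set of block ids stored on it.
    Returns the chosen nodes in selection order; all other nodes could be
    placed in standby. Greedy choice is a heuristic, not guaranteed minimal.
    """
    uncovered = set().union(*replicas.values())
    chosen = []
    while uncovered:
        # Take the node covering the most still-uncovered blocks;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(replicas), key=lambda n: len(replicas[n] & uncovered))
        chosen.append(best)
        uncovered -= replicas[best]
    return chosen
```

With heterogeneous nodes, the selection key would presumably weigh coverage against each node's power draw; that extension is left out here.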
A Hybrid FPGA/Coarse Parallel Processing Architecture for Multi-modal Visual Feature Descriptors
DEFF Research Database (Denmark)
Jensen, Lars Baunegaard With; Kjær-Nielsen, Anders; Alonso, Javier Díaz
2008-01-01
This paper describes the hybrid architecture developed for speeding up the processing of so-called multi-modal visual primitives which are sparse image descriptors extracted along contours. In the system, the first stages of visual processing are implemented on FPGAs due to their highly parallel...
Micrometer and nanometer-scale parallel patterning of ceramic and organic-inorganic hybrid materials
ten Elshof, Johan E.; Khan, Sajid; Göbel, Ole
2010-01-01
This review gives an overview of the progress made in recent years in the development of low-cost parallel patterning techniques for ceramic materials, silica, and organic–inorganic silsesquioxane-based hybrids from wet-chemical solutions and suspensions on the micrometer and nanometer-scale. The
Orlando, Roberto; Delle Piane, Massimo; Bush, Ian J; Ugliengo, Piero; Ferrabone, Matteo; Dovesi, Roberto
2012-10-30
Fully ab initio treatment of complex solid systems needs computational software which is able to efficiently take advantage of the growing power of high performance computing (HPC) architectures. Recent improvements in CRYSTAL, a periodic ab initio code that uses a Gaussian basis set, allow treatment of very large unit cells for crystalline systems on HPC architectures with high parallel efficiency in terms of running time and memory requirements. The latter is a crucial point, due to the trend toward architectures relying on a very high number of cores with associated relatively low memory availability. An exhaustive performance analysis shows that density functional calculations, based on a hybrid functional, of low-symmetry systems containing up to 100,000 atomic orbitals and 8000 atoms are feasible on the most advanced HPC architectures available to European researchers today, using thousands of processors.
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Decomposition for optimal synthesis in a parallel computing environment
Energy Technology Data Exchange (ETDEWEB)
Bhatt, V.D.
1989-01-01
Many practical problems in science and engineering require extensive simplification and abstraction in order to be tractable in currently existing computing environments. Obvious examples are: computational fluid dynamics; nuclear and plasma physics; petroleum reservoir modelling; computer graphics and image processing; and structural synthesis. This research will be useful in these (as well as other) application fields, but structural synthesis has been chosen as the area of application. The research has been inspired by a philosophy of design called multilevel decomposition, which enables solution of these problems in a reasonable time. In this approach a complex system (for example, a modern aircraft or automobile) is decomposed into several manageably smaller subsystems. These subsystems are solved independently without losing integrity with the main or parent system, ultimately achieving satisfactory results for the large-scale system. In addition to making problem solution more tractable, the decomposition approach is compatible with a typical design office multidisciplinary organization and the parallel or distributed computing technology existing today. Several example problems (including classical problems in the field and practical applications from industry) have been used to check the validity of the approach.
Low latency, high bandwidth data communications between compute nodes in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2010-11-02
Methods, parallel computers, and computer program products are disclosed for low latency, high bandwidth data communications between compute nodes in a parallel computer. Embodiments include receiving, by an origin direct memory access (`DMA`) engine of an origin compute node, data for transfer to a target compute node; sending, by the origin DMA engine of the origin compute node to a target DMA engine on the target compute node, a request to send (`RTS`) message; transferring, by the origin DMA engine, a predetermined portion of the data to the target compute node using a memory FIFO operation; determining, by the origin DMA engine, whether an acknowledgement of the RTS message has been received from the target DMA engine; if an acknowledgement of the RTS message has not been received, transferring, by the origin DMA engine, another predetermined portion of the data to the target compute node using a memory FIFO operation; and if the acknowledgement of the RTS message has been received by the origin DMA engine, transferring, by the origin DMA engine, any remaining portion of the data to the target compute node using a direct put operation.
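The transfer sequence in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `transfer`, the chunk size, and the `ack_after` parameter (a stand-in for network timing, giving the number of memory FIFO chunks sent before the RTS acknowledgement arrives) are hypothetical and do not come from the patent.

```python
def transfer(data, chunk_size, ack_after):
    """Simulate the origin DMA engine's send sequence.

    Returns the issued operations as (operation, payload) tuples.
    `ack_after` is a hypothetical stand-in for timing: the RTS
    acknowledgement counts as received once that many memory FIFO
    chunks have been sent.
    """
    log = [("rts", b"")]  # the request-to-send message goes out first
    # A first predetermined portion always travels through the memory FIFO.
    log.append(("memory_fifo", data[:chunk_size]))
    offset = min(chunk_size, len(data))
    fifo_sent = 1
    while offset < len(data):
        if fifo_sent >= ack_after:
            # Acknowledged: move everything that is left with one direct put.
            log.append(("direct_put", data[offset:]))
            offset = len(data)
        else:
            # Not acknowledged yet: keep data flowing through the memory FIFO.
            log.append(("memory_fifo", data[offset:offset + chunk_size]))
            offset = min(offset + chunk_size, len(data))
            fifo_sent += 1
    return log
```

In this sketch the FIFO chunks hide the latency of the RTS round trip, and the direct put takes over as soon as the target is known to be ready.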
Cluster implementation for parallel computation within MATLAB software environment
Energy Technology Data Exchange (ETDEWEB)
Santana, Antonio O. de; Dantas, Carlos C.; Charamba, Luiz G. da R.; Souza Neto, Wilson F. de; Melo, Silvio B. Melo; Lima, Emerson A. de O., E-mail: mailto.aos@ufpe.br, E-mail: ccd@ufpe.br, E-mail: sbm@cin.ufpe.br, E-mail: emathematics@gmail.com [Universidade Federal de Pernambuco (UFPE), Recife, PE (Brazil)
2013-07-01
A cluster for parallel computation with MATLAB software, the COCGT (Cluster for Optimizing Computing in Gamma ray Transmission methods), is implemented. The implementation corresponds to the creation of a local network of computers, the installation and configuration of software, and the execution of cluster tests to determine and optimize performance in data processing. The COCGT implementation was required for computing data from gamma transmission measurements applied to fluid dynamics and tomographic reconstruction in an FCC (Fluid Catalytic Cracking) cold pilot unit, as well as simulation data. As an initial test, the determination of the SVD (Singular Value Decomposition) of a random matrix with dimension (n, n), n=1000, using the modified Girko's law, revealed that COCGT was faster than the cluster in the literature [1], which is similar and operates under the same conditions. Solution of a system of linear equations provided a new test of COCGT performance: for a square matrix with n=10000, computing time was 27 s, and for a square matrix with n=12000, computation time was 45 s. To determine the cluster behavior in relation to 'parfor' (parallel for-loop) and 'spmd' (single program multiple data), two codes were used containing those two commands and the same problem: determination of the SVD of a square matrix with n=1000. The execution of the codes on COCGT showed: 1) for the code with 'parfor', the performance improved as the number of labs increased from 1 to 8; 2) for the code with 'spmd', just 1 lab (core) was enough to process and give results in less than 1 s. A similar test followed, with the difference that the SVD was determined from a square matrix with n1500 for the code with 'parfor', and n=7000 for the code with 'spmd'. Those results lead to the conclusions: 1) for the code with 'parfor', the behavior was the same as already described above; 2) for code with '...
Parallel processing using an optical delay-based reservoir computer
Van der Sande, Guy; Nguimdo, Romain Modeste; Verschaffelt, Guy
2016-04-01
Delay systems subject to delayed optical feedback have recently shown great potential in solving computationally hard tasks. By implementing a neuro-inspired computational scheme relying on the transient response to optical data injection, high processing speeds have been demonstrated. However, reservoir computing systems based on delay dynamics discussed in the literature are designed by coupling many different stand-alone components, which leads to bulky, non-monolithic systems that lack long-term stability. Here we numerically investigate the possibility of implementing reservoir computing schemes based on semiconductor ring lasers (SRLs). Semiconductor ring lasers are semiconductor lasers where the laser cavity consists of a ring-shaped waveguide. SRLs are highly integrable and scalable, making them ideal candidates for key components in photonic integrated circuits. SRLs can generate light in two counterpropagating directions between which bistability has been demonstrated. We demonstrate that two independent machine learning tasks, even with inputs of a different nature and different input data signals, can be simultaneously computed using a single photonic nonlinear node relying on the parallelism offered by photonics. We illustrate the performance on simultaneous chaotic time series prediction and a classification of the Nonlinear Channel Equalization. We take advantage of different directional modes to process individual tasks. Each directional mode processes one individual task to mitigate possible crosstalk between the tasks. Our results indicate that prediction/classification with errors comparable to the state-of-the-art performance can be obtained even with noise despite the two tasks being computed simultaneously. We also find that a good performance is obtained for both tasks for a broad range of the parameters. The results are discussed in detail in [Nguimdo et al., IEEE Trans. Neural Netw. Learn. Syst. 26, pp. 3301-3307, 2015].
Institute of Scientific and Technical Information of China (English)
I.Yamazaki; X.S.Li; E.G.Ng
2010-01-01
A parallel hybrid linear solver based on the Schur complement method has the potential to balance the robustness of direct solvers with the efficiency of preconditioned iterative solvers. However, when solving large-scale highly-indefinite linear systems, this hybrid solver often suffers from either slow convergence or large memory requirements to solve the Schur complement systems. To overcome this challenge, in this paper we discuss techniques to preprocess the Schur complement systems in parallel. Numerical results of solving large-scale highly-indefinite linear systems from various applications demonstrate that these techniques improve the reliability and performance of the hybrid solver and enable efficient solutions of these linear systems on hundreds of processors, which was previously infeasible using existing state-of-the-art solvers.
Development of magnetron sputtering simulator with GPU parallel computing
Sohn, Ilyoup; Kim, Jihun; Bae, Junkyeong; Lee, Jinpil
2014-12-01
Sputtering devices are widely used in the semiconductor and display panel manufacturing process. Currently, a number of surface treatment applications using magnetron sputtering techniques are being used to improve the efficiency of the sputtering process, through the installation of magnets outside the vacuum chamber. Within the internal space of the low pressure chamber, plasma generated from the combination of a rarefied gas and an electric field is influenced interactively. Since the quality of the sputtering and deposition rate on the substrate is strongly dependent on the multi-physical phenomena of the plasma regime, numerical simulations using PIC-MCC (Particle In Cell, Monte Carlo Collision) should be employed to develop an efficient sputtering device. In this paper, the development of a magnetron sputtering simulator based on the PIC-MCC method and the associated numerical techniques are discussed. To solve the electric field equations in the 2-D Cartesian domain, a Poisson equation solver based on the FDM (Finite Differencing Method) is developed and coupled with the Monte Carlo Collision method to simulate the motion of gas particles influenced by an electric field. The magnetic field created from the permanent magnet installed outside the vacuum chamber is also numerically calculated using Biot-Savart's Law. All numerical methods employed in the present PIC code are validated by comparison with analytical and well-known commercial engineering software results, with all of the results showing good agreement. Finally, the developed PIC-MCC code is parallelized to be suitable for general purpose computing on graphics processing unit (GPGPU) acceleration, so as to reduce the large computation time which is generally required for particle simulations. The efficiency and accuracy of the GPGPU parallelized magnetron sputtering simulator are examined by comparison with the calculated results and computation times from the original serial code. It is found that
Seismic imaging using finite-differences and parallel computers
Energy Technology Data Exchange (ETDEWEB)
Ober, C.C. [Sandia National Labs., Albuquerque, NM (United States)
1997-12-31
A key to reducing the risks and costs associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Prestack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar wave equation using finite differences. As part of an ongoing ACTI project funded by the US Department of Energy, a finite difference, 3-D prestack, depth migration code has been developed. The goal of this work is to demonstrate that massively parallel computers can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite difference, prestack, depth migration practical for oil and gas exploration. Several problems had to be addressed to get an efficient code for the Intel Paragon. These include efficient I/O, efficient parallel tridiagonal solves, and high single-node performance. Furthermore, to provide portable code the author has been restricted to the use of high-level programming languages (C and Fortran) and interprocessor communications using MPI. He has been using the SUNMOS operating system, which has affected many of his programming decisions. He will present images created from two verification datasets (the Marmousi Model and the SEG/EAEG 3D Salt Model). Also, he will show recent images from real datasets, and point out locations of improved imaging. Finally, he will discuss areas of current research which will hopefully improve the image quality and reduce computational costs.
Pacing a data transfer operation between compute nodes on a parallel computer
Blocksome, Michael A.
2011-09-13
Methods, systems, and products are disclosed for pacing a data transfer between compute nodes on a parallel computer that include: transferring, by an origin compute node, a chunk of an application message to a target compute node; sending, by the origin compute node, a pacing request to a target direct memory access (`DMA`) engine on the target compute node using a remote get DMA operation; determining, by the origin compute node, whether a pacing response to the pacing request has been received from the target DMA engine; and transferring, by the origin compute node, a next chunk of the application message if the pacing response to the pacing request has been received from the target DMA engine.
Nordic Summer School on Parallel Computing in Optimization
Pardalos, Panos; Storøy, Sverre
1997-01-01
During the last three decades, breakthroughs in computer technology have made a tremendous impact on optimization. In particular, parallel computing has made it possible to solve larger and computationally more difficult problems. This volume contains mainly lecture notes from a Nordic Summer School held at the Linkoping Institute of Technology, Sweden in August 1995. In order to make the book more complete, a few authors were invited to contribute chapters that were not part of the course on this first occasion. The purpose of this Nordic course in advanced studies was three-fold. One goal was to introduce the students to the new achievements in a new and very active field, bring them close to world leading researchers, and strengthen their competence in an area with internationally explosive rate of growth. A second goal was to strengthen the bonds between students from different Nordic countries, and to encourage collaboration and joint research ventures over the borders. In this respect, the course bui...
Zhang, Jiapu
2010-01-01
Evolutionary algorithms are parallel computing algorithms, and the simulated annealing algorithm is a sequential computing algorithm. This paper inserts simulated annealing into evolutionary computations and successfully develops a hybrid Self-Adaptive Evolutionary Strategy $\mu+\lambda$ method and a hybrid Self-Adaptive Classical Evolutionary Programming method. Numerical results on more than 40 benchmark test problems of global optimization show that the hybrid methods presented in this paper are very effective. Lennard-Jones potential energy minimization is another benchmark for testing new global optimization algorithms. It is studied in this paper through amyloid fibril constructions. To date, there is little molecular structural data available on the AGAAAAGA palindrome in the hydrophobic region (113-120) of prion proteins. This region belongs to the N-terminal unstructured region (1-123) of prion proteins, the structure of which has proved hard to determine using NMR spectroscopy or X-ray crystallography ...
Parallelized Vlasov-Fokker-Planck solver for desktop personal computers
Directory of Open Access Journals (Sweden)
Patrik Schönfeldt
2017-03-01
The numerical solution of the Vlasov-Fokker-Planck equation is a well established method to simulate the dynamics, including the self-interaction with its own wake field, of an electron bunch in a storage ring. In this paper we present Inovesa, a modularly extensible program that uses OpenCL to massively parallelize the computation. It allows a standard desktop PC to work with appropriate accuracy and yield reliable results within minutes. We provide numerical stability-studies over a wide parameter range and compare our numerical findings to known results. Simulation results for the case of coherent synchrotron radiation will be compared to measurements that probe the effects of the microbunching instability occurring in the short bunch operation at ANKA. It will be shown that the impedance model based on the shielding effect of two parallel plates can not only describe the instability threshold, but also the presence of multiple regimes that show differences in the emission of coherent synchrotron radiation.
A study on the GPU based parallel computation of a projection image
Lee, Hyunjeong; Han, Miseon; Kim, Jeongtae
2017-05-01
Fast computation of projection images is crucial in many applications such as medical image reconstruction and light field image processing. To do that, parallelization of the computation and efficient implementation of the computation using a parallel processor such as GPGPU (General-Purpose computing on Graphics Processing Units) is essential. In this research, we investigate methods for parallel computation of projection images and efficient implementation of the methods using CUDA (Compute Unified Device Architecture). We also study how to efficiently use the memory of GPU for the parallel processing.
Efficient relaxed-Jacobi smoothers for multigrid on parallel computers
Yang, Xiang; Mittal, Rajat
2017-03-01
In this Technical Note, we present a family of Jacobi-based multigrid smoothers suitable for the solution of discretized elliptic equations. These smoothers are based on the idea of scheduled-relaxation Jacobi proposed recently by Yang & Mittal (2014) [18] and employ two or three successive relaxed Jacobi iterations with relaxation factors derived so as to maximize the smoothing property of these iterations. The performance of these new smoothers, measured in terms of convergence acceleration and computational workload, is assessed for multi-domain implementations typical of parallelized solvers, and compared to the lexicographic point Gauss-Seidel smoother. The tests include the geometric multigrid method on structured grids as well as the algebraic grid method on unstructured grids. The tests demonstrate that, unlike Gauss-Seidel, the convergence of these Jacobi-based smoothers is unaffected by domain decomposition, and furthermore, they outperform the lexicographic Gauss-Seidel by factors that increase with domain partition count.
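The idea of applying a few successive relaxed Jacobi iterations as a smoother can be sketched for the 1-D model problem -u'' = f with fixed boundary values. This is an illustration only: the function name `relaxed_jacobi` and the relaxation factors used in the usage lines are hypothetical placeholders, not the optimized factors derived in the Technical Note.

```python
def relaxed_jacobi(u, f, h, omegas):
    """Apply one relaxed Jacobi sweep per factor in `omegas` to -u'' = f.

    `u` holds the grid values including the two fixed boundary values,
    `f` holds the right-hand side on the same grid, `h` is the spacing.
    """
    for omega in omegas:
        new = u[:]  # Jacobi: every update reads only the previous iterate
        for i in range(1, len(u) - 1):
            jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            new[i] = (1.0 - omega) * u[i] + omega * jacobi
        u = new
    return u


# Usage: damp an oscillatory error with two sweeps (placeholder factors).
error = [0.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 0.0]
smoothed = relaxed_jacobi(error, [0.0] * 9, 1.0 / 8.0, (0.8, 0.6))
```

Because each sweep reads only the previous iterate, the result does not depend on the order in which points are visited, which is consistent with the abstract's observation that domain decomposition leaves the convergence of Jacobi-based smoothers unaffected.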
Optimized collectives using a DMA on a parallel computer
Chen, Dong; Gabor, Dozsa; Giampapa, Mark E.; Heidelberger, Phillip
2011-02-08
Optimizing collective operations using direct memory access controller on a parallel computer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. The byte counter includes at least a base address of memory and a byte count associated with a submessage. A byte counter associated with a submessage is monitored to determine whether at least a block of data of the submessage has been received. The block of data has a predetermined size, for example, a number of bytes. The block is processed when the block has been fully received, for example, when the byte count indicates all bytes of the block have been received. The monitoring and processing may continue for all blocks in all submessages in the message.
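The block-by-block monitoring described in this abstract can be sketched as follows. The class and function names, the `received` field, and the callback signature are assumptions made for illustration; the abstract only states that a byte counter carries a base address and a byte count and that a block is processed once it has fully arrived.

```python
from dataclasses import dataclass


@dataclass
class ByteCounter:
    base_address: int  # where the submessage lands in memory
    total_bytes: int   # size of the submessage
    received: int = 0  # bytes written so far (hypothetical field)


def drain_ready_blocks(counter, processed_blocks, block_size, process):
    """Process every block that has fully arrived.

    `processed_blocks` is how many blocks were handled on earlier calls;
    the updated count is returned so the caller can poll again later.
    """
    total_blocks = -(-counter.total_bytes // block_size)  # ceiling division
    while processed_blocks < total_blocks:
        block_start = processed_blocks * block_size
        block_end = min(block_start + block_size, counter.total_bytes)
        if counter.received < block_end:
            break  # the next block is still incomplete
        process(counter.base_address + block_start, block_end - block_start)
        processed_blocks += 1
    return processed_blocks
```

A caller would keep one counter per submessage and call `drain_ready_blocks` repeatedly until every block of every submessage has been processed, so that work on early blocks overlaps with the arrival of later ones.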
Modelling of data uncertainties on hybrid computers
Energy Technology Data Exchange (ETDEWEB)
Schneider, Anke (ed.)
2016-06-15
The codes d³f and r³t are well established for modelling density-driven flow and nuclide transport in the far field of repositories for hazardous material in deep geological formations. They are applicable in porous media as well as in fractured rock or mudstone, for modelling salt- and heat transport as well as a free groundwater surface. Development of the basic framework of d³f and r³t had begun more than 20 years ago. Since that time significant advancements took place in the requirements for safety assessment as well as for computer hardware development. The period of safety assessment for a repository of high-level radioactive waste was extended to 1 million years, and the complexity of the models is steadily growing. Concurrently, the demands on accuracy increase. Additionally, model and parameter uncertainties become more and more important for an increased understanding of prediction reliability. All this leads to a growing demand for computational power that requires a considerable software speed-up. An effective way to achieve this is the use of modern, hybrid computer architectures which requires basically the set-up of new data structures and a corresponding code revision but offers a potential speed-up by several orders of magnitude. The original codes d³f and r³t were applications of the software platform UG /BAS 94/ whose development had begun in the early nineteen-nineties. However, UG had recently been advanced to the C++ based, substantially revised version UG4 /VOG 13/. To benefit also in the future from state-of-the-art numerical algorithms and to use hybrid computer architectures, the codes d³f and r³t were transferred to this new code platform. Making use of the fact that coupling between different sets of equations is natively supported in UG4, d³f and r³t were combined to one conjoint code d³f++. A direct estimation of uncertainties for complex groundwater flow models with the
Gutzwiller, David; Gontier, Mathieu; Demeulenaere, Alain
2014-11-01
Multi-Block structured solvers hold many advantages over their unstructured counterparts, such as a smaller memory footprint and efficient serial performance. Historically, multi-block structured solvers have not been easily adapted for use in a High Performance Computing (HPC) environment, and the recent trend towards hybrid GPU/CPU architectures has further complicated the situation. This paper will elaborate on developments and innovations applied to the NUMECA FINE/Turbo solver that have allowed near-linear scalability with real-world problems on over 250 hybrid GPU/CPU cluster nodes. Discussion will focus on the implementation of virtual partitioning and load balancing algorithms using a novel meta-block concept. This implementation is transparent to the user, allowing all pre- and post-processing steps to be performed using a simple, unpartitioned grid topology. Additional discussion will elaborate on developments that have improved parallel performance, including fully parallel I/O with the ADIOS API and the GPU porting of the computationally heavy CPUBooster convergence acceleration module. Head of HPC and Release Management, Numeca International.
Low cost, highly effective parallel computing achieved through a Beowulf cluster.
Bitner, Marc; Skelton, Gordon
2003-01-01
A Beowulf cluster is a means of bringing together several computers and using software and network components to make this cluster of computers appear and function as one computer with multiple parallel computing processors. A cluster of computers can provide comparable computing power usually found only in very expensive super computers or servers.
Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong
2010-10-01
Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
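The abstract does not describe its load-prediction scheduling algorithm in detail. As a rough illustration of the general idea only, the sketch below divides the work for the next time step between CPU and GPU in proportion to the throughput each device achieved on the previous step. The function name, its parameters, and the proportional rule are all assumptions, not the authors' algorithm.

```python
def split_work(total_items, cpu_items_prev, cpu_time_prev,
               gpu_items_prev, gpu_time_prev):
    """Predict a CPU/GPU split for the next step from measured throughput.

    Generic proportional scheme (an assumption, not the paper's method):
    each device receives a share of the items proportional to the rate
    (items per second) it achieved on the previous step.
    """
    cpu_rate = cpu_items_prev / cpu_time_prev
    gpu_rate = gpu_items_prev / gpu_time_prev
    cpu_share = round(total_items * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_share, total_items - cpu_share
```

With such a split both devices would be expected to finish at about the same time, which is the condition under which the combined setting (c) can exceed the speedup of either device alone.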
Parallel Computation of the Topology of Level Sets
Energy Technology Data Exchange (ETDEWEB)
Pascucci, V; Cole-McLaughlin, K
2004-12-16
we can compute the Contour Tree in linear time in many practical cases where $t = O(n^{1-\epsilon})$. We report the running times for a parallel implementation, showing good scalability with the number of processors.
Design and Implementation of “Many Parallel Task” Hybrid Subsurface Model
Energy Technology Data Exchange (ETDEWEB)
Agarwal, Khushbu; Chase, Jared M.; Schuchardt, Karen L.; Scheibe, Timothy D.; Palmer, Bruce J.; Elsethagen, Todd O.
2011-11-01
Continuum scale models have been used to study subsurface flow, transport, and reactions for many years. Recently, pore scale models, which operate at scales of individual soil grains, have been developed to more accurately model pore scale phenomena, such as precipitation, that may not be well represented at the continuum scale. However, particle-based models become prohibitively expensive for modeling realistic domains. Instead, we are developing a hybrid model that simulates the full domain at continuum scale and applies the pore model only to areas of high reactivity. The hybrid model uses a dimension reduction approach to formulate the mathematical exchange of information across scales. Since the location, size, and number of pore regions in the model vary, an adaptive Pore Generator is being implemented to define pore regions at each iteration. A fourth code will provide data transformation from the pore scale back to the continuum scale. These components are coupled into a single hybrid model using the SWIFT workflow system. Our hybrid model workflow simulates a kinetic controlled mixing reaction in which multiple pore-scale simulations occur for every continuum scale timestep. Each pore-scale simulation is itself parallel, thus exhibiting multi-level parallelism. Our workflow manages these multiple parallel tasks simultaneously, with the number of tasks changing across iterations. It also supports dynamic allocation of job resources and visualization processing at each iteration. We discuss the design, implementation and challenges associated with building a scalable, Many Parallel Task, hybrid model to run efficiently on thousands to tens of thousands of processors.
Bustamam, Alhadi; Burrage, Kevin; Hamilton, Nicholas A
2012-01-01
Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However, with the increasingly vast amount of data on biological networks, performance and scalability issues are becoming a critical limiting factor in applications. Meanwhile, GPU computing, which uses the CUDA tool for implementing a massively parallel computing environment in the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve substantial performance gains over CPU approaches. The use of on-chip memory on the GPU efficiently lowers the latency time, thus circumventing a major issue in other parallel computing environments, such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized the ELLPACK-R sparse format to allow effective and fine-grain massively parallel processing to cope with the sparse nature of interaction network data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on CPU. Thus, large-scale parallel computation on off-the-shelf desktop machines, which was previously possible only on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.
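The two operations named at the heart of MCL, the matrix-matrix product (expansion) and the Markov matrix normalization (applied after inflation), can be illustrated with a small dense, serial sketch. It is not CUDA-MCL: it uses plain Python lists instead of the ELLPACK-R sparse format and runs on the CPU. The function names, the inflation exponent `r`, and the fixed iteration count are assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (column-stochastic Markov matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: multiply the matrix by itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: elementwise power followed by renormalization."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, r=2.0, iterations=20):
    """Run a fixed number of expansion/inflation rounds (dense sketch)."""
    n = len(adjacency)
    # Self-loops keep every column sum positive before normalization.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), r)
    return m
```

In the resulting matrix, nodes whose columns place their weight on the same rows are read as belonging to the same cluster. The triple loop in `expand` is the part that a sparse GPU implementation would replace with a parallel sparse matrix-matrix product.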
A Testbed of Parallel Kernels for Computer Science Research
Energy Technology Data Exchange (ETDEWEB)
Bailey, David; Demmel, James; Ibrahim, Khaled; Kaiser, Alex; Koniges, Alice; Madduri, Kamesh; Shalf, John; Strohmaier, Erich; Williams, Samuel
2010-04-30
For several decades, computer scientists have sought guidance on how to evolve architectures, languages, and programming models for optimal performance, efficiency, and productivity. Unfortunately, this guidance is most often taken from the existing software/hardware ecosystem. Architects attempt to provide micro-architectural solutions to improve performance on fixed binaries. Researchers tweak compilers to improve code generation for existing architectures and implementations, and they may invent new programming models for fixed processor and memory architectures and computational algorithms. In today's rapidly evolving world of on-chip parallelism, these isolated and iterative improvements to performance may miss superior solutions in the same way gradient descent optimization techniques may get stuck in local minima. In an initial study, we have developed an alternate approach that, rather than starting with an existing hardware/software solution laced with hidden assumptions, defines the computational problems of interest and invites architects, researchers and programmers to implement novel hardware/ software co-designed solutions. Our work builds on the previous ideas of computational dwarfs, motifs, and parallel patterns by selecting a representative set of essential problems for which we provide: An algorithmic description; scalable problem definition; illustrative reference implementations; verification schemes. For simplicity, we focus initially on the computational problems of interest to the scientific computing community but proclaim the methodology (and perhaps a subset of the problems) as applicable to other communities. We intend to broaden the coverage of this problem space through stronger community involvement. Previous work has established a broad categorization of numerical methods of interest to the scientific computing, in the spirit of the NAS Benchmarks, which pioneered the basic idea of a 'pencil and paper benchmark' in the
Dynamic modeling of Tampa Bay urban development using parallel computing
Xian, G.; Crane, M.; Steinwand, D.
2005-01-01
Urban land use and land cover has changed significantly in the environs of Tampa Bay, Florida, over the past 50 years. Extensive urbanization has created substantial change to the region's landscape and ecosystems. This paper uses a dynamic urban-growth model, SLEUTH, which applies six geospatial data themes (slope, land use, exclusion, urban extent, transportation, hillside), to study the process of urbanization and associated land use and land cover change in the Tampa Bay area. To reduce processing time and complete the modeling process within an acceptable period, the model is recoded and ported to a Beowulf cluster. The parallel-processing computer system accomplishes the massive amount of computation the modeling simulation requires. The SLEUTH calibration process for the Tampa Bay urban growth simulation takes only 10 h of CPU time. The model predicts future land use/cover change trends for Tampa Bay from 1992 to 2025. Urban extent is predicted to double in the Tampa Bay watershed between 1992 and 2025. Results show an upward trend of urbanization at the expense of a decline of 58% and 80% in agriculture and forested lands, respectively. © 2005 Elsevier Ltd. All rights reserved.
2009-01-01
At the 19th Annual Conference on Parallel Computational Fluid Dynamics held in Antalya, Turkey, in May 2007, the most recent developments and implementations of large-scale and grid computing were presented. This book, comprised of the invited and selected papers of this conference, details those advances, which are of particular interest to CFD and CFD-related communities. It also offers the results related to applications of various scientific and engineering problems involving flows and flow-related topics. Intended for CFD researchers and graduate students, this book is a state-of-the-art presentation of the relevant methodology and implementation techniques of large-scale computing.
Energy Technology Data Exchange (ETDEWEB)
Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.
1988-08-01
Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site-oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs.
A Hybrid Decomposition Parallel Implementation of the Car-Parrinello Method
Wiggs, J K; Wiggs, James K.; Jonsson, Hannes
1994-01-01
We have developed a flexible hybrid decomposition parallel implementation of the first-principles molecular dynamics algorithm of Car and Parrinello. The code allows the problem to be decomposed either spatially, over the electronic orbitals, or any combination of the two. Performance statistics for 32, 64, 128 and 512 Si atom runs on the Touchstone Delta and Intel Paragon parallel supercomputers and comparison with the performance of an optimized code running the smaller systems on the Cray Y-MP and C90 are presented.
Yoo, Youngjin; Prasloski, Thomas; Vavasour, Irene; MacKay, Alexander; Traboulsee, Anthony L; Li, David K B; Tam, Roger C
2015-03-01
To develop a fast algorithm for computing myelin maps from multiecho T2 relaxation data using parallel computation with multicore CPUs and graphics processing units (GPUs). Using an existing MATLAB (MathWorks, Natick, MA) implementation with basic (nonalgorithm-specific) parallelism as a guide, we developed a new version to perform the same computations but using C++ to optimize the hybrid utilization of multicore CPUs and GPUs, based on experimentation to determine which algorithmic components would benefit from CPU versus GPU parallelization. Using 32-echo T2 data of dimensions 256 × 256 × 7 from 17 multiple sclerosis patients and 18 healthy subjects, we compared the two methods in terms of speed, myelin values, and the ability to distinguish between the two patient groups using Student's t-tests. The new method was faster than the MATLAB implementation by 4.13 times for computing a single map and 14.36 times for batch-processing 10 scans. The two methods produced very similar myelin values, with small and explainable differences that did not impact the ability to distinguish the two patient groups. The proposed hybrid multicore approach represents a more efficient alternative to MATLAB, especially for large-scale batch processing. © 2014 Wiley Periodicals, Inc.
Energy Technology Data Exchange (ETDEWEB)
Payne, J.L.; Hassan, B.
1998-09-01
Massively parallel computers have enabled the analyst to solve complicated flow fields (turbulent, chemically reacting) that were previously intractable. Calculations are presented using a massively parallel CFD code called SACCARA (Sandia Advanced Code for Compressible Aerothermodynamics Research and Analysis) currently under development at Sandia National Laboratories as part of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI). Computations were made on a generic reentry vehicle in a hypersonic flowfield utilizing three different distributed parallel computers to assess the parallel efficiency of the code with increasing numbers of processors. The parallel efficiencies for the SACCARA code will be presented for cases using 1, 150, 100 and 500 processors. Computations were also made on a subsonic/transonic vehicle using both 236 and 521 processors on a grid containing approximately 14.7 million grid points. Ongoing and future plans to implement a parallel overset grid capability and couple SACCARA with other mechanics codes in a massively parallel environment are discussed.
Directory of Open Access Journals (Sweden)
José Miguel Vargas-Félix
2012-11-01
The Finite Element Method (FEM) is used to solve problems like solid deformation and heat diffusion in domains with complex geometries. Such geometries require discretization with millions of elements; this is equivalent to solving systems of equations with sparse matrices and tens or hundreds of millions of variables. The aim is to use computer clusters to solve these systems. The solution method used is Schur substructuring. With it, a large system of equations can be divided into many small ones that are solved more efficiently. This method allows parallelization. MPI (Message Passing Interface) is used to distribute the systems of equations so that each one is solved on one computer of a cluster. Each system of equations is solved using a solver implemented to use OpenMP as a local parallelization method.
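The substructuring idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy symmetric system with a single interface unknown, uses Python threads as a stand-in for the MPI distribution across cluster nodes, and a small dense Gaussian elimination in place of the OpenMP-parallel local solver. All function names (`solve`, `local_condense`, `schur_solve`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy; inputs untouched
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def local_condense(sub):
    """Per-subdomain work: condense the interior unknowns onto the interface.

    sub = (A_II, c, f_I), where c couples the interior unknowns to the
    single interface unknown (assumed symmetric, so A_BI = c^T).
    """
    A_II, c, f_I = sub
    w = solve(A_II, c)    # A_II^{-1} c
    z = solve(A_II, f_I)  # A_II^{-1} f_I
    s = sum(ci * wi for ci, wi in zip(c, w))  # contribution to the Schur complement
    g = sum(ci * zi for ci, zi in zip(c, z))  # contribution to the condensed right-hand side
    return s, g


def schur_solve(subdomains, a_BB, f_B):
    """Solve for the interface unknown, then recover each interior independently."""
    with ThreadPoolExecutor() as pool:  # stand-in for one MPI rank per subdomain
        contributions = list(pool.map(local_condense, subdomains))
    S = a_BB - sum(s for s, _ in contributions)
    g = f_B - sum(gi for _, gi in contributions)
    x_B = g / S
    interiors = [
        solve(A_II, [fi - ci * x_B for fi, ci in zip(f_I, c)])
        for A_II, c, f_I in subdomains
    ]
    return x_B, interiors
```

For a 1D Laplacian with five unknowns split as two interior unknowns, one interface unknown, and two interior unknowns, each subdomain is described by its 2 x 2 interior block, its coupling vector, and its load vector; `schur_solve` then reproduces the solution of the full tridiagonal system. In a real code the interface would contain many unknowns and the Schur complement would be a matrix rather than a scalar.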
Directory of Open Access Journals (Sweden)
Gabriele Jost
2010-01-01
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: shared-memory nodes with several multi-core CPUs are connected via a network infrastructure. When parallelizing an application for these architectures it seems natural to employ a hierarchical programming model such as combining MPI and OpenMP. Nevertheless, there is the general lore that pure MPI outperforms the hybrid MPI/OpenMP approach. In this paper, we describe the hybrid MPI/OpenMP parallelization of IR3D (Incompressible Realistic 3-D) code, a full-scale real-world application, which simulates the environmental effects on the evolution of vortices trailing behind control surfaces of underwater vehicles. We discuss performance, scalability and limitations of the pure MPI version of the code on a variety of hardware platforms and show how the hybrid approach can help to overcome certain limitations.
Literature Review on the Hybrid Flow Shop Scheduling Problem with Unrelated Parallel Machines
Directory of Open Access Journals (Sweden)
Eliana Marcela Peña Tibaduiza
2017-01-01
Context: The flow shop hybrid problem with unrelated parallel machines has been less studied in academia than the flow shop hybrid with identical processors. For this reason, there are few reports about the kind of application of this problem in industries. Method: A literature review of the state of the art on the flow-shop scheduling problem was conducted by collecting and analyzing academic papers from several scientific databases. For this aim, a search query was constructed using keywords defining the problem and checking the inclusion of unrelated parallel machines in such definition; as a result, 50 papers were finally selected for this study. Results: A classification of the problem according to the characteristics of the production system was performed, and the solution methods, constraints and objective functions commonly used are presented. Conclusions: An increasing trend is observed in studies of flow shop with multiple stages, but few are based on industry case-studies.
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels ranging from the formulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tends to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, the object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present the basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show how to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show how to apply these methods i...
Aspects of operating systems and software engineering with parallel computer architectures
Energy Technology Data Exchange (ETDEWEB)
Foessmeier, R.; Ruede, U.; Zenger, C.
1988-05-01
Making efficient use of parallel computer architectures generally requires special programming techniques. Usually, non-standardized parallel constructs are added to a traditional programming language. This reduces program portability and adds extra difficulties to programming. Coarse-grain parallelism can be exploited by parallel processes. In this field the operating system UNIX - now in widespread use - offers easy-to-use means for describing parallelism, sufficient for basic process synchronisation and communication. Problem structurization required for this kind of parallelism often contributes to the versatility and clarity of the programs. As an example, the elimination of a linear system is parallelized.
Codebook Design and Hybrid Digital/Analog Coding for Parallel Rayleigh Fading Channels
Shi, Shuying; Larsson, Erik G.; Skoglund, Mikael
2011-01-01
Low-delay source-channel transmission over parallel fading channels is studied. In this scenario separate source and channel coding is highly suboptimal. A scheme based on hybrid digital/analog joint source-channel coding is therefore proposed, employing scalar quantization and polynomial-based analog bandwidth expansion. Simulations demonstrate substantial performance gains. Funding agencies: European Community (248993, EL-LIIT); Knut and Alice Wallenberg Foundation.
Parallel In Situ Indexing for Data-intensive Computing
Energy Technology Data Exchange (ETDEWEB)
Kim, Jinoh; Abbasi, Hasan; Chacon, Luis; Docan, Ciprian; Klasky, Scott; Liu, Qing; Podhorszki, Norbert; Shoshani, Arie; Wu, Kesheng
2011-09-09
As computing power increases exponentially, vast amounts of data are created by many scientific research activities. However, the bandwidth for storing the data to disks and reading the data from disks has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing a relatively small fraction of data records. As data sets increase in size, more and more analysts need to use selective data access, which makes indexing even more important for improving data access. The challenge is that most implementations of indexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software, which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt the in situ processing technology to generate the indexes, thus removing the need to read data from disks and allowing the indexes to be built in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve the data access time up to 200 times depending on the fraction of data selected, and that using an in situ data processing system can effectively reduce the time needed to create the indexes, up to 10 times with our in situ technique when using identical parallel settings.
A hybrid method for the parallel computation of Green's functions
DEFF Research Database (Denmark)
Petersen, Dan Erik; Li, Song; Stokbro, Kurt;
2009-01-01
Quantum transport models for nanodevices using the non-equilibrium Green's function method require the repeated calculation of the block tridiagonal part of the Green's and lesser Green's function matrices. This problem is related to the calculation of the inverse of a sparse matrix. Because...
Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Davis, Kristan D.; Faraj, Daniel
2016-05-03
In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.
Identifying logical planes formed of compute nodes of a subcommunicator in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Davis, Kristan D.; Faraj, Daniel A.
2016-03-01
In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.
Intelligent Energy Management Strategy for a Separated-Axle Parallel Hybrid Electric Vehicle
Directory of Open Access Journals (Sweden)
Naser Fallahi
2014-03-01
Hybrid electric vehicles (HEVs), in addition to providing the benefits of electric vehicles, can satisfy consumers with some of the performance characteristics of conventional internal combustion engine (ICE) vehicles, such as acceleration and long range. To this end, suitable energy optimization strategies should be employed to achieve the desired efficiency with lower fuel consumption and pollution. One of the favored and simple configurations of HEVs is the parallel type. A student team at the University of Kashan, Iran, has designed and manufactured the Shaheb 2 hybrid electric vehicle. It is a separated-axle (or Through-to-Road (TTR)) parallel HEV based on the Pride platform. The energy management employed in Shaheb 2 is an on/off strategy, and three modes (motor, engine and hybrid) have been implemented. This paper investigates the modeling of the separated-axle (or TTR) parallel type of HEV in ADVISOR software and then evaluates two control strategies for Shaheb 2: the on/off strategy and an intelligent control based on fuzzy logic. Here, maximizing the engine is considered as the objective function. The simulation results indicate that, for the given UDDS driving cycle, the fuzzy strategy leads to less fuel consumption and lower pollution than the on/off strategy for Shaheb 2.
Xia, Yidong
The objective of this work is to develop a parallel, implicit reconstructed discontinuous Galerkin (RDG) method using Taylor basis for the solution of the compressible Navier-Stokes equations on 3D hybrid grids. This third-order accurate RDG method is based on a hierarchical weighted essentially non-oscillatory reconstruction scheme, termed HWENO(P1P2) to indicate that a quadratic polynomial solution is obtained from the underlying linear polynomial DG solution via a hierarchical WENO reconstruction. The HWENO(P1P2) is designed not only to enhance the accuracy of the underlying DG(P1) method but also to ensure non-linear stability of the RDG method. In this reconstruction scheme, a quadratic polynomial (P2) solution is first reconstructed using a least-squares approach from the underlying linear (P1) discontinuous Galerkin solution. The final quadratic solution is then obtained using a Hermite WENO reconstruction, which is necessary to ensure the linear stability of the RDG method on 3D unstructured grids. The first derivatives of the quadratic polynomial solution are then reconstructed using a WENO reconstruction in order to eliminate spurious oscillations in the vicinity of strong discontinuities, thus ensuring the non-linear stability of the RDG method. The parallelization in the RDG method is based on a message passing interface (MPI) programming paradigm, where the METIS library is used for the partitioning of a mesh into subdomain meshes of approximately the same size. Both multi-stage explicit Runge-Kutta and simple implicit backward Euler methods are implemented for time advancement in the RDG method. In the implicit method, three approaches (analytical differentiation, divided differencing (DD), and automatic differentiation (AD)) are developed and implemented to obtain the resulting flux Jacobian matrices. The automatic differentiation is a set of techniques based on the mechanical application of the chain rule to obtain derivatives of a function given as
Parallel Solver for H(div) Problems Using Hybridization and AMG
Energy Technology Data Exchange (ETDEWEB)
Lee, Chak S. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Vassilevski, Panayot S. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2016-01-15
In this paper, a scalable parallel solver is proposed for H(div) problems discretized by arbitrary order finite elements on general unstructured meshes. The solver is based on hybridization and algebraic multigrid (AMG). Unlike some previously studied H(div) solvers, the hybridization solver does not require discrete curl and gradient operators as additional input from the user. Instead, only some element information is needed in the construction of the solver. The hybridization results in a H1-equivalent symmetric positive definite system, which is then rescaled and solved by AMG solvers designed for H1 problems. Weak and strong scaling of the method are examined through several numerical tests. Our numerical results show that the proposed solver provides a promising alternative to ADS, a state-of-the-art solver [12], for H(div) problems. In fact, it outperforms ADS for higher order elements.
Optimal Control for a Parallel Hybrid Hydraulic Excavator Using Particle Swarm Optimization
Directory of Open Access Journals (Sweden)
Dong-yun Wang
2013-01-01
Optimal control using particle swarm optimization (PSO) is put forward for a parallel hybrid hydraulic excavator (PHHE). A power-train mathematical model of the PHHE is presented along with an analysis of the components' parameters. Then, the optimal control problem is addressed, and the PSO algorithm is introduced to deal with this nonlinear optimization problem, which contains many inequality and equality constraints. Comparisons between the optimal control and a rule-based one are then made, and the results show that hybrids with the optimal control would increase fuel economy. Although the PSO algorithm is an off-line optimization, it still provides a performance benchmark for the PHHE and gives deeper insight into hybrid excavators.
Mechatronic Design of a New Humanoid Robot with Hybrid Parallel Actuation
Directory of Open Access Journals (Sweden)
Vítor Santos
2012-10-01
Humanoid robotics is unquestionably a challenging and long-term field of research. Of the numerous and most urgent challenges to tackle, autonomous and efficient locomotion may possibly be the most underdeveloped at present in the research community. Therefore, to pursue studies in relation to autonomy with efficient locomotion, the authors have been developing a new teen-sized humanoid platform with hybrid characteristics. The hybrid nature is clear in the mixed actuation based on common electrical motors and passive actuators attached in parallel to the motors. This paper presents the mechatronic design of the humanoid platform, focusing mainly on the mechanical structure, the design and simulation of the hybrid joints, and the different subsystems implemented. Trying to keep the appropriate human proportions and main degrees of freedom, the developed platform utilizes a distributed control architecture and a rich set of sensing capabilities, both ripe for future development and research.
Population genomics of parallel hybrid zones in the mimetic butterflies, H. melpomene and H. erato
Ruiz, Mayté; Salazar, Patricio; Counterman, Brian; Medina, Jose Alejandro; Ortiz-Zuazaga, Humberto; Morrison, Anna; Papa, Riccardo
2014-01-01
Hybrid zones can be valuable tools for studying evolution and identifying genomic regions responsible for adaptive divergence and underlying phenotypic variation. Hybrid zones between subspecies of Heliconius butterflies can be very narrow and are maintained by strong selection acting on color pattern. The comimetic species, H. erato and H. melpomene, have parallel hybrid zones in which both species undergo a change from one color pattern form to another. We use restriction-associated DNA sequencing to obtain several thousand genome-wide sequence markers and use these to analyze patterns of population divergence across two pairs of parallel hybrid zones in Peru and Ecuador. We compare two approaches for analysis of this type of data—alignment to a reference genome and de novo assembly—and find that alignment gives the best results for species both closely (H. melpomene) and distantly (H. erato, ∼15% divergent) related to the reference sequence. Our results confirm that the color pattern controlling loci account for the majority of divergent regions across the genome, but we also detect other divergent regions apparently unlinked to color pattern differences. We also use association mapping to identify previously unmapped color pattern loci, in particular the Ro locus. Finally, we identify a new cryptic population of H. timareta in Ecuador, which occurs at relatively low altitude and is mimetic with H. melpomene malleti. PMID:24823669
Hybrid Computational Model for High-Altitude Aeroassist Vehicles Project
National Aeronautics and Space Administration — A hybrid continuum/noncontinuum computational model will be developed for analyzing the aerodynamics and heating on aeroassist vehicles. Unique features of this...
Parallel computing for data science with examples in R, C++ and CUDA
Matloff, Norman
2015-01-01
Parallel Computing for Data Science: With Examples in R, C++ and CUDA is one of the first parallel computing books to concentrate exclusively on parallel data structures, algorithms, software tools, and applications in data science. It includes examples not only from the classic "n observations, p variables" matrix format but also from time series, network graph models, and numerous other structures common in data science. The examples illustrate the range of issues encountered in parallel programming. With the main focus on computation, the book shows how to compute on three types of platfor
Efficient Parallel Computation of Nearest Neighbor Interchange Distances
Gast, Mikael
2012-01-01
The nni-distance is a well-known distance measure for phylogenetic trees. We construct an efficient parallel approximation algorithm for the nni-distance in the CRCW-PRAM model running in O(log n) time on O(n) processors. Given two phylogenetic trees T1 and T2 on the same set of taxa and with the same multi-set of edge-weights, the algorithm constructs a sequence of nni-operations of weight at most O(log n) · opt, where opt denotes the minimum weight of a sequence of nni-operations transforming T1 into T2. This algorithm is based on the sequential approximation algorithm for the nni-distance given by DasGupta et al. (2000). Furthermore, we show that the problem of identifying so-called good edge-pairs between two weighted phylogenies can be computed in O(log n) time on O(n log n) processors.
Parallelizing Sylvester-like operations on a distributed memory computer
Energy Technology Data Exchange (ETDEWEB)
Hu, D.Y.; Sorensen, D.C. [Rice Univ., Houston, TX (United States)
1994-12-31
Discretization of linear operators arising in applied mathematics often leads to matrices with the following structure: $M(x) = (D \otimes A + B \otimes I_n + V)x$, where $x \in \mathbb{R}^{mn}$, $B, D \in \mathbb{R}^{n \times n}$, $A \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{mn \times mn}$; both $D$ and $V$ are diagonal. For notational convenience, the authors assume that both $A$ and $B$ are symmetric. All the results in this paper can be easily extended to the cases with general $A$ and $B$. The linear operator on $\mathbb{R}^{mn}$ defined above can be viewed as a generalization of the Sylvester operator: $S(x) = (I_m \otimes A + B \otimes I_n)x$. The authors therefore refer to it as a Sylvester-like operator. The schemes discussed in this paper therefore also apply to the Sylvester operator. In this paper, the authors present the SIMD scheme for parallelization of the Sylvester-like operator on a distributed memory computer. This scheme is designed to approach the best possible efficiency by avoiding unnecessary communication among processors.
Adaptive Dynamic Process Scheduling on Distributed Memory Parallel Computers
Directory of Open Access Journals (Sweden)
Wei Shu
1994-01-01
One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly important for computations with unpredictable dynamic behaviors or irregular structures. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The adaptive contracting within neighborhood (ACWN) is a dynamic, distributed, load-dependent, and scalable scheme. It deals with dynamic and unpredictable creation of processes and adapts to different systems. The scheme is described and contrasted with two other schemes that have been proposed in this context, namely the randomized allocation and the gradient model. The performance of the three schemes on an Intel iPSC/2 hypercube is presented and analyzed. The experimental results show that even though the ACWN algorithm incurs somewhat larger overhead than the randomized allocation, it achieves better performance in most cases due to its adaptiveness. Its feature of quickly spreading the work helps it outperform the gradient model in performance and scalability.
I - Template Metaprogramming for Massively Parallel Scientific Computing - Expression Templates
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels ranging from the formulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tends to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, the object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present the basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show how to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show how to apply these methods i...
Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E
2014-11-18
Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.
Digital Potentiometer for Hybrid Computer EAI 680-PDP-8/I
DEFF Research Database (Denmark)
Højberg, Kristian Søe; Olsen, Jens V.
1974-01-01
In this article a description is given of a 12 bit digital potentiometer for hybrid computer application. The system is composed of standard building blocks. Emphasis is laid on the development problems met and the problem solutions developed.
An implementation of a parallel MOL solver on the Intel gamma parallel computer
Energy Technology Data Exchange (ETDEWEB)
Lawkins, W.F.; Payne, J.S.
1992-06-17
An implicit parallel method-of-lines solver that has been implemented on the MIMD Intel Gamma prototype supercomputer is discussed. The strategy for implementation is to execute the ODE solver sequentially and to do the numerical linear algebra in parallel. Performance studies for this implementation are presented.
Introduction to massively-parallel computing in high-energy physics
Smith, Mark
1993-01-01
Ever since computers were first used for scientific and numerical work, there has existed an "arms race" between the technical development of faster computing hardware, and the desires of scientists to solve larger problems in shorter time-scales. However, the vast leaps in processor performance achieved through advances in semi-conductor science have reached a hiatus as the technology comes up against the physical limits of the speed of light and quantum effects. This has led all high performance computer manufacturers to turn towards a parallel architecture for their new machines. In these lectures we will introduce the history and concepts behind parallel computing, and review the various parallel architectures and software environments currently available. We will then introduce programming methodologies that allow efficient exploitation of parallel machines, and present case studies of the parallelization of typical High Energy Physics codes for the two main classes of parallel computing architecture (S...
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB
Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.
2017-01-01
Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
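The reason the Jacobian counts as embarrassingly parallel can be sketched as follows: with finite differences, each column depends only on one perturbed input, so columns are independent tasks. This is a minimal Python sketch, not the MATLAB spmd code of the study; the function names, the step size `h` and the use of a thread pool are assumptions made for illustration. In standard CPython builds threads typically give no speedup for pure-Python arithmetic, so the pool shows only the task structure.

```python
from concurrent.futures import ThreadPoolExecutor


def jacobian_column(f, x, j, h=1e-6):
    """Forward-difference estimate of column j of the Jacobian of f at x."""
    perturbed = list(x)
    perturbed[j] += h
    base = f(x)
    moved = f(perturbed)
    return [(m - b) / h for m, b in zip(moved, base)]


def jacobian(f, x, max_workers=4):
    """Build the Jacobian column by column; every column is an independent task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        columns = list(pool.map(lambda j: jacobian_column(f, x, j), range(len(x))))
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]


def example_system(x):
    """Hypothetical two-equation test system, used only for the usage example."""
    return [x[0] ** 2 + x[1], 3.0 * x[1]]


J = jacobian(example_system, [1.0, 2.0])
```

For the hypothetical test system the exact Jacobian at (1, 2) is [[2, 1], [0, 3]], and the forward differences should agree with it to within the truncation error of the chosen step.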
Wigton, Larry
1996-01-01
Improving the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR is reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models is written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.
A Parallel Computational Fluid Dynamics Unstructured Grid Generator
1993-12-01
and parallel processing. I had a great deal of help in this effort. I would especially like to thank my advisor, LtCol Hobart, and my committee members...Mathematics Sciences Section at Oak Ridge National Laboratory, especially Barry Peyton and Dave MacKay for their help in providing me with their parallel recursive...solvers is due, in part, to the evolution of unstructured grids. Problem This research develops a parallel algorithm to create a two-dimensional
Parallel computations using a cluster of workstations to simulate elasticity problems
Darmawan, J. B. B.; Mungkasi, S.
2016-11-01
Computational physics has played important roles in real world problems. This paper is within the applied computational physics area. The aim of this study is to observe the performance of parallel computations using a cluster of workstations (COW) to simulate elasticity problems. Parallel computations with the COW configuration are conducted using the Message Passing Interface (MPI) standard. In parallel computations with COW, we consider five scenarios with twenty simulations. In addition to the execution time, efficiency is used to evaluate programming algorithm scenarios. Sequential and parallel programming performances are evaluated based on their execution time and efficiency. Results show that the one-dimensional elasticity equations are not appropriate to be solved in parallel with MPI_Send and MPI_Recv technique in the MPI standard, because the total amount of time to exchange data is considered more dominant compared with the total amount of time to conduct the basic elasticity computation.
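The two metrics named in this abstract, execution time and efficiency, can be related by a short sketch. It uses the standard definitions (speedup as serial time over parallel time, efficiency as speedup per process); the toy cost model and all numbers below are assumptions for illustration and are not taken from the paper.

```python
def speedup(t_serial, t_parallel):
    """Ratio of sequential to parallel execution time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Speedup per process; 1.0 would be ideal linear scaling."""
    return speedup(t_serial, t_parallel) / n_procs


def modeled_parallel_time(t_compute, t_comm, n_procs):
    """Toy model (an assumption): computation divides across processes,
    while the data-exchange cost stays fixed."""
    return t_compute / n_procs + t_comm


# Hypothetical timings in seconds: cheap computation, costly exchange.
t_serial = 1.0
t_par = modeled_parallel_time(t_compute=1.0, t_comm=2.0, n_procs=4)
s = speedup(t_serial, t_par)
e = efficiency(t_serial, t_par, 4)
```

With these made-up numbers the exchange term dominates, so the speedup falls below 1 and the parallel run is slower than the sequential one, which is the kind of outcome the abstract reports for the one-dimensional case.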
Wu, X.
2011-07-18
The NAS Parallel Benchmarks (NPB) are well-known applications with fixed algorithms for evaluating parallel systems and tools. Multicore clusters provide a natural programming paradigm for hybrid programs, whereby OpenMP can be used with the data sharing with the multicores that comprise a node, and MPI can be used with the communication between nodes. In this paper, we use Scalar Pentadiagonal (SP) and Block Tridiagonal (BT) benchmarks of MPI NPB 3.3 as a basis for a comparative approach to implement hybrid MPI/OpenMP versions of SP and BT. In particular, we can compare the performance of the hybrid SP and BT with the MPI counterparts on large-scale multicore clusters, Intrepid (BlueGene/P) at Argonne National Laboratory and Jaguar (Cray XT4/5) at Oak Ridge National Laboratory. Our performance results indicate that the hybrid SP outperforms the MPI SP by up to 20.76 %, and the hybrid BT outperforms the MPI BT by up to 8.58 % on up to 10 000 cores on Intrepid and Jaguar. We also use performance tools and MPI trace libraries available on these clusters to further investigate the performance characteristics of the hybrid SP and BT. © 2011 The Author. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
Hybrid system for computing reachable workspaces for redundant manipulators
Alameldin, Tarek K.; Sobh, Tarek M.
1991-03-01
An efficient computation of 3D workspaces for redundant manipulators is based on a "hybrid" algorithm between direct kinematics and screw theory. Direct kinematics enjoys low computational cost but needs edge detection algorithms when workspace boundaries are needed. Screw theory has exponential computational cost per workspace point but does not need edge detection. Screw theory allows computing workspace points in prespecified directions while direct kinematics does not. Applications of the algorithm are discussed.
Marx, Alain; Lütjens, Hinrich
2017-03-01
A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it makes it possible to perform simulations with higher resolutions, previously forbidden because of memory limitations.
Jagiella, Nick; Rickert, Dennis; Theis, Fabian J; Hasenauer, Jan
2017-02-22
Mechanistic understanding of multi-scale biological processes, such as cell proliferation in a changing biological tissue, is readily facilitated by computational models. While tools exist to construct and simulate multi-scale models, the statistical inference of the unknown model parameters remains an open problem. Here, we present and benchmark a parallel approximate Bayesian computation sequential Monte Carlo (pABC SMC) algorithm, tailored for high-performance computing clusters. pABC SMC is fully automated and returns reliable parameter estimates and confidence intervals. By running the pABC SMC algorithm for ∼10^6 hr, we parameterize multi-scale models that accurately describe quantitative growth curves and histological data obtained in vivo from individual tumor spheroid growth in media droplets. The models capture the hybrid deterministic-stochastic behaviors of 10^5-10^6 cells growing in a 3D dynamically changing nutrient environment. The pABC SMC algorithm reliably converges to a consistent set of parameters. Our study demonstrates a proof of principle for robust, data-driven modeling of multi-scale biological systems and the feasibility of multi-scale model parameterization through statistical inference.
Memory Benchmarks for SMP-Based High Performance Parallel Computers
Energy Technology Data Exchange (ETDEWEB)
Yoo, A B; de Supinski, B; Mueller, F; Mckee, S A
2001-11-20
As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominate the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even more complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.
Algorithmic support for commodity-based parallel computing systems.
Energy Technology Data Exchange (ETDEWEB)
Leung, Vitus Joseph; Bender, Michael A. (State University of New York, Stony Brook, NY); Bunde, David P. (University of Illinois, Urbana, IL); Phillips, Cynthia Ann
2003-10-01
The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in R^d, find k points that minimize their average pairwise L_1 distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
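The clustering problem stated in this abstract can be made concrete with a small sketch. The exhaustive search below is only a baseline for checking tiny instances; it is not one of the report's exact or approximate algorithms, and the function names and sample coordinates are hypothetical.

```python
from itertools import combinations


def l1_distance(p, q):
    """L_1 (Manhattan) distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_pairwise_l1(points):
    """Average L_1 distance over all unordered pairs in the group."""
    pairs = list(combinations(points, 2))
    return sum(l1_distance(p, q) for p, q in pairs) / len(pairs)


def best_cluster(points, k):
    """Try every k-subset and keep the one with the smallest average
    pairwise L_1 distance (cost grows combinatorially with n)."""
    if not 2 <= k <= len(points):
        raise ValueError("k must be between 2 and the number of points")
    return min(combinations(points, k), key=average_pairwise_l1)


# Hypothetical free processors, given as 2D grid coordinates.
free = [(0, 0), (0, 1), (1, 0), (5, 5), (9, 9)]
chosen = best_cluster(free, 3)
```

Here the three processors near the origin are selected, with an average pairwise distance of 4/3 hops under the assumption that one hop corresponds to one unit of L_1 distance on the grid.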
PARALLEL SIMULATION OF THE ELLIPTIC MILD SLOPE EQUATION WITH A PERSONAL COMPUTER CLUSTER
Institute of Scientific and Technical Information of China (English)
Zheng Yong-hong; You Ya-ge; Wang Li-sheng; Shen Yong-ming
2003-01-01
With the message passing interface, a parallel solution method was proposed for the simulation of the elliptic mild slope equation, and implemented numerically on a parallel system based on a personal computer cluster, which was constructed by the authors. The wave transformations over two typical topographies with mild slopes were simulated. Numerical results show that the parallel solution method presented in this paper can not only increase the computational efficiency, but also greatly decrease the memory storage needed on a single computer, so the parallel system based on a PCC can be used to simulate wave transformations over much larger areas.
Energy Technology Data Exchange (ETDEWEB)
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving's meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia's "tiling" dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
Institute of Scientific and Technical Information of China (English)
Jun-yi LIANG; Jian-long ZHANG; Xi ZHANG; Shi-fei YUAN; Cheng-liang YIN
2013-01-01
To solve the low power density issue of hybrid electric vehicular batteries, a combination of batteries and ultracapacitors (UCs) could be a solution. The high power density feature of UCs can improve the performance of battery/UC hybrid energy storage systems (HESSs). This paper presents a parallel hybrid electric vehicle (HEV) equipped with an internal combustion engine and an HESS. An advanced energy management strategy (EMS), mainly based on fuzzy logic, is proposed to improve the fuel economy of the HEV and the endurance of the HESS. The EMS is capable of determining the ideal distribution of output power among the internal combustion engine, battery, and UC according to the propelling power or regenerative braking power of the vehicle. To validate the effectiveness of the EMS, numerical simulation and experimental validations are carried out. The results indicate that the EMS can effectively control the power sources to work within their respective efficient areas. The battery load can be mitigated and prolonged battery life can be expected. The electrical energy consumption in the HESS is reduced by 3.91% compared with that in the battery only system. Fuel consumption of the HEV is reduced by 24.3% compared with that of the same class conventional vehicles under Economic Commission of Europe driving cycle.
Survey of CPU/GPU Synergetic Parallel Computing
Institute of Scientific and Technical Information of China (English)
卢风顺; 宋君强; 银福康; 张理论
2011-01-01
With their tremendous computing capability, high performance/price ratio and low power consumption, heterogeneous hybrid CPU/GPU parallel systems have become the new high performance computing platforms. However, the architectural complexity of the hybrid system poses many challenges for parallel algorithm design on this infrastructure. CPU/GPU synergetic parallel computing is an emerging research field and remains an open topic. According to the scale of computational resources involved in the synergetic parallel computing, we classify the recent research into three categories, detail the motivations, methodologies and applications of several hybrid computing projects, and finally discuss some on-going research issues in this direction. We hope that domain experts can gain useful information about synergetic parallel computing from this work.
Design, Dynamics, and Workspace of a Hybrid-Driven-Based Cable Parallel Manipulator
Directory of Open Access Journals (Sweden)
Bin Zi
2013-01-01
The design, dynamics, and workspace of a hybrid-driven-based cable parallel manipulator (HDCPM) are presented. The HDCPM is able to perform high efficiency, heavy load, and high-performance motion due to the advantages of both the cable parallel manipulator and the hybrid-driven planar five-bar mechanism. The design is performed according to theories of mechanism structure synthesis for cable parallel manipulators. The dynamic formulation of the HDCPM is established on the basis of the Newton-Euler method. The workspace of the manipulator is additionally analyzed. As an example, a completely restrained HDCPM with 3 degrees of freedom is studied in simulation in order to verify the validity of the proposed design, workspace, and dynamic analysis. The simulation results, compared with the theoretical analysis and the case study previously performed, show that the manipulator design is reasonable and the mathematical models are correct, which provides the theoretical basis for future physical prototype and control system design.
Hierarchical Control of Parallel AC-DC Converter Interfaces for Hybrid Microgrids
DEFF Research Database (Denmark)
Lu, Xiaonan; Guerrero, Josep M.; Sun, Kai;
2014-01-01
In this paper, a hierarchical control system for parallel power electronics interfaces between ac bus and dc bus in a hybrid microgrid is presented. Both standalone and grid-connected operation modes in the dc side of the microgrid are analyzed. Concretely, a three-level hierarchical control system is implemented. In the primary control level, the decentralized control is realized by using the droop method. Local ac current proportional-resonant controller and dc voltage proportional-integral controller are employed. When the local load is connected to the dc bus, dc droop control is applied to obtain... the three control levels is developed in order to adjust the main control parameters and study the system stability. Experimental results of a 2×2.2 kW parallel ac-dc converter system have shown satisfactory realization of the designed system.
A Hybrid Parallel Execution Model for Logic Based Requirement Specifications (Invited Paper)
Directory of Open Access Journals (Sweden)
Jeffrey J. P. Tsai
1999-05-01
It is well known that undiscovered errors in a requirements specification are extremely expensive to fix when discovered in the software maintenance phase. Errors in the requirement phase can be reduced through the validation and verification of the requirements specification. Many logic-based requirements specification languages have been developed to achieve these goals. However, the execution and reasoning of a logic-based requirements specification can be very slow. An effective way to improve their performance is to execute and reason the logic-based requirements specification in parallel. In this paper, we present a hybrid model to facilitate the parallel execution of a logic-based requirements specification language. A data dependency analysis technique is first applied to a logic-based specification to find all the mode combinations that exist within a specification clause. This mode information is used to support a novel hybrid parallel execution model, which combines both top-down and bottom-up evaluation strategies. This new execution model can find the failure in the deepest node of the search tree at an early stage of the evaluation, and thus can reduce the total number of nodes searched in the tree, the total number of processes that need to be generated, and the total number of communication channels needed in the search process. A simulator has been implemented to analyze the execution behavior of the new model. Experiments show significant improvement based on several criteria.
Parallel computing strategy for the simulation of particulate flows with immersed boundary method
Institute of Scientific and Technical Information of China (English)
WANG ZeLi; FAN JianRen; LUO Kun
2008-01-01
A parallel computing strategy for the simulation of particulate flows with immersed boundary technique is proposed. This strategy can deal with the coupling between fluid and particle easily when a particle crosses the boundaries of sub-domains which are decomposed from the original computational domain. A two-dimensional circular particle settling in a closed rectangular domain is simulated with the parallel technique and immersed boundary method to validate the parallel efficiency.
Directory of Open Access Journals (Sweden)
Huei Peng
2013-04-01
This paper compares two optimal energy management methods for parallel hybrid electric vehicles using an Automatic Manual Transmission (AMT). A control-oriented model of the powertrain and vehicle dynamics is built first. The energy management is formulated as a typical optimal control problem to trade off the fuel consumption and gear shifting frequency under admissible constraints. Dynamic Programming (DP) and Pontryagin's Minimum Principle (PMP) are applied to obtain the optimal solutions. With appropriately tuned co-states, the PMP solution is found to be very close to that from DP. The solution for the gear shifting in PMP has an algebraic expression associated with the vehicular velocity and can be implemented more efficiently in the control algorithm. The computation time of PMP is significantly less than that of DP.
Energy Technology Data Exchange (ETDEWEB)
Sadjadi, Seyed Jafar [Department of Industrial Engineering, Iran University of Science and Technology, Tehran (Iran, Islamic Republic of)], E-mail: sjsadjadi@iust.ac.ir; Soltani, R. [Department of Industrial Engineering, Iran University of Science and Technology, Tehran (Iran, Islamic Republic of)
2009-11-15
We present a heuristic approach to solve a general framework of serial-parallel redundancy problem where the reliability of the system is maximized subject to some general linear constraints. The complexity of the redundancy problem is generally considered to be NP-Hard and the optimal solution is not normally available. Therefore, to evaluate the performance of the proposed method, a hybrid genetic algorithm is also implemented whose parameters are calibrated via Taguchi's robust design method. Then, various test problems are solved and the computational results indicate that the proposed heuristic approach could provide us some promising reliabilities, which are fairly close to optimal solutions in a reasonable amount of time.
Parallel MOPEX: Computing Mosaics of Large-Area Spitzer Surveys on a Cluster Computer
Directory of Open Access Journals (Sweden)
Joseph C. Jacob
2007-01-01
The Spitzer Science Center's MOPEX software is a part of the Spitzer Space Telescope's operational pipeline that enables detection of cosmic ray collisions with the detector array, masking of the corrupted pixels due to these collisions, subsequent mosaicking of image fields, and extraction of point sources to create catalogs of celestial objects. This paper reports on our experiences in parallelizing the parts of MOPEX related to cosmic ray rejection and mosaicking on a 1,024-processor cluster computer at NASA's Jet Propulsion Laboratory. The architecture and performance of the new Parallel MOPEX software are described. This work was done in order to rapidly mosaic the IRAC shallow survey data, covering a region of the sky observed with one of Spitzer's infrared instruments for the study of galaxy clusters, large-scale structure, and brown dwarfs.
Two applications of parallel processing in power system computation
Energy Technology Data Exchange (ETDEWEB)
Lemaitre, C.; Thomas, B. [Electricite de France, 92 - Clamart (France). Research and Development Div.
1996-12-31
Performance improvements achieved in two power system software modules through the use of parallel processing techniques are discussed. The first software module, EVARISTE, outputs a voltage stability indicator for various power system situations. The second module, MEXICO, assesses power system reliability and operating costs by simulating a large number of contingencies for generation and transmission equipment. Both software modules are well-suited to coarse-grain parallel processing. The first module was parallelized on a distributed-memory machine and the second on a shared-memory machine. A description of the parallelization process used in these two cases is presented, and details on the performance levels achieved are discussed, including aspects of programming, parameter selection, and machine characteristics. (author) 7 refs.
Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena
2016-07-01
In real-life, multi-objective engineering design problems are very tough and time consuming optimization problems due to their high degree of nonlinearities, complexities and inhomogeneity. Nature-inspired based multi-objective optimization algorithms are now becoming popular for solving multi-objective engineering design problems. This paper proposes original multi-objective Bat algorithm (MOBA) and its extended form, namely, novel parallel hybrid multi-objective Bat algorithm (PHMOBA) to generate shortest length Golomb ruler called optimal Golomb ruler (OGR) sequences at a reasonable computation time. The OGRs found their application in optical wavelength division multiplexing (WDM) systems as channel-allocation algorithm to reduce the four-wave mixing (FWM) crosstalk. The performances of both the proposed algorithms to generate OGRs as optical WDM channel-allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations conclude that the proposed parallel hybrid multi-objective Bat algorithm works efficiently as compared to original multi-objective Bat algorithm and other existing algorithms to generate OGRs for optical WDM systems. The algorithm PHMOBA to generate OGRs, has higher convergence and success rate than original MOBA. The efficiency improvement of proposed PHMOBA to generate OGRs up to 20-marks, in terms of ruler length and total optical channel bandwidth (TBW) is 100 %, whereas for original MOBA is 85 %. Finally the implications for further research are also discussed.
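The abstract does not define the ruler itself; the usual definition, assumed here, is a set of integer marks whose pairwise differences are all distinct. Under that assumption, the check that any candidate produced by a search algorithm must pass can be sketched in a few lines of Python. The function names are hypothetical, and this is a validity test only, not the MOBA or PHMOBA search.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """True when every pair of marks has a different separation."""
    differences = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(differences) == len(set(differences))


def ruler_length(marks):
    """Distance between the first and the last mark."""
    return max(marks) - min(marks)


candidate = [0, 1, 4, 6]   # a well-known 4-mark Golomb ruler
rejected = [0, 1, 2, 4]    # the separation 1 occurs twice
valid = is_golomb_ruler(candidate)
length = ruler_length(candidate)
```

In the channel-allocation setting described above, the marks would stand for channel positions, so distinct separations correspond to unequal channel spacings; the mapping from marks to actual frequencies is outside this sketch.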
Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena
2017-02-01
In real-life, multi-objective engineering design problems are very tough and time consuming optimization problems due to their high degree of nonlinearities, complexities and inhomogeneity. Nature-inspired based multi-objective optimization algorithms are now becoming popular for solving multi-objective engineering design problems. This paper proposes original multi-objective Bat algorithm (MOBA) and its extended form, namely, novel parallel hybrid multi-objective Bat algorithm (PHMOBA) to generate shortest length Golomb ruler called optimal Golomb ruler (OGR) sequences at a reasonable computation time. The OGRs found their application in optical wavelength division multiplexing (WDM) systems as channel-allocation algorithm to reduce the four-wave mixing (FWM) crosstalk. The performances of both the proposed algorithms to generate OGRs as optical WDM channel-allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations conclude that the proposed parallel hybrid multi-objective Bat algorithm works efficiently as compared to original multi-objective Bat algorithm and other existing algorithms to generate OGRs for optical WDM systems. The algorithm PHMOBA to generate OGRs, has higher convergence and success rate than original MOBA. The efficiency improvement of proposed PHMOBA to generate OGRs up to 20-marks, in terms of ruler length and total optical channel bandwidth (TBW) is 100 %, whereas for original MOBA is 85 %. Finally the implications for further research are also discussed.
Directory of Open Access Journals (Sweden)
Ramazan Kurt,
2012-06-01
Experimental parallel strand lumbers (PSLs) were manufactured from fast growing rotary peeled I-214 (Populus x euramericana) and I-77/51 (Populus deltoides) hybrid poplar clones veneer strands with melamine urea formaldehyde (MUF) adhesive. The results showed that hybrid poplar clones can be used in PSLs manufacturing. Physical and mechanical properties of PSLs were affected by clone types. The I-77/51 clone had better properties and was found to be more suitable for PSLs manufacturing compared to the I-214 clone. PSLs properties were higher than those of solid woods (SWs) and laminated veneer lumbers (LVLs) of the same poplar clones. This increase may be due to materials, densification as a result of high pressure use, and the manufacturing techniques. The degree of contribution of SWs properties to the PSLs properties was lower than that of LVLs. This indicated that factors other than SWs properties played more important roles in the strength increase of PSLs.
Institute of Scientific and Technical Information of China (English)
Huang Miao-hua; Jin Guo-dong
2003-01-01
Hierarchical Structure Fuzzy Logic Control (HSFLC) strategies for torque distribution in a Parallel Hybrid Electric Vehicle (PHEV) have been studied across the vehicle's modes of operation, i.e., acceleration, cruise, deceleration, etc. Using a secondary development of the hybrid vehicle simulation tool ADVISOR, a dynamic model of the PHEV has been set up in MATLAB/SIMULINK. The engine, motor and battery characteristics have been studied. Simulation results show that the proposed hierarchical structured fuzzy logic control strategy is effective over the entire operating range of the vehicle in terms of fuel economy. Based on analysis of the simulation results and on drivers' experience, a fuzzy controller is designed and developed to control the torque distribution. The controller is evaluated on a hardware-in-the-loop simulator (HILS). The results confirm the value of the controller.
Institute of Scientific and Technical Information of China (English)
WANG Bing; SHU Jiwu; ZHENG Weimin; WANG Jinzhao; CHEN Min
2005-01-01
A hybrid decomposition method for molecular dynamics simulations was presented, using simultaneously spatial decomposition and force decomposition to fit the architecture of a cluster of symmetric multi-processor (SMP) nodes. The method distributes particles between nodes based on the spatial decomposition strategy to reduce inter-node communication costs. The method also partitions particle pairs within each node using the force decomposition strategy to improve the load balance for each node. Simulation results for a nucleation process with 4 000 000 particles show that the hybrid method achieves better parallel performance than either spatial or force decomposition alone, especially when applied to a large scale particle system with non-uniform spatial density.
Parameters Design for a Parallel Hybrid Electric Bus Using Regenerative Brake Model
Directory of Open Access Journals (Sweden)
Zilin Ma
2014-01-01
A design methodology which uses a regenerative brake model is introduced to determine the major system parameters of a parallel hybrid electric bus drive train. The hybrid system parameters mainly include the power rating of the internal combustion engine (ICE), the gear ratios of the transmission, the power rating and maximal torque of the motor, and the power and capacity of the battery. The regenerative model is built into the vehicle model to estimate the regenerative energy under real road conditions. The design target is to ensure that the vehicle meets the specified vehicle performance, such as speed and acceleration, and at the same time operates the ICE within an expected speed range. Several pairs of parameters are selected from the result analysis, and the fuel saving result in the road test shows that a 25% reduction in fuel consumption is achieved.
Numerical Simulation of Multi-phase Flow in Porous Media on Parallel Computers
Liu, Hui; Chen, Zhangxin; Luo, Jia; Deng, Hui; He, Yanfeng
2016-01-01
This paper is concerned with developing parallel computational methods for two-phase flow on distributed parallel computers; techniques for linear solvers and nonlinear methods are studied, and the standard and inexact Newton methods are investigated. A multi-stage preconditioner for two-phase flow is proposed and advanced matrix processing strategies are implemented. Numerical experiments show that these computational methods are scalable and efficient, and are capable of simulating large-scale problems with tens of millions of grid blocks using thousands of CPU cores on parallel computers. The nonlinear techniques, preconditioner and matrix processing strategies can also be applied to three-phase black oil, compositional and thermal models.
DEFF Research Database (Denmark)
Ghzaiel, Walid; Jebali-Ben Ghorbal, Manel; Slama-Belkhodja, Ilhem
2014-01-01
This paper presents a hybrid islanding detection algorithm integrated on the distributed generation unit closest to the point of common coupling of a Microgrid based on parallel inverters, one of which is responsible for controlling the system. The method is based on resonance excitation under...... parameters, both resistive and inductive parts, from the injected resonance frequency determination. Finally, the inverter will disconnect the microgrid from the faulty grid and reconnect the parallel inverter system to the controllable distributed system in order to ensure high power quality. This paper...... shows that grid impedance variation detection estimation can be an efficient method for islanding detection in microgrid systems. Theoretical analysis and simulation results are presented to validate the proposed method....
High Performance Input/Output for Parallel Computer Systems
Ligon, W. B.
1996-01-01
The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three-year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, the Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX-like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications from levels 1, 2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.
Generalised Computability and Applications to Hybrid Systems
DEFF Research Database (Denmark)
Korovina, Margarita V.; Kudinov, Oleg V.
2001-01-01
We investigate the concept of generalised computability of operators and functionals defined on the set of continuous functions, firstly introduced in [9]. By working in the reals, with equality and without equality, we study properties of generalised computable operators and functionals. Also we...
Chaining direct memory access data transfer operations for compute nodes in a parallel computer
Archer, Charles J.; Blocksome, Michael A.
2010-09-28
Methods, systems, and products are disclosed for chaining DMA data transfer operations for compute nodes in a parallel computer that include: receiving, by an origin DMA engine on an origin node in an origin injection FIFO buffer for the origin DMA engine, a RGET data descriptor specifying a DMA transfer operation data descriptor on the origin node and a second RGET data descriptor on the origin node, the second RGET data descriptor specifying a target RGET data descriptor on the target node, the target RGET data descriptor specifying an additional DMA transfer operation data descriptor on the origin node; creating, by the origin DMA engine, an RGET packet in dependence upon the RGET data descriptor, the RGET packet containing the DMA transfer operation data descriptor and the second RGET data descriptor; and transferring, by the origin DMA engine to a target DMA engine on the target node, the RGET packet.
Narayanan, Kiran
2012-07-17
A hybrid parallelization method composed of a coarse-grained genetic algorithm (GA) and fine-grained objective function evaluations is implemented on a heterogeneous computational resource consisting of 16 IBM Blue Gene/P racks, a single x86 cluster node and a high-performance file system. The GA iterator is coupled with a finite-element (FE) analysis code developed in house to facilitate computational steering in order to calculate the optimal impact velocities of a projectile colliding with a polyurea/structural steel composite plate. The FE code is capable of capturing adiabatic shear bands and strain localization, which are typically observed in high-velocity impact applications, and it includes several constitutive models of plasticity, viscoelasticity and viscoplasticity for metals and soft materials, which allow simulation of ductile fracture by void growth. A strong scaling study of the FE code was conducted to determine the optimum number of processes run in parallel. The relative efficiency of the hybrid, multi-level parallelization method is studied in order to determine the parameters for the parallelization. Optimal impact velocities of the projectile calculated using the proposed approach, are reported. © The Author(s) 2012.
Automated Parallel Computing Tools for Multicore Machines and Clusters Project
National Aeronautics and Space Administration — We propose to improve productivity of high performance computing for applications on multicore computers and clusters. These machines built from one or more chips...
Locating hardware faults in a data communications network of a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-01-12
Hardware fault location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute nodes as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.
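The decision rule in this abstract (the parent has a defective link when the suite fails on the parent test tree but succeeds on every child test tree) can be sketched as follows. This is only an illustration under stated assumptions: the tree is a plain dictionary, and `make_suite` is a hypothetical stand-in for a real test suite that simply fails whenever a known-broken link lies inside the tested subtree.

```python
def subtree_links(tree, root):
    """Collect all (parent, child) links in the subtree rooted at root."""
    links, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            links.append((node, child))
            stack.append(child)
    return links

def make_suite(tree, broken_links):
    """Hypothetical test suite: passes on a subtree iff it holds no broken link."""
    def run_suite(root):
        return not any(link in broken_links for link in subtree_links(tree, root))
    return run_suite

def parents_with_defective_link(tree, root, run_suite):
    """Flag each parent whose test tree fails while all its child test trees pass."""
    flagged, stack = [], [root]
    while stack:
        parent = stack.pop()
        children = tree.get(parent, [])
        if children and not run_suite(parent) and all(run_suite(c) for c in children):
            flagged.append(parent)
        stack.extend(children)
    return flagged

# Illustrative tree: node 0 is the root; the link from node 1 to node 3 is broken.
tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
suite = make_suite(tree, broken_links={(1, 3)})
faulty = parents_with_defective_link(tree, 0, suite)
```

In this example the suite also fails on the root's tree, but the root is not flagged because one of its child trees fails too; only node 1 satisfies the rule.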
Comparative Simulation Study of Production Scheduling in the Hybrid and the Parallel Flow
Directory of Open Access Journals (Sweden)
Varela Maria L.R.
2017-06-01
Scheduling is one of the most important decisions in production control. An approach is proposed for supporting users in solving scheduling problems by choosing the combination of physical manufacturing system configuration and material handling system settings. The approach considers two alternative manufacturing scheduling configurations in a two-stage product-oriented manufacturing system, exploring the hybrid flow shop (HFS) and the parallel flow shop (PFS) environments. To illustrate the application of the proposed approach, an industrial case from the automotive components industry is studied. The main aim of this research is to compare production scheduling results in the hybrid and the parallel flow, taking into account the makespan minimization criterion. Thus the HFS and PFS performance is compared and analyzed, mainly in terms of the makespan, as the transportation times vary. The study shows that the performance of the HFS is clearly better when the work stations' processing times are unbalanced, either in nature or as a consequence of adding transport times to just one of the work station processing times. The HFS loses this advantage, becoming worse than the PFS configuration, when the work stations' processing times are balanced, either in nature or as a consequence of transport times added to the work stations' processing times. This means that physical layout configurations, along with the way transport times are included in the work stations' processing times, should be carefully taken into consideration because of their influence on the performance reached by both HFS and PFS configurations.
Parallel computing for simultaneous iterative tomographic imaging by graphics processing units
Bello-Maldonado, Pedro D.; López, Ricardo; Rogers, Colleen; Jin, Yuanwei; Lu, Enyue
2016-05-01
In this paper, we address the problem of accelerating inversion algorithms for nonlinear acoustic tomographic imaging by parallel computing on graphics processing units (GPUs). Nonlinear inversion algorithms for tomographic imaging often rely on iterative algorithms for solving an inverse problem and are thus computationally intensive. We study the simultaneous iterative reconstruction technique (SIRT) for the multiple-input-multiple-output (MIMO) tomography algorithm, which enables parallel computation of the grid points as well as parallel execution of multiple source excitations. Using GPUs and the Compute Unified Device Architecture (CUDA) programming model, an overall improvement of 26.33x over sequential algorithms was achieved when combining both approaches. Furthermore, we propose an adaptive iterative relaxation factor and the use of non-uniform weights to improve the overall convergence of the algorithm. Using these techniques, fast computations can be performed in parallel without loss of image quality during the reconstruction process.
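For readers unfamiliar with SIRT, here is a minimal serial sketch of the textbook linear update x ← x + λ C Aᵀ R (b − A x), where R and C hold inverse row and column sums of A. It is an assumption-laden illustration, not the paper's nonlinear MIMO algorithm: the relaxation factor is fixed rather than adaptive, the weights are the standard uniform ones, and the tiny matrix is made up. The point it shows is that every unknown is updated from the same residual, which is why the per-grid-point work can run in parallel.

```python
def sirt(A, b, iterations=100, relaxation=1.0):
    """Textbook SIRT for A x = b with non-negative A (nonzero row/column sums assumed)."""
    m, n = len(A), len(A[0])
    row_sums = [sum(row) for row in A]
    col_sums = [sum(A[i][j] for i in range(m)) for j in range(n)]
    x = [0.0] * n
    for _ in range(iterations):
        # Row-normalized residual, computed once per iteration from the current x.
        residual = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n))) / row_sums[i]
            for i in range(m)
        ]
        # Each x[j] depends only on the shared residual, so these updates are independent.
        for j in range(n):
            back_projection = sum(A[i][j] * residual[i] for i in range(m))
            x[j] += relaxation * back_projection / col_sums[j]
    return x

# Made-up consistent system whose exact solution is x = [1, 2].
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
x = sirt(A, b)
```

On a GPU the inner loop over `j` would typically map to one thread per unknown; that mapping is not shown here.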
Weeks, Cindy Lou
1986-01-01
Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
Robust Fuzzy PD Method with Parallel Computed Fuel Ratio Estimation Applied to Automotive Engine
Directory of Open Access Journals (Sweden)
Farzin Piltan
2013-07-01
Both fuzzy logic and computed fuel ratio can compensate for the steady-state error of the proportional-derivative (PD) method. This paper presents parallel computed fuel ratio compensation for fuzzy plus PID control management with application to an internal combustion (IC) engine. The asymptotic stability of the fuzzy plus PID control methodology with first-order computed fuel ratio estimation in the parallel structure is proven. For the parallel structure, finite-time convergence with a super-twisting second-order sliding mode is guaranteed.
Enhancing e-Infrastructures with Advanced Technical Computing Parallel MATLAB® on the Grid
Chakravarti, A; Laure, E; Jouvin, M; Philippon, G; Loomis, C; Floros, E
2008-01-01
MATLAB® is widely used within the engineering and scientific fields as the language and environment for technical computing, while collaborative Grid computing on e-Infrastructures is used by scientific communities to deliver a faster time to solution. MATLAB allows users to express parallelism in their applications, and then execute code on multiprocessor environments such as large-scale e-Infrastructures. This paper demonstrates the integration of MATLAB and Grid technology with a representative implementation that uses gLite middleware to run parallel programs. Experimental results highlight the increases in productivity and performance that users obtain with MATLAB parallel computing on Grids.
Energy Technology Data Exchange (ETDEWEB)
Patil, Chinmaya; Naghshtabrizi, Payam; Verma, Rajeev; Tang, Zhijun; Smith, Kandler; Shi, Ying
2016-08-01
This paper presents a control strategy to maximize fuel economy of a parallel hybrid electric vehicle over a target life of the battery. Many approaches to maximizing fuel economy of parallel hybrid electric vehicle do not consider the effect of control strategy on the life of the battery. This leads to an oversized and underutilized battery. There is a trade-off between how aggressively to use and 'consume' the battery versus to use the engine and consume fuel. The proposed approach addresses this trade-off by exploiting the differences in the fast dynamics of vehicle power management and slow dynamics of battery aging. The control strategy is separated into two parts, (1) Predictive Battery Management (PBM), and (2) Predictive Power Management (PPM). PBM is the higher level control with slow update rate, e.g. once per month, responsible for generating optimal set points for PPM. The considered set points in this paper are the battery power limits and State Of Charge (SOC). The problem of finding the optimal set points over the target battery life that minimize engine fuel consumption is solved using dynamic programming. PPM is the lower level control with high update rate, e.g. a second, responsible for generating the optimal HEV energy management controls and is implemented using model predictive control approach. The PPM objective is to find the engine and battery power commands to achieve the best fuel economy given the battery power and SOC constraints imposed by PBM. Simulation results with a medium duty commercial hybrid electric vehicle and the proposed two-level hierarchical control strategy show that the HEV fuel economy is maximized while meeting a specified target battery life. On the other hand, the optimal unconstrained control strategy achieves marginally higher fuel economy, but fails to meet the target battery life.
Compact and Wideband Parallel-Strip 180° Hybrid Coupler with Arbitrary Power Division Ratios
Directory of Open Access Journals (Sweden)
Leung Chiu
2013-01-01
This paper presents a class of wideband 180° hybrid (rat-race) couplers implemented with parallel-strip line. By replacing the 270° arm of a conventional 180° hybrid coupler with a 90° arm with a phase inverter, the bandwidth of the coupler is greatly enhanced and the total circuit size is reduced by almost half. Simple design formulas relating the characteristic impedance of the arms to the power division ratio are derived. To demonstrate the concept, four couplers with power division ratios of 1, 2, 4, and 8 were designed, fabricated, and tested. The S-parameters of the couplers are simulated and measured with good agreement. All working prototypes operate over more than 112% impedance bandwidth with more than 25 dB port-to-port isolation and less than 5° absolute phase imbalance. The proposed 180° hybrid couplers can be employed as wideband in-phase/differential power dividers/combiners, which are essential for many RF and microwave subsystem designs.
Eroglu, Duygu Yilmaz; Ozmutlu, H Cenk
2014-01-01
We developed mixed integer programming (MIP) models and hybrid genetic-local search algorithms for the scheduling problem of unrelated parallel machines with job sequence- and machine-dependent setup times and with the job splitting property. The first contribution of this paper is to introduce novel algorithms which perform splitting and scheduling simultaneously with a variable number of subjobs. We propose a simple chromosome structure constituted by random key numbers in the hybrid genetic-local search algorithm (GAspLA). Random key numbers are used frequently in genetic algorithms, but they create additional difficulty when hybrid factors in local search are implemented. We developed algorithms that adapt the results of local search into the genetic algorithm with a minimum number of relocation operations on the genes' random key numbers. This is the second contribution of the paper. The third contribution is three new MIP models which perform splitting and scheduling simultaneously. The fourth contribution is the implementation of GAspLAMIP, which lets us verify the optimality of GAspLA for the studied combinations. The proposed methods are tested on a set of problems taken from the literature, and the results validate the effectiveness of the proposed algorithms.
Marek, A; Blum, V; Johanni, R; Havu, V; Lang, B; Auckenthaler, T; Heinecke, A; Bungartz, H-J; Lederer, H
2014-05-28
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as $O(N^3)$ with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding backtransformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMP implementation available as well. Scalability beyond 10,000 CPU cores for problem
MEDUSA - An overset grid flow solver for network-based parallel computer systems
Smith, Merritt H.; Pallis, Jani M.
1993-01-01
Continuing improvement in processing speed has made it feasible to solve the Reynolds-Averaged Navier-Stokes equations for simple three-dimensional flows on advanced workstations. Combining multiple workstations into a network-based heterogeneous parallel computer allows the application of programming principles learned on MIMD (Multiple Instruction Multiple Data) distributed memory parallel computers to the solution of larger problems. An overset-grid flow solution code has been developed which uses a cluster of workstations as a network-based parallel computer. Inter-process communication is provided by the Parallel Virtual Machine (PVM) software. Solution speed equivalent to one-third of a Cray-YMP processor has been achieved from a cluster of nine commonly used engineering workstation processors. Load imbalance and communication overhead are the principal impediments to parallel efficiency in this application.
A parallel finite-difference method for computational aerodynamics
Swisshelm, Julie M.
1989-01-01
A finite-difference scheme for solving complex three-dimensional aerodynamic flow on parallel-processing supercomputers is presented. The method consists of a basic flow solver with multigrid convergence acceleration, embedded grid refinements, and a zonal equation scheme. Multitasking and vectorization have been incorporated into the algorithm. Results obtained include multiprocessed flow simulations from the Cray X-MP and Cray-2. Speedups as high as 3.3 for the two-dimensional case and 3.5 for segments of the three-dimensional case have been achieved on the Cray-2. The entire solver attained a factor of 2.7 improvement over its unitasked version on the Cray-2. The performance of the parallel algorithm on each machine is analyzed.
Efficient Multidimensional Data Redistribution for Resizable Parallel Computations
Sudarsan, Rajesh
2007-01-01
Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in under-utilization of idle system resources thereby decreasing overall system throughput. In our research, we have developed a prototype framework called ReSHAPE, which supports dynamic resizing of parallel MPI applications executing on distributed memory platforms. The resizing library in ReSHAPE includes support for releasing and acquiring processors and efficiently redistributing application state to a new set of processors. In this paper, we derive an algorithm for redistributing two-dimensional block-cyclic arrays from $P$ to $Q$ processors, organized as 2-D processor grids. The algorithm ensures a contention-free communication schedule for data redistribution if $P_r \leq Q_r$ and $P_c \leq Q_c$. In other cases, the algorithm implements circular row and column shifts on the communicat...
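As background for the redistribution problem, the sketch below uses the standard 2-D block-cyclic ownership rule (block row index modulo the number of process rows, and likewise for columns) to count how many array elements each old-grid process would have to send to each new-grid process. It is a hedged illustration only: the function names are invented here, and it does not reproduce the contention-free communication schedule derived in the paper.

```python
from collections import Counter

def owner(i, j, block, grid):
    """Grid coordinates of the process owning element (i, j) in a 2-D block-cyclic layout."""
    (mb, nb), (pr, pc) = block, grid
    return ((i // mb) % pr, (j // nb) % pc)

def transfer_counts(n_rows, n_cols, block, old_grid, new_grid):
    """Count elements moving from each old-grid owner to each new-grid owner."""
    counts = Counter()
    for i in range(n_rows):
        for j in range(n_cols):
            src = owner(i, j, block, old_grid)
            dst = owner(i, j, block, new_grid)
            counts[(src, dst)] += 1
    return counts

# Illustrative resize: a 4x4 array with 1x1 blocks grows from a 1x2 grid to a 2x2 grid.
moves = transfer_counts(4, 4, block=(1, 1), old_grid=(1, 2), new_grid=(2, 2))
```

Here each of the two old processes keeps a share of its data and sends the rest to one new process; a real redistribution would additionally order these messages to avoid contention.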
Cost Optimization Using Hybrid Evolutionary Algorithm in Cloud Computing
Directory of Open Access Journals (Sweden)
B. Kavitha
2015-07-01
The main aim of this research is to design a hybrid evolutionary algorithm for minimizing multiple problems of dynamic resource allocation in cloud computing. Resource allocation is one of the major problems in distributed systems when the client wants to decrease the cost of allocating resources to a task. To assign resources to a task, the client must consider both the monetary cost and the computational cost, and allocating resources while considering both costs is difficult. To solve this problem, we divide the client's main task into many subtasks and allocate resources to each subtask instead of selecting a single resource for the main task. The allocation of resources to each subtask is carried out by our proposed hybrid optimization algorithm. We hybridize the Binary Particle Swarm Optimization (BPSO) and Binary Cuckoo Search (BCSO) algorithms, considering monetary cost and computational cost, which helps to minimize the cost to the client. Finally, experiments are carried out and the proposed hybrid algorithm is compared with the BPSO and BCSO algorithms, demonstrating the efficiency of the proposed hybrid optimization algorithm.
Design of a Parallel Robotic Manipulator using Evolutionary Computing
António M. Lopes; Solteiro Pires, E. J.; Manuel R. Barbosa
2012-01-01
In this paper the kinematic design of a 6-dof parallel robotic manipulator is analysed. Firstly, the condition number of the inverse kinematic Jacobian is considered as the objective function, measuring the manipulator's dexterity, and a genetic algorithm is used to solve the optimization problem. In a second approach, a neural network model of the analytical objective function is developed and subsequently used as the objective function in the genetic algorithm optimization search process. It...
A parallel code for multiprecision computations of the Lane-Emden differential equation
Geroyannis, Vassilis S
2016-01-01
We compute multiprecision solutions of the Lane-Emden equation. This differential equation arises when the well-known polytropic model is introduced into the equation of hydrostatic equilibrium for a nondistorted star. Because such multiprecision computations are time-consuming, we apply parallel programming techniques to the problem, which drastically reduces the execution time.
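For orientation, the Lane-Emden equation for polytropic index n reads θ'' + (2/ξ)θ' + θⁿ = 0 with θ(0) = 1 and θ'(0) = 0. The sketch below integrates it serially in extended-precision arithmetic. Everything about the method is an assumption made for illustration, not the authors' code: Python's standard `decimal` module stands in for a multiprecision library, the integrator is a fixed-step classical Runge-Kutta scheme, n is restricted to non-negative integers, and the start point ξ = 0.1 is seeded from the standard series θ ≈ 1 − ξ²/6 + nξ⁴/120 to step around the singularity at ξ = 0. The accuracy is limited by the step size, not by the number of digits carried.

```python
from decimal import Decimal, getcontext

def lane_emden(n, xi_end, h=Decimal("0.01"), digits=50):
    """Integrate the Lane-Emden equation from xi = 0.1 to xi_end; return theta(xi_end)."""
    getcontext().prec = digits
    xi = Decimal("0.1")
    # Series expansion about the origin supplies the starting values.
    theta = 1 - xi**2 / 6 + n * xi**4 / 120
    dtheta = -xi / 3 + n * xi**3 / 30

    def rhs(x, t, d):
        # First-order system: theta' = d, d' = -(2/x) d - theta^n.
        return d, -2 * d / x - t**n

    while xi < xi_end:
        k1t, k1d = rhs(xi, theta, dtheta)
        k2t, k2d = rhs(xi + h / 2, theta + h / 2 * k1t, dtheta + h / 2 * k1d)
        k3t, k3d = rhs(xi + h / 2, theta + h / 2 * k2t, dtheta + h / 2 * k2d)
        k4t, k4d = rhs(xi + h, theta + h * k3t, dtheta + h * k3d)
        theta += h / 6 * (k1t + 2 * k2t + 2 * k3t + k4t)
        dtheta += h / 6 * (k1d + 2 * k2d + 2 * k3d + k4d)
        xi += h
    return theta

# For n = 1 the exact solution is sin(xi)/xi, which gives a simple sanity check at xi = 1.
theta_at_one = lane_emden(1, Decimal("1"))
```

Runs for different indices n or different precisions are independent of one another, which is one obvious place where parallel execution could be applied; the paper's actual parallelization scheme is not described in the abstract.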
Applications of Parallel Computational Methods to Charged-Particle Beam Dynamics
Energy Technology Data Exchange (ETDEWEB)
Kabel, A.; Cai, Y.; /SLAC; Dohlus, M.; /DESY; Sen, T.; /Fermilab; Uplenchwar, R.; /SLAC /DESY
2007-10-16
The availability of parallel computation hardware and the advent of standardized programming interfaces have made a new class of beam dynamics problems accessible to numerical simulation. We describe recent progress in code development for simulations of coherent synchrotron radiation and the weak-strong and strong-strong beam-beam interaction. Parallelization schemes will be discussed, and typical results will be presented.
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1993-01-01
This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
Scalable Parallelization of Skyline Computation for Multi-core Processors
DEFF Research Database (Denmark)
Chester, Sean; Sidlauskas, Darius; Assent, Ira
2015-01-01
The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multi-core CPU algorithms have been proposed to accelerate the computation o...
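The skyline operator described above can be illustrated with a simple quadratic-time sequential baseline. This sketch assumes that smaller values are better in every dimension; it is not the multi-core algorithm of the paper, only the definition made executable.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Return the points not dominated by any other point (smaller is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up two-dimensional data, e.g. (price, distance).
data = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
result = skyline(data)
```

Each point's dominance test is independent of the others, which hints at why the problem is amenable to multi-core parallelization.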
Paging memory from random access memory to backing storage in a parallel computer
Archer, Charles J; Blocksome, Michael A; Inglett, Todd A; Ratterman, Joseph D; Smith, Brian E
2013-05-21
Paging memory from random access memory (`RAM`) to backing storage in a parallel computer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node.
Computer simulation program for parallel SITAN. [Sandia Inertia Terrain-Aided Navigation, in FORTRAN
Energy Technology Data Exchange (ETDEWEB)
Andreas, R.D.; Sheives, T.C.
1980-11-01
This computer program simulates the operation of parallel SITAN using digitized terrain data. An actual trajectory is modeled including the effects of inertial navigation errors and radar altimeter measurements.
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].
Furuta, Takuya; Sato, Tatsuhiko
2015-01-01
Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.
Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda A [Rochester, MN; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN
2012-01-10
Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Line-plane broadcasting in a data communications network of a parallel computer
Archer, Charles J.; Berg, Jeremy E.; Blocksome, Michael A.; Smith, Brian E.
2010-11-23
Methods, apparatus, and products are disclosed for line-plane broadcasting in a data communications network of a parallel computer, the parallel computer comprising a plurality of compute nodes connected together through the network, the network optimized for point to point data communications and characterized by at least a first dimension, a second dimension, and a third dimension, that include: initiating, by a broadcasting compute node, a broadcast operation, including sending a message to all of the compute nodes along an axis of the first dimension for the network; sending, by each compute node along the axis of the first dimension, the message to all of the compute nodes along an axis of the second dimension for the network; and sending, by each compute node along the axis of the second dimension, the message to all of the compute nodes along an axis of the third dimension for the network.
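The three-stage pattern in this abstract (line, then plane, then volume) can be sketched as a toy simulation. This models only who sends to whom on a hypothetical 3D grid of nodes; the function name, the grid size, and the bookkeeping are assumptions, and no real network routing is modeled.

```python
from itertools import product


def line_plane_broadcast(dims, root):
    """Simulate a broadcast on an X x Y x Z grid of nodes.

    Returns (received, sends): received maps each node to the stage at which
    it obtained the message (0 for the root); sends lists (sender, receiver).
    """
    X, Y, Z = dims
    x0, y0, z0 = root
    received = {root: 0}
    sends = []

    # Stage 1: the root sends along the axis of the first dimension.
    line = [(x, y0, z0) for x in range(X)]
    for node in line:
        if node != root:
            sends.append((root, node))
            received[node] = 1

    # Stage 2: every node on that line sends along the second dimension.
    plane = []
    for (x, _, _) in line:
        for y in range(Y):
            node = (x, y, z0)
            plane.append(node)
            if y != y0:
                sends.append(((x, y0, z0), node))
                received[node] = 2

    # Stage 3: every node in that plane sends along the third dimension.
    for (x, y, _) in plane:
        for z in range(Z):
            if z != z0:
                node = (x, y, z)
                sends.append(((x, y, z0), node))
                received[node] = 3

    return received, sends


dims = (4, 3, 2)
received, sends = line_plane_broadcast(dims, root=(1, 0, 1))
print(len(received), "nodes reached with", len(sends), "point-to-point sends")
```

In this model every non-root node receives the message exactly once, so the number of sends is the node count minus one.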
Line-plane broadcasting in a data communications network of a parallel computer
Archer, Charles J.; Berg, Jeremy E.; Blocksome, Michael A.; Smith, Brian E.
2010-06-08
Methods, apparatus, and products are disclosed for line-plane broadcasting in a data communications network of a parallel computer, the parallel computer comprising a plurality of compute nodes connected together through the network, the network optimized for point to point data communications and characterized by at least a first dimension, a second dimension, and a third dimension, that include: initiating, by a broadcasting compute node, a broadcast operation, including sending a message to all of the compute nodes along an axis of the first dimension for the network; sending, by each compute node along the axis of the first dimension, the message to all of the compute nodes along an axis of the second dimension for the network; and sending, by each compute node along the axis of the second dimension, the message to all of the compute nodes along an axis of the third dimension for the network.
Load flow computations in hybrid transmission - distributed power systems
Wobbes, E.D.; Lahaye, D.J.P.
2013-01-01
We interconnect transmission and distribution power systems and perform load flow computations in the hybrid network. In the largest example we managed to build, fifty copies of a distribution network consisting of fifteen nodes are connected to the UCTE study model, resulting in a system consisting
THE IMPROVEMENT OF THE COMPUTATIONAL PERFORMANCE OF THE ZONAL MODEL POMA USING PARALLEL TECHNIQUES
Directory of Open Access Journals (Sweden)
Yao Yu
2014-01-01
The zonal modeling approach is a new simplified computational method used to predict temperature distribution and energy in multi-zone buildings, as well as the thermal behavior of indoor airflow. Although this approach is known to use fewer computer resources than CFD models, computational time is still an issue, especially when buildings are characterized by complicated geometry and indoor layout of furnishings. Therefore, applying a new computing technique to the current zonal models in order to reduce the computational time is a promising way to further improve model performance and promote the wide application of zonal models. Parallel computing techniques provide a way to accomplish these purposes. Unlike the serial computations that are commonly used in the current zonal models, these parallel techniques decompose the serial program into several discrete instructions which can be executed simultaneously on different processors/threads. As a result, the computational time of the parallelized program can be significantly reduced compared to that of the traditional serial program. In this article, a parallel computing technique, Open Multi-Processing (OpenMP), is applied to the zonal model Pressurized zOnal Model with the Air diffuser (POMA) in order to improve the model's computational performance, including the reduction of computational time and the investigation of the model's scalability.
Self-pacing direct memory access data transfer operations for compute nodes in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A
2015-02-17
Methods, apparatus, and products are disclosed for self-pacing DMA data transfer operations for nodes in a parallel computer that include: transferring, by an origin DMA on an origin node, a RTS message to a target node, the RTS message specifying a message on the origin node for transfer to the target node; receiving, in an origin injection FIFO for the origin DMA from a target DMA on the target node in response to transferring the RTS message, a target RGET descriptor followed by a DMA transfer operation descriptor, the DMA descriptor for transmitting a message portion to the target node, the target RGET descriptor specifying an origin RGET descriptor on the origin node that specifies an additional DMA descriptor for transmitting an additional message portion to the target node; processing, by the origin DMA, the target RGET descriptor; and processing, by the origin DMA, the DMA transfer operation descriptor.
Low latency, high bandwidth data communications between compute nodes in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A
2014-04-01
Methods, systems, and products are disclosed for data transfers between nodes in a parallel computer that include: receiving, by an origin DMA on an origin node, a buffer identifier for a buffer containing data for transfer to a target node; sending, by the origin DMA to the target node, a RTS message; transferring, by the origin DMA, a data portion to the target node using a memory FIFO operation that specifies one end of the buffer from which to begin transferring the data; receiving, by the origin DMA, an acknowledgement of the RTS message from the target node; and transferring, by the origin DMA in response to receiving the acknowledgement, any remaining data portion to the target node using a direct put operation that specifies the other end of the buffer from which to begin transferring the data, including initiating the direct put operation without invoking an origin processing core.
Low latency, high bandwidth data communications between compute nodes in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A
2014-04-22
Methods, systems, and products are disclosed for data transfers between nodes in a parallel computer that include: receiving, by an origin DMA on an origin node, a buffer identifier for a buffer containing data for transfer to a target node; sending, by the origin DMA to the target node, a RTS message; transferring, by the origin DMA, a data portion to the target node using a memory FIFO operation that specifies one end of the buffer from which to begin transferring the data; receiving, by the origin DMA, an acknowledgement of the RTS message from the target node; and transferring, by the origin DMA in response to receiving the acknowledgement, any remaining data portion to the target node using a direct put operation that specifies the other end of the buffer from which to begin transferring the data, including initiating the direct put operation without invoking an origin processing core.
Efficient Data-parallel Computations on Distributed Systems
Institute of Scientific and Technical Information of China (English)
无
2002-01-01
Task scheduling determines the performance of NOW computing to a large extent. However, the computer system architecture, computing capability, and system load are rarely considered together. In this paper, a biggest-heterogeneous scheduling algorithm is presented. It fully considers the system characteristics (from the application's view), structure, and state, so it can always utilize all processing resources under a reasonable premise. The experimental results show that the algorithm can significantly shorten the response time of jobs.
New multi-DSP parallel computing architecture for real-time image processing
Institute of Scientific and Technical Information of China (English)
Hu Junhong; Zhang Tianxu; Jiang Haoyang
2006-01-01
The flexibility of traditional image processing systems is limited because those systems are designed for specific applications. In this paper, a new TMS320C64x-based multi-DSP parallel computing architecture is presented. It has many promising characteristics such as powerful computing capability, broad I/O bandwidth, topology flexibility, and expansibility. The parallel system performance is evaluated by practical experiment.
From devil to angel, transmission lines boost parallel computing of linear resistor networks
Wei, Fei
2009-01-01
Transmission lines are always a big trouble for integrated circuit designers; however, they could be of great help to the parallel computing of extremely large linear resistor networks. In this paper, we introduce the virtual transmission method (VTM), which brings virtual transmission lines into linear resistor networks to achieve distributed and asynchronous parallel computing in the virtual time domain. Numerical experiments show that VTM can run efficiently on a 2D or 3D microprocessor with an arbitrary number of cores.
Automatic Choice of Scheduling Heuristics for Parallel/Distributed Computing
Directory of Open Access Journals (Sweden)
Clayton S. Ferner
1999-01-01
Task mapping and scheduling are two very difficult problems that must be addressed when a sequential program is transformed into a parallel program. Since these problems are NP-hard, compiler writers have opted to concentrate their efforts on optimizations that produce immediate gains in performance. As a result, current parallelizing compilers either use very simple methods to deal with task scheduling or they simply ignore it altogether. Unfortunately, the programmer does not have this luxury. The burden of repartitioning or rescheduling, should the compiler produce inefficient parallel code, lies entirely with the programmer. We were able to create an algorithm (called a metaheuristic), which automatically chooses a scheduling heuristic for each input program. The metaheuristic produces better schedules in general than the heuristics upon which it is based. This technique was tested on a suite of real scientific programs written in SISAL and simulated on four different network configurations. Averaged over all of the test cases, the metaheuristic out-performed all eight underlying scheduling algorithms, beating the best one by 2%, 12%, 13%, and 3% on the four separate network configurations. It is able to do this, not always by picking the best heuristic, but rather by avoiding the heuristics when they would produce very poor schedules. For example, while the metaheuristic only picked the best algorithm about 50% of the time for the 100 Gbps Ethernet, its worst decision was only 49% away from optimal. In contrast, the best of the eight scheduling algorithms was optimal 30% of the time, but its worst decision was 844% away from optimal.
Non-parallel processing: Gendered attrition in academic computer science
Cohoon, Joanne Louise Mcgrath
2000-10-01
This dissertation addresses the issue of disproportionate female attrition from computer science as an instance of gender segregation in higher education. By adopting a theoretical framework from organizational sociology, it demonstrates that the characteristics and processes of computer science departments strongly influence female retention. The empirical data identifies conditions under which women are retained in the computer science major at comparable rates to men. The research for this dissertation began with interviews of students, faculty, and chairpersons from five computer science departments. These exploratory interviews led to a survey of faculty and chairpersons at computer science and biology departments in Virginia. The data from these surveys are used in comparisons of the computer science and biology disciplines, and for statistical analyses that identify which departmental characteristics promote equal attrition for male and female undergraduates in computer science. This three-pronged methodological approach of interviews, discipline comparisons, and statistical analyses shows that departmental variation in gendered attrition rates can be explained largely by access to opportunity, relative numbers, and other characteristics of the learning environment. Using these concepts, this research identifies nine factors that affect the differential attrition of women from CS departments. These factors are: (1) The gender composition of enrolled students and faculty; (2) Faculty turnover; (3) Institutional support for the department; (4) Preferential attitudes toward female students; (5) Mentoring and supervising by faculty; (6) The local job market, starting salaries, and competitiveness of graduates; (7) Emphasis on teaching; and (8) Joint efforts for student success. This work contributes to our understanding of the gender segregation process in higher education. In addition, it contributes information that can lead to effective solutions for an
Implementation of a 3D mixing layer code on parallel computers
Energy Technology Data Exchange (ETDEWEB)
Roe, K.; Thakur, R.; Dang, T.; Bogucz, E. [Syracuse Univ., NY (United States)
1995-09-01
This paper summarizes our progress and experience in the development of a Computational-Fluid-Dynamics code on parallel computers to simulate three-dimensional spatially-developing mixing layers. In this initial study, the three-dimensional time-dependent Euler equations are solved using a finite-volume explicit time-marching algorithm. The code was first programmed in Fortran 77 for sequential computers. The code was then converted for use on parallel computers using the conventional message-passing technique, while we have not been able to compile the code with the present version of HPF compilers.
Parallel Computational Intelligence-Based Multi-Camera Surveillance System
Directory of Open Access Journals (Sweden)
Sergio Orts-Escolano
2014-04-01
In this work, we present a multi-camera surveillance system based on the use of self-organizing neural networks to represent events on video. The system processes several tasks in parallel using GPUs (graphic processor units). It addresses multiple vision tasks at various levels, such as segmentation, representation or characterization, analysis and monitoring of the movement. These features allow the construction of a robust representation of the environment and the interpretation of the behavior of mobile agents in the scene. It is also necessary to integrate the vision module into a global system that operates in a complex environment by receiving images from multiple acquisition devices at video frequency. To offer relevant information to higher level systems and to monitor and make decisions in real time, it must meet a set of requirements, such as time constraints, high availability, robustness, high processing speed and re-configurability. We have built a system able to represent and analyze the motion in video acquired by a multi-camera network and to process multi-source data in parallel on a multi-GPU architecture.
Operational mesoscale atmospheric dispersion prediction using a parallel computing cluster
Indian Academy of Sciences (India)
C V Srinivas; R Venkatesan; N V Muralidharan; Someshwar Das; Hari Dass; P Eswara Kumar
2006-06-01
An operational atmospheric dispersion prediction system is implemented on a cluster supercomputer for Online Emergency Response at the Kalpakkam nuclear site. This numerical system constitutes a parallel version of a nested grid meso-scale meteorological model MM5 coupled to a random walk particle dispersion model FLEXPART. The system provides 48-hour forecast of the local weather and radioactive plume dispersion due to hypothetical airborne releases in a range of 100 km around the site. The parallel code was implemented on different cluster configurations like distributed and shared memory systems. A 16-node dual Xeon distributed memory gigabit ethernet cluster has been found sufficient for operational applications. The runtime of a triple nested domain MM5 is about 4 h for a 24 h forecast. The system had been operated continuously for a few months and results were ported on the IMSc home page. Initial and periodic boundary condition data for MM5 are provided by NCMRWF, New Delhi. An alternative source is found to be NCEP, USA. These two sources provide the input data to the operational models at different spatial and temporal resolutions using different assimilation methods. A comparative study on the results of forecast is presented using these two data sources for present operational use. Improvement is noticed in rainfall forecasts that used NCEP data, probably because of its high spatial and temporal resolution.
A survey of checkpointing algorithms for parallel and distributed computers
Indian Academy of Sciences (India)
S Kalaiselvi; V Rajaraman
2000-10-01
Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extends cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is, however, felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.
Multiscale Methods, Parallel Computation, and Neural Networks for Real-Time Computer Vision.
Battiti, Roberto
1990-01-01
This thesis presents new algorithms for low and intermediate level computer vision. The guiding ideas in the presented approach are those of hierarchical and adaptive processing, concurrent computation, and supervised learning. Processing of the visual data at different resolutions is used not only to reduce the amount of computation necessary to reach the fixed point, but also to produce a more accurate estimation of the desired parameters. The presented adaptive multiple scale technique is applied to the problem of motion field estimation. Different parts of the image are analyzed at a resolution that is chosen in order to minimize the error in the coefficients of the differential equations to be solved. Tests with video-acquired images show that velocity estimation is more accurate over a wide range of motion with respect to the homogeneous scheme. In some cases introduction of explicit discontinuities coupled to the continuous variables can be used to avoid propagation of visual information from areas corresponding to objects with different physical and/or kinematic properties. The human visual system uses concurrent computation in order to process the vast amount of visual data in "real-time." Although with different technological constraints, parallel computation can be used efficiently for computer vision. All the presented algorithms have been implemented on medium grain distributed memory multicomputers with a speed-up approximately proportional to the number of processors used. A simple two-dimensional domain decomposition assigns regions of the multiresolution pyramid to the different processors. The inter-processor communication needed during the solution process is proportional to the linear dimension of the assigned domain, so that efficiency is close to 100% if a large region is assigned to each processor. Finally, learning algorithms are shown to be a viable technique to engineer computer vision systems for different applications starting from
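The efficiency claim for the two-dimensional domain decomposition can be made concrete with a rough estimate. The symbols are assumptions introduced here, not taken from the thesis: each processor holds a square region of $n \times n$ points, spends $t_c$ per point on computation, exchanges one boundary layer on each of its four sides at a cost of $t_m$ per point, and does not overlap communication with computation.

```latex
E(n) \;=\; \frac{T_{\text{comp}}}{T_{\text{comp}} + T_{\text{comm}}}
     \;=\; \frac{t_c\, n^2}{t_c\, n^2 + 4\, t_m\, n}
     \;=\; \frac{1}{1 + \dfrac{4\, t_m}{t_c\, n}}
     \;\xrightarrow[\;n \to \infty\;]{}\; 1
```

Under these assumptions, computation grows with the area of the region while communication grows only with its linear dimension, so the overhead term $4 t_m / (t_c n)$ shrinks as the region assigned to each processor gets larger.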
Managing internode data communications for an uninitialized process in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E
2014-05-20
A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.
A control strategy for parallel hybrid electric vehicles based on extremum seeking
Dinçmen, Erkin; Aksun Güvenç, Bilin
2012-02-01
An energy management control strategy for a parallel hybrid electric vehicle based on the extremum-seeking method for splitting torque between the internal combustion engine and electric motor is proposed in this paper. The control strategy has two levels of operation: the upper and lower levels. The upper level decision-making controller chooses the vehicle operation mode such as the simultaneous use of the internal combustion engine and electric motor, use of only the electric motor, use of only the internal combustion engine, or regenerative braking. In the simultaneous use of the internal combustion engine and electric motor, the optimum energy distribution between these two sources of energy is determined via the extremum-seeking algorithm that searches for maximum drivetrain efficiency. A dynamic programming solution is also obtained and used to form a benchmark for performance evaluation of the proposed method based on extremum seeking. Detailed simulations using a realistic model are presented to illustrate the effectiveness of the methodology.
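To illustrate the extremum-seeking idea named in this abstract, here is a minimal sketch of the classic perturbation-based scheme, applied to a made-up efficiency map. The function `drivetrain_efficiency`, its peak at a torque split of 0.6, and all gains and frequencies are hypothetical; the paper's vehicle model and controller tuning are not given in the abstract.

```python
import math


def drivetrain_efficiency(split: float) -> float:
    """Hypothetical efficiency map with a single peak at split = 0.6."""
    return 0.9 - (split - 0.6) ** 2


def extremum_seek(f, u0=0.2, a=0.05, omega=10.0, k=10.0,
                  omega_h=2.0, dt=0.01, t_end=40.0):
    """Perturbation-based extremum seeking (maximization) with Euler steps.

    u0: initial estimate, a: dither amplitude, omega: dither frequency [rad/s],
    k: integrator gain, omega_h: cutoff used to remove the slow part of y.
    """
    u_hat = u0
    y_lp = f(u0)  # low-pass state, initialized to avoid a start-up transient
    for i in range(int(t_end / dt)):
        s = math.sin(omega * i * dt)       # dither signal
        y = f(u_hat + a * s)               # measured performance
        y_lp += dt * omega_h * (y - y_lp)  # track the slow component of y
        y_hp = y - y_lp                    # high-pass: keep the dither response
        u_hat += dt * k * y_hp * s         # demodulate and integrate (ascent)
    return u_hat


best_split = extremum_seek(drivetrain_efficiency)
print(f"estimated optimal split: {best_split:.3f}")
```

The scheme needs no explicit model of the map: the product of the filtered output and the dither is, on average, proportional to the local gradient, so the integrator drifts toward the peak.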
Institute of Scientific and Technical Information of China (English)
GU Yanchun; YIN Chengliang; ZHANG Jianwu
2007-01-01
In a parallel hybrid electrical vehicle (PHEV) equipped with automatic mechanical transmission (AMT), the driving smoothness and the clutch abrasion are the primary considerations for powertrain control during gearshift and clutch operation. To improve these performance indexes of the PHEV, a coordinated control system is proposed through analysis of the HEV powertrain dynamic characteristics. Using the method of minimum principle, the input torque of the transmission is optimized to improve the driving smoothness of the vehicle. Using the methods of fuzzy logic and fuzzy-PID, the engaging speed of the clutch and the throttle opening of the engine are manipulated to ensure the smoothness of clutch engagement and reduce the abrasion of clutch friction plates. The motor provides the difference between the required input torque of the transmission and the torque transmitted through the clutch plates. Results of simulation and experiments show that the proposed control strategy performs better than the contrastive control system: the smoothness of driving and the abrasion of the clutch can be improved simultaneously.
Hybrid simulation of a parallel collisionless shock in the Large Plasma Device
Weidl, M S; Jenko, F; Niemann, C
2016-01-01
We present two-dimensional hybrid kinetic/magnetohydrodynamic simulations of planned laser-ablation experiments in the Large Plasma Device (LAPD). Our results, based on parameters which have been validated in previous experiments, show that a parallel collisionless shock can begin forming within the available space. Carbon-debris ions that stream along the magnetic-field direction with a blow-off speed of four times the Alfven velocity excite strong magnetic fluctuations, eventually transferring part of their kinetic energy to the surrounding hydrogen ions. This acceleration and compression of the background plasma creates a shock front, which satisfies the Rankine-Hugoniot conditions and can therefore propagate on its own. Furthermore, we analyze the upstream turbulence and show that it is dominated by the right-hand resonant instability.
A new parallel-vector finite element analysis software on distributed-memory computers
Qin, Jiangning; Nguyen, Duc T.
1993-01-01
A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
Massively Parallel Computing at the Large Hadron Collider up to the HL-LHC
AUTHOR|(CDS)2080997; Halyo, Valerie
2015-01-01
As the Large Hadron Collider (LHC) continues its upward progression in energy and luminosity towards the planned High-Luminosity LHC (HL-LHC) in 2025, the challenges of the experiments in processing increasingly complex events will also continue to increase. Improvements in computing technologies and algorithms will be a key part of the advances necessary to meet this challenge. Parallel computing techniques, especially those using massively parallel computing (MPC), promise to be a significant part of this effort. In these proceedings, we discuss these algorithms in the specific context of a particularly important problem: the reconstruction of charged particle tracks in the trigger algorithms in an experiment, in which high computing performance is critical for executing the track reconstruction in the available time. We discuss some areas where parallel computing has already shown benefits to the LHC experiments, and also demonstrate how an MPC-based trigger at the CMS experiment could not only improve perf...
Virtual parallel computing and a search algorithm using matrix product states
Chamon, Claudio
2012-01-01
We propose a form of parallel computing on classical computers that is based on matrix product states. The virtual parallelization is accomplished by evolving all possible results for multiple inputs, with bits represented by matrices. The actions of classical probabilistic 1-bit and deterministic 2-bit gates such as NAND are implemented in terms of matrix operations and, as opposed to quantum computing, it is possible to copy bits. We present a way to exploit this method of computation to solve search problems and count the number of solutions. We argue that if the classical computational cost of testing solutions (witnesses) requires less than O(n^2) local two-bit gates acting on n bits, the search problem can be fully solved in subexponential time. Therefore, for this restricted type of search problem, the virtual parallelization scheme is faster than Grover's quantum algorithm.
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A.; Mamidala, Amith R.
2015-07-14
Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
Energy Technology Data Exchange (ETDEWEB)
Blocksome, Michael A.; Mamidala, Amith R.
2015-07-07
Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
A Parallel and Distributed Surrogate Model Implementation for Computational Steering
Butnaru, Daniel
2012-06-01
Understanding the influence of multiple parameters in a complex simulation setting is a difficult task. In the ideal case, the scientist can freely steer such a simulation and is immediately presented with the results for a certain configuration of the input parameters. Such an exploration process is however not possible if the simulation is computationally too expensive. For these cases we present in this paper a scalable computational steering approach utilizing a fast surrogate model as substitute for the time-consuming simulation. The surrogate model we propose is based on the sparse grid technique, and we identify the main computational tasks associated with its evaluation and its extension. We further show how distributed data management combined with the specific use of accelerators allows us to approximate and deliver simulation results to a high-resolution visualization system in real-time. This significantly enhances the steering workflow and facilitates the interactive exploration of large datasets. © 2012 IEEE.
Silberstein, M.; Tzemach, A.; Dovgolevsky, N.; Fishelson, M.; Schuster, A.; Geiger, D.
2006-01-01
Computation of LOD scores is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. However, computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. We present a distributed system called “SUPERLINK-ONLINE,” for the computation of multipoint LOD scores of large inbred pedigrees. It achieves high performance via the efficient parallelization of the ...
Performance Improvement of Threshold based Audio Steganography using Parallel Computation
Directory of Open Access Journals (Sweden)
Muhammad Shoaib
2016-10-01
Audio steganography is used to hide secret information inside an audio signal for the secure and reliable transfer of information. Various steganography techniques have been proposed and implemented to ensure an adequate security level. The existing techniques focus on either payload or security, but none of them ensures both at the same time. Data dependency in existing solutions forced the steganography mechanism to execute serially. The audio data and secret data were pre-processed, and existing techniques were experimentally tested in Matlab, which confirmed the problem of inefficient execution. The efficient least significant bit steganography scheme removes the pipelining hazard and computes the steganography in parallel on distributed memory systems. This scheme ensures security and addresses payload while providing an efficient solution. The results show that it not only ensures an adequate security level but also provides a better and more efficient solution.
Parallel computation for blood cell classification in medical hyperspectral imagery
Li, Wei; Wu, Lucheng; Qiu, Xianbo; Ran, Qiong; Xie, Xiaoming
2016-09-01
With the advantage of fine spectral resolution, hyperspectral imagery provides great potential for cell classification. This paper provides a promising classification system including the following three stages: (1) band selection for a subset of spectral bands with distinctive and informative features, (2) spectral-spatial feature extraction, such as local binary patterns (LBP), and (3) classification with an effective classifier. Moreover, each of these three steps is implemented on graphics processing units (GPUs), which makes the system real-time and more practical. The GPU parallel implementation is compared with the serial implementation on central processing units (CPUs). Experimental results based on real medical hyperspectral data demonstrate that the proposed system is able to offer high accuracy and fast speed, which are appealing for cell classification in medical hyperspectral imagery.
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
The skyline operator returns records in a dataset that provide optimal trade-offs of multiple dimensions. State-of-the-art skyline computation involves complex tree traversals, data-ordering, and conditional branching to minimize the number of point-to-point comparisons. Meanwhile, GPGPU computing...... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-efficiency and respectable throughput, rather than...... maximal throughput, to achieve orders of magnitude faster performance....
Bostic, Susan W.; Fulton, Robert E.
1987-01-01
Eigenvalue analysis of complex structures is a computationally intensive task which can benefit significantly from new and impending parallel computers. This study reports on a parallel computer implementation of the Lanczos method for free vibration analysis. The approach used here subdivides the major Lanczos calculation tasks into subtasks and introduces parallelism down to the subtask levels such as matrix decomposition and forward/backward substitution. The method was implemented on a commercial parallel computer and results were obtained for a long flexible space structure. While parallel computing efficiency is problem and computer dependent, the efficiency for the Lanczos method was good for a moderate number of processors for the test problem. The greatest reduction in time was realized for the decomposition of the stiffness matrix, a calculation which took 70 percent of the time in the sequential program and which took 25 percent of the time on eight processors. For a sample calculation of the twenty lowest frequencies of a 486 degree of freedom problem, the total sequential computing time was reduced by almost a factor of ten using 16 processors.
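The timing figures in this abstract can be read in terms of the usual speedup and parallel-efficiency measures. The sketch below assumes the standard definitions (speedup = serial time / parallel time, efficiency = speedup / processor count) and rounds "almost a factor of ten" up to 10 purely for illustration; the function names are hypothetical and not taken from the study.

```python
def speedup(t_serial, t_parallel):
    """Ratio of sequential run time to parallel run time."""
    return t_serial / t_parallel


def efficiency(s, n_procs):
    """Fraction of ideal (linear) speedup achieved on n_procs processors."""
    return s / n_procs


# Assumption: "almost a factor of ten" on 16 processors is taken as 10.0.
s16 = 10.0
e16 = efficiency(s16, 16)
print(f"speedup={s16:.1f}, efficiency={e16:.3f}")
```

Under that rounding, the 16-processor run would sit at roughly 60 percent of ideal linear scaling.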
APES-based procedure for super-resolution SAR imagery with GPU parallel computing
Jia, Weiwei; Xu, Xiaojian; Xu, Guangyao
2015-10-01
The amplitude and phase estimation (APES) algorithm is widely used in modern spectral analysis. Compared with the conventional fast Fourier transform (FFT), APES results in lower sidelobes and narrower spectral peaks. However, in synthetic aperture radar (SAR) imaging of large scenes, it is difficult to apply APES directly to super-resolution radar image processing without parallel computation because of its heavy computational load. In this paper, a procedure is proposed to achieve target extraction and parallel computing of APES for super-resolution SAR imaging. Numerical experiments are carried out on a Tesla K40C with a 745 MHz GPU clock rate and 2880 CUDA cores. Results for SAR images with GPU parallel computing show that the parallel APES is remarkably more efficient than the CPU-based implementation at the same super-resolution.
Malkov, Ewgenij A.; Poleshkin, Sergey O.; Kudryavtsev, Alexey N.; Shershnev, Anton A.
2016-10-01
The paper presents the software implementation of the Boltzmann equation solver based on the deterministic finite-difference method. The solver allows one to carry out parallel computations of rarefied flows on a hybrid computational cluster with an arbitrary number of central processor units (CPUs) and graphical processor units (GPUs). Employment of GPUs leads to a significant acceleration of the computations, which enables us to simulate two-dimensional flows with high resolution in a reasonable time. The developed numerical code was validated by comparing the obtained solutions with the Direct Simulation Monte Carlo (DSMC) data. For this purpose the supersonic flow past a flat plate at zero angle of attack is used as a test case.
Multigrid Methods on Parallel Computers: A Survey on Recent Developments
1990-12-01
multi-color (red-black, four color etc.) ordering of the grid points. Clearly, computation of defects, interpolation and restriction can be also...73716 72555 .984 85750 82919 95800 85206 .889 113086 97406 16406 16383 .999 22042 21845 23024 21853 .949 31668 29143 Table 6: Evaluated time
Directory of Open Access Journals (Sweden)
Thamilselvan Rakkiannan
2012-01-01
Problem statement: The Job Shop Scheduling Problem (JSSP) is regarded as one of the most difficult NP-hard combinatorial problems. The problem consists of determining the most efficient schedule for jobs that are processed on several machines. Approach: In this study a Genetic Algorithm (GA) integrated with a parallel version of the Simulated Annealing algorithm (SA) is applied to the job shop scheduling problem. The proposed algorithm is implemented in a distributed environment using the Remote Method Invocation concept. A new genetic operator and a parallel simulated annealing algorithm are developed for solving job shop scheduling. Results: The implementation was carried out successfully to examine the convergence and effectiveness of the proposed hybrid algorithm. The JSS problems were tested with well-known benchmark problems, which are used to measure the quality of the proposed system. Conclusion/Recommendations: The empirical results show that the proposed genetic algorithm with simulated annealing achieves better solutions than the individual genetic or simulated annealing algorithms.
An implementation of hybrid parallel CUDA code for the hyperonic nuclear forces
Nemura, Hidekatsu
2016-01-01
We present our recent effort to develop a GPGPU program to calculate 52 channels of the Nambu-Bethe-Salpeter (NBS) wave functions in order to study the baryon interactions, from nucleon-nucleon to $\Xi-\Xi$, from lattice QCD. We adopt CUDA programming to perform multi-GPU execution within a hybrid parallel program using MPI and OpenMP. The effective baryon block algorithm, which efficiently calculates a large number of NBS wave functions at a time, is briefly outlined, and three CUDA kernel programs are implemented to realize the effective baryon block algorithm using GPUs in the single-program multiple-data (SPMD) programming model. In order to parallelize over multiple GPUs, we take two approaches: dividing the time dimension and dividing the spatial dimensions. Performance is measured using the HA-PACS supercomputer at the University of Tsukuba, which includes NVIDIA M2090 and NVIDIA K20X GPUs. Strong scaling and weak scaling measured by using both M2090 and K20X GPUs are presented. We find distinct dif...
A domain decomposition study of massively parallel computing in compressible gas dynamics
Energy Technology Data Exchange (ETDEWEB)
Wong, C.C.; Blottner, F.G.; Payne, J.L. [Sandia National Labs., Albuquerque, NM (United States); Soetrisno, M. [Amtec Engineering, Inc., Bellevue, WA (United States)
1995-01-01
The appropriate utilization of massively parallel computers for solving the Navier-Stokes equations is investigated and determined from an engineering perspective. The issues investigated are: (1) Should strip or patch domain decomposition of the spatial mesh be used to reduce computer time? (2) How many computer nodes should be used for a problem with a given sized mesh to reduce computer time? (3) Is the convergence of the Navier-Stokes solution procedure (LU-SGS) adversely influenced by the domain decomposition approach? The results of the paper show that the present Navier-Stokes solution technique has good performance on a massively parallel computer for transient flow problems. For steady-state problems with a large number of mesh cells, the solution procedure will require significant computer time due to an increased number of iterations to achieve a converged solution. There is an optimum number of computer nodes to use for a problem with a given global mesh size.
Kaliman, Ilya A; Slipchenko, Lyudmila V
2015-01-15
A new hybrid MPI/OpenMP parallelization scheme is introduced for the Effective Fragment Potential (EFP) method implemented in the libefp software library. The new implementation employs a dynamic load-balancing algorithm that uses a master/slave model. The software shows excellent parallel scaling up to several hundred CPU cores across multiple nodes. The code uses functions only from the well-established MPI-1 standard, which simplifies portability of the library. This new parallel EFP implementation greatly expands the applicability of the EFP and QM/EFP methods by extending attainable time- and length-scales.
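The dynamic load-balancing idea mentioned in this abstract can be illustrated with a small sketch. This is not libefp code and does not use MPI: it assumes Python threads as stand-ins for worker processes and a shared queue in the coordinating role, and the name `run_dynamic` is hypothetical. The point it shows is that each worker pulls its next task only when it becomes idle, so uneven task costs even out without a fixed up-front partition.

```python
import queue
import threading


def run_dynamic(tasks, n_workers, work_fn):
    """Distribute tasks dynamically: idle workers fetch the next pending task."""
    todo = queue.Queue()
    for index, task in enumerate(tasks):
        todo.put((index, task))

    results = [None] * len(tasks)
    counts = [0] * n_workers  # how many tasks each worker ended up processing

    def worker(worker_id):
        while True:
            try:
                index, task = todo.get_nowait()
            except queue.Empty:
                return  # no work left, worker terminates
            results[index] = work_fn(task)
            counts[worker_id] += 1

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results, counts


results, counts = run_dynamic(list(range(20)), 4, lambda x: x * x)
print(results[:5], sum(counts))
```

The per-worker counts generally differ from run to run, which is the expected behaviour of a pull-based scheme.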
Hybrid Algorithm for Optimal Load Sharing in Grid Computing
Directory of Open Access Journals (Sweden)
A. Krishnan
2012-01-01
Problem statement: Grid computing is a fast-growing industry that shares an organization's resources effectively. Resource sharing requires a well-optimized algorithmic structure; otherwise, waiting time and response time increase and resource utilization falls. Approach: To avoid such a reduction in grid system performance, an optimal resource sharing algorithm is required. Many load sharing techniques have been proposed recently; they are feasible, but many critical issues remain in these algorithms. Results: In this study a hybrid algorithm for optimization of load sharing is proposed. The hybrid algorithm contains two components, a Hash Table (HT) and a Distributed Hash Table (DHT). Conclusion: The results of the proposed study show that the hybrid algorithm optimizes the task better than existing systems.
Use of a hybrid computer in engineering-seismology research
Park, R.B.; Hays, W.W.
1977-01-01
A hybrid computer is an important tool in the seismological research conducted by the U.S. Geological Survey in support of the Energy Research and Development Administration nuclear explosion testing program at the Nevada Test Site and the U.S. Geological Survey Earthquake Hazard Reduction Program. The hybrid computer system, which employs both digital and analog computational techniques, facilitates efficient seismic data processing. Standard data processing operations include: (1) preview of dubbed magnetic tapes of data; (2) correction of data for instrument response; (3) derivation of displacement and acceleration time histories from velocity recordings; (4) extraction of peak-amplitude data; (5) digitization of time histories; (6) rotation of instrumental axes; (7) derivation of response spectra; and (8) derivation of relative transfer functions between recording sites. Catalogs of time histories and response spectra of ground motion from nuclear explosions and earthquakes that have been processed by the hybrid computer are used in the Earthquake Hazard Research Program to evaluate the effects of source, propagation path, and site effects on recorded ground motion; to assess seismic risk; to predict system response; and to solve system design problems.
Lilith: A scalable secure tool for massively parallel distributed computing
Energy Technology Data Exchange (ETDEWEB)
Armstrong, R.C.; Camp, L.J.; Evensky, D.A.; Gentile, A.C.
1997-06-01
Changes in high performance computing have necessitated the ability to utilize and interrogate potentially many thousands of processors. The ASCI (Advanced Strategic Computing Initiative) program conducted by the United States Department of Energy, for example, envisions thousands of distinct operating systems connected by low-latency gigabit-per-second networks. In addition multiple systems of this kind will be linked via high-capacity networks with latencies as low as the speed of light will allow. Code which spans systems of this sort must be scalable; yet constructing such code whether for applications, debugging, or maintenance is an unsolved problem. Lilith is a research software platform that attempts to answer these questions with an end toward meeting these needs. Presently, Lilith exists as a test-bed, written in Java, for various spanning algorithms and security schemes. The test-bed software has, and enforces, hooks allowing implementation and testing of various security schemes.
Sixty Years of Parallel Computing
Institute of Scientific and Technical Information of China (English)
杨学军
2012-01-01
Parallel computing is the main technology to implement high performance computing. This paper reviews the history of parallel computing over the past 60 years, and reaffirms the fact that the measurement equations for parallel scalability have played an important role in the development of parallel computing. Based on our analysis of challenges in exascale computing in the future, a new scalability measurement model for parallel computing has been built, in which factors affecting performance of exascale computing are considered, including memory access, communication, reliability and power consumption. Through quantitative analysis, it has been found that there are some scalability "walls" during the development when parallel computing advances to higher performance. Finally, in consideration of the conditions of our country, the author proposes some suggestions for the development of high performance computing in our country.
COMPTEL skymapping: a new approach using parallel computing
Strong, A.W.; Bloemen, H.; Diehl, R.; Hermsen, W.; Schoenfelder, V.
1998-01-01
Large-scale skymapping with COMPTEL using the full survey database presents challenging problems on account of the complex response and time-variable background. A new approach which attempts to address some of these problems is described, in which the information about each observation is preserved throughout the analysis. In this method, a maximum-entropy algorithm is used to determine image and background simultaneously. Because of the extreme computing requirements, the method has been im...
Parallel Radiosity Techniques for Mesh-Connected SIMD Computers
1991-07-01
of equations Ax = b, one can find corresponding stages in the Gauss-Seidel method. The form factor calculation stage corresponds to the computation...to be planar F_ii = 0 for all i), iterative techniques such as the Gauss-Seidel method fare much better for this system. In the progressive refinement...this light, the solution of the radiosity system of equations using the Gauss-Seidel method is a sequential one, at least at a macro level. However
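The remark that Gauss-Seidel is sequential "at a macro level" can be seen in a minimal sketch of the iteration for a generic system Ax = b. This is a plain textbook version, not the report's SIMD implementation; the function name and the small test matrix are illustrative assumptions. Each update of x[i] reads entries of x that were already overwritten in the same sweep, which is the data dependence that makes the sweep order matter.

```python
def gauss_seidel(A, b, sweeps=50):
    """Solve Ax = b iteratively; assumes A has a nonzero, dominant diagonal."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Uses x[j] for j < i from THIS sweep and x[j] for j > i from the last one.
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diag) / A[i][i]
    return x


# Small diagonally dominant example (hypothetical values).
A = [[4.0, 1.0],
     [2.0, 3.0]]
b = [1.0, 2.0]
x = gauss_seidel(A, b)
print(x)
```

For this example the exact solution is x = (0.1, 0.6), which the iteration approaches quickly because the matrix is diagonally dominant.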
Parallel computer processing and modeling: applications for the ICU
Baxter, Grant; Pranger, L. Alex; Draghic, Nicole; Sims, Nathaniel M.; Wiesmann, William P.
2003-07-01
Current patient monitoring procedures in hospital intensive care units (ICUs) generate vast quantities of medical data, much of which is considered extemporaneous and not evaluated. Although sophisticated monitors to analyze individual types of patient data are routinely used in the hospital setting, this equipment lacks high order signal analysis tools for detecting long-term trends and correlations between different signals within a patient data set. Without the ability to continuously analyze disjoint sets of patient data, it is difficult to detect slow-forming complications. As a result, the early onset of conditions such as pneumonia or sepsis may not be apparent until the advanced stages. We report here on the development of a distributed software architecture test bed and software medical models to analyze both asynchronous and continuous patient data in real time. Hardware and software have been developed to support a multi-node distributed computer cluster capable of amassing data from multiple patient monitors and projecting near and long-term outcomes based upon the application of physiologic models to the incoming patient data stream. One computer acts as a central coordinating node; additional computers accommodate processing needs. A simple, non-clinical model for sepsis detection was implemented on the system for demonstration purposes. This work shows exceptional promise as a highly effective means to rapidly predict and thereby mitigate the effect of nosocomial infections.
Parallel Genetic Algorithms with Dynamic Topology using Cluster Computing
Directory of Open Access Journals (Sweden)
ADAR, N.
2016-08-01
A parallel genetic algorithm (PGA) conducts a distributed meta-heuristic search by employing genetic algorithms on more than one subpopulation simultaneously. PGAs migrate a number of individuals between subpopulations over generations. The layout that facilitates the interactions of the subpopulations is called the topology. Static migration topologies have been widely incorporated into PGAs. In this article, a PGA with a dynamic migration topology (D-PGA) is proposed. D-PGA generates a new migration topology in every epoch based on the average fitness values of the subpopulations. The D-PGA has been tested against ring and fully connected migration topologies in a Beowulf Cluster. The D-PGA has outperformed the ring migration topology with comparable communication cost and has provided competitive or better results than a fully connected migration topology with significantly lower communication cost. PGA convergence behaviors have been analyzed in terms of the diversities within and between subpopulations. Conventional diversity can be considered as the diversity within a subpopulation. A new concept of permeability has been introduced to measure the diversity between subpopulations. It is shown that the success of the proposed D-PGA can be attributed to maintaining a high level of permeability while preserving diversity within subpopulations.
CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design.
Zhang, Shaoqiang; Chen, Yong
2016-01-01
A set of conserved binding sites recognized by a transcription factor is called a motif, which can be found by many applications of comparative genomics for identifying over-represented segments. Moreover, when numerous putative motifs are predicted from a collection of genome-wide data, their similarity data can be represented as a large graph, where these motifs are connected to one another. However, an efficient clustering algorithm is desired for clustering the motifs that belong to the same groups and separating the motifs that belong to different groups, or even deleting an amount of spurious ones. In this work, a new motif clustering algorithm, CLIMP, is proposed by using maximal cliques and sped up by parallelizing its program. When a synthetic motif dataset from the database JASPAR, a set of putative motifs from a phylogenetic foot-printing dataset, and a set of putative motifs from a ChIP dataset are used to compare the performances of CLIMP and two other high-performance algorithms, the results demonstrate that CLIMP mostly outperforms the two algorithms on the three datasets for motif clustering, so that it can be a useful complement of the clustering procedures in some genome-wide motif prediction pipelines. CLIMP is available at http://sqzhang.cn/climp.html.
Fazanaro, Filipe I.; Soriano, Diogo C.; Suyama, Ricardo; Madrid, Marconi K.; Oliveira, José Raimundo de; Muñoz, Ignacio Bravo; Attux, Romis
2016-08-01
The characterization of nonlinear dynamical systems and their attractors in terms of invariant measures, basins of attractions and the structure of their vector fields usually outlines a task strongly related to the underlying computational cost. In this work, the practical aspects related to the use of parallel computing - especially the use of Graphics Processing Units (GPUs) and of the Compute Unified Device Architecture (CUDA) - are reviewed and discussed in the context of nonlinear dynamical systems characterization. In this work such characterization is performed by obtaining both local and global Lyapunov exponents for the classical forced Duffing oscillator. The local divergence measure was employed in the computation of the Lagrangian Coherent Structures (LCSs), revealing the general organization of the flow according to the obtained separatrices, while the global Lyapunov exponents were used to characterize the attractors obtained under one or more bifurcation parameters. These simulation sets also illustrate the required computation time and speedup gains provided by different parallel computing strategies, justifying the employment and the relevance of GPUs and CUDA in such an extensive numerical approach. Finally, more than simply providing an overview supported by a representative set of simulations, this work also aims to be a unified introduction to the use of the mentioned parallel computing tools in the context of nonlinear dynamical systems, providing codes and examples to be executed in MATLAB and using the CUDA environment, something that is usually fragmented in different scientific communities and restricted to specialists on parallel computing strategies.
Implementation of QR up- and downdating on a massively parallel computer
DEFF Research Database (Denmark)
Bendtsen, Claus; Hansen, Per Christian; Madsen, Kaj;
1995-01-01
We describe an implementation of QR up- and downdating on a massively parallel computer (the Connection Machine CM-200) and show that the algorithm maps well onto the computer. In particular, we show how the use of corrected semi-normal equations for downdating can be efficiently implemented. We...
Decker, K. M.; Jayewardena, C.; Rehmann, R.
We describe the library lgtlib, and lgttool, the corresponding development environment for Monte Carlo simulations of lattice gauge theory on multiprocessor vector computers with shared memory. We explain why distributed memory parallel processor (DMPP) architectures are particularly appealing for compute-intensive scientific applications, and introduce the design of a general application and program development environment system for scientific applications on DMPP architectures.
Improving efficiency of a global barrier operation in a parallel computer
Energy Technology Data Exchange (ETDEWEB)
None
2016-10-04
Performing a global barrier operation in a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier.
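The two-level scheme described in this abstract can be sketched with threads standing in for tasks. This is only an illustration under stated assumptions, not the patented implementation: the `Node` class, the `barrier` function and the arrival counter are hypothetical, and Python's `threading.Barrier` plays the role of both the local and the global barrier. Non-master tasks join their node's local barrier; each master waits until all other tasks on its node have arrived, joins the global barrier, and only then joins the local barrier, which releases its node.

```python
import threading


class Node:
    """Per-node state: one local barrier plus a count of arrived non-master tasks."""

    def __init__(self, n_tasks):
        self.n_tasks = n_tasks
        self.local = threading.Barrier(n_tasks)
        self.arrived = 0
        self.cond = threading.Condition()


def barrier(node, is_master, global_barrier):
    if not is_master:
        with node.cond:
            node.arrived += 1
            node.cond.notify_all()
        node.local.wait()  # blocks until the master also joins
    else:
        with node.cond:
            node.cond.wait_for(lambda: node.arrived == node.n_tasks - 1)
            node.arrived = 0  # reset so the barrier can be reused
        global_barrier.wait()  # one participant per node
        node.local.wait()  # releases the other tasks on this node


def demo(n_nodes=3, tasks_per_node=4):
    global_barrier = threading.Barrier(n_nodes)
    nodes = [Node(tasks_per_node) for _ in range(n_nodes)]
    log, lock = [], threading.Lock()

    def task(node, is_master):
        with lock:
            log.append("before")
        barrier(node, is_master, global_barrier)
        with lock:
            log.append("after")

    threads = [threading.Thread(target=task, args=(node, t == 0))
               for node in nodes for t in range(tasks_per_node)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log


print(demo())
```

In this sketch only one task per node takes part in the global barrier, so the number of global participants equals the number of nodes rather than the total number of tasks.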
Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones
Directory of Open Access Journals (Sweden)
Norman Matloff
2016-07-01
The growth in the use of computationally intensive statistical procedures, especially with big data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPUs, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations.