Protocol-Based Verification of Message-Passing Parallel Programs
López-Acosta, Hugo-Andrés; Eduardo R. B. Marques, Eduardo R. B.; Martins, Francisco
2015-01-01
We present ParTypes, a type-based methodology for the verification of Message Passing Interface (MPI) programs written in the C programming language. The aim is to statically verify programs against protocol specifications, enforcing properties such as fidelity and absence of deadlocks. We develo...
Message passing with parallel queue traversal
Underwood, Keith D [Albuquerque, NM; Brightwell, Ronald B [Albuquerque, NM; Hemmert, K Scott [Albuquerque, NM
2012-05-01
In message passing implementations, associative matching structures are used to permit list entries to be searched in parallel fashion, thereby avoiding the delay of linear list traversal. List management capabilities are provided to support list entry turnover semantics and priority ordering semantics.
Parallelization of a hydrological model using the message passing interface
Wu, Yiping; Li, Tiejian; Sun, Liqun; Chen, Ji
2013-01-01
With the increasing knowledge about the natural processes, hydrological models such as the Soil and Water Assessment Tool (SWAT) are becoming larger and more complex with increasing computation time. Additionally, other procedures such as model calibration, which may require thousands of model iterations, can increase running time and thus further reduce rapid modeling and analysis. Using the widely-applied SWAT as an example, this study demonstrates how to parallelize a serial hydrological model in a Windows® environment using a parallel programing technology—Message Passing Interface (MPI). With a case study, we derived the optimal values for the two parameters (the number of processes and the corresponding percentage of work to be distributed to the master process) of the parallel SWAT (P-SWAT) on an ordinary personal computer and a work station. Our study indicates that model execution time can be reduced by 42%–70% (or a speedup of 1.74–3.36) using multiple processes (two to five) with a proper task-distribution scheme (between the master and slave processes). Although the computation time cost becomes lower with an increasing number of processes (from two to five), this enhancement becomes less due to the accompanied increase in demand for message passing procedures between the master and all slave processes. Our case study demonstrates that the P-SWAT with a five-process run may reach the maximum speedup, and the performance can be quite stable (fairly independent of a project size). Overall, the P-SWAT can help reduce the computation time substantially for an individual model run, manual and automatic calibration procedures, and optimization of best management practices. In particular, the parallelization method we used and the scheme for deriving the optimal parameters in this study can be valuable and easily applied to other hydrological or environmental models.
Future-based Static Analysis of Message Passing Programs
Wytse Oortwijn
2016-06-01
Full Text Available Message passing is widely used in industry to develop programs consisting of several distributed communicating components. Developing functionally correct message passing software is very challenging due to the concurrent nature of message exchanges. Nonetheless, many safety-critical applications rely on the message passing paradigm, including air traffic control systems and emergency services, which makes proving their correctness crucial. We focus on the modular verification of MPI programs by statically verifying concrete Java code. We use separation logic to reason about local correctness and define abstractions of the communication protocol in the process algebra used by mCRL2. We call these abstractions futures as they predict how components will interact during program execution. We establish a provable link between futures and program code and analyse the abstract futures via model checking to prove global correctness. Finally, we verify a leader election protocol to demonstrate our approach.
Algorithms for parallel flow solvers on message passing architectures
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those
Stampi: a message passing library for distributed parallel computing. User's guide
Imamura, Toshiyuki; Koide, Hiroshi; Takemiya, Hiroshi
1998-11-01
A new message passing library, Stampi, has been developed to realize a computation with different kind of parallel computers arbitrarily and making MPI (Message Passing Interface) as an unique interface for communication. Stampi is based on MPI2 specification. It realizes dynamic process creation to different machines and communication between spawned one within the scope of MPI semantics. Vender implemented MPI as a closed system in one parallel machine and did not support both functions; process creation and communication to external machines. Stampi supports both functions and enables us distributed parallel computing. Currently Stampi has been implemented on COMPACS (COMplex PArallel Computer System) introduced in CCSE, five parallel computers and one graphic workstation, and any communication on them can be processed on. (author)
The specification of Stampi, a message passing library for distributed parallel computing
Imamura, Toshiyuki; Takemiya, Hiroshi; Koide, Hiroshi
2000-03-01
At CCSE, Center for Promotion of Computational Science and Engineering, a new message passing library for heterogeneous and distributed parallel computing has been developed, and it is called as Stampi. Stampi enables us to communicate between any combination of parallel computers as well as workstations. Currently, a Stampi system is constructed from Stampi library and Stampi/Java. It provides functions to connect a Stampi application with not only those on COMPACS, COMplex Parallel Computer System, but also applets which work on WWW browsers. This report summarizes the specifications of Stampi and details the development of its system. (author)
Deng Li; Xie Zhongsheng
1999-01-01
The coupled neutron and photon transport Monte Carlo code MCNP (version 3B) has been parallelized in parallel virtual machine (PVM) and message passing interface (MPI) by modifying a previous serial code. The new code has been verified by solving sample problems. The speedup increases linearly with the number of processors and the average efficiency is up to 99% for 12-processor. (author)
Stampi: a message passing library for distributed parallel computing. User's guide, second edition
Imamura, Toshiyuki; Koide, Hiroshi; Takemiya, Hiroshi
2000-02-01
A new message passing library, Stampi, has been developed to realize a computation with different kind of parallel computers arbitrarily and making MPI (Message Passing Interface) as an unique interface for communication. Stampi is based on the MPI2 specification, and it realizes dynamic process creation to different machines and communication between spawned one within the scope of MPI semantics. Main features of Stampi are summarized as follows: (i) an automatic switch function between external- and internal communications, (ii) a message routing/relaying with a routing module, (iii) a dynamic process creation, (iv) a support of two types of connection, Master/Slave and Client/Server, (v) a support of a communication with Java applets. Indeed vendors implemented MPI libraries as a closed system in one parallel machine or their systems, and did not support both functions; process creation and communication to external machines. Stampi supports both functions and enables us distributed parallel computing. Currently Stampi has been implemented on COMPACS (COMplex PArallel Computer System) introduced in CCSE, five parallel computers and one graphic workstation, moreover on eight kinds of parallel machines, totally fourteen systems. Stampi provides us MPI communication functionality on them. This report describes mainly the usage of Stampi. (author)
Message Passing Framework for Globally Interconnected Clusters
Hafeez, M; Riaz, N; Asghar, S; Malik, U A; Rehman, A
2011-01-01
In prevailing technology trends it is apparent that the network requirements and technologies will advance in future. Therefore the need of High Performance Computing (HPC) based implementation for interconnecting clusters is comprehensible for scalability of clusters. Grid computing provides global infrastructure of interconnecting clusters consisting of dispersed computing resources over Internet. On the other hand the leading model for HPC programming is Message Passing Interface (MPI). As compared to Grid computing, MPI is better suited for solving most of the complex computational problems. MPI itself is restricted to a single cluster. It does not support message passing over the internet to use the computing resources of different clusters in an optimal way. We propose a model that provides message passing capabilities between parallel applications over the internet. The proposed model is based on Architecture for Java Universal Message Passing (A-JUMP) framework and Enterprise Service Bus (ESB) named as High Performance Computing Bus. The HPC Bus is built using ActiveMQ. HPC Bus is responsible for communication and message passing in an asynchronous manner. Asynchronous mode of communication offers an assurance for message delivery as well as a fault tolerance mechanism for message passing. The idea presented in this paper effectively utilizes wide-area intercluster networks. It also provides scheduling, dynamic resource discovery and allocation, and sub-clustering of resources for different jobs. Performance analysis and comparison study of the proposed framework with P2P-MPI are also presented in this paper.
The design of multi-core DSP parallel model based on message passing and multi-level pipeline
Niu, Jingyu; Hu, Jian; He, Wenjing; Meng, Fanrong; Li, Chuanrong
2017-10-01
Currently, the design of embedded signal processing system is often based on a specific application, but this idea is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on multi-core DSP platform is designed, and it is mainly suitable for the complex algorithms which are composed of different modules. This model combines the ideas of multi-level pipeline parallelism and message passing, and summarizes the advantages of the mainstream model of multi-core DSP (the Master-Slave model and the Data Flow model), so that it has better performance. This paper uses three-dimensional image generation algorithm to validate the efficiency of the proposed model by comparing with the effectiveness of the Master-Slave and the Data Flow model.
On the adequacy of message-passing parallel supercomputers for solving neutron transport problems
Azmy, Y.Y.
1990-01-01
A coarse-grained, static-scheduling parallelization of the standard iterative scheme used for solving the discrete-ordinates approximation of the neutron transport equation is described. The parallel algorithm is based on a decomposition of the angular domain along the discrete ordinates, thus naturally producing a set of completely uncoupled systems of equations in each iteration. Implementation of the parallel code on Intcl's iPSC/2 hypercube, and solutions to test problems are presented as evidence of the high speedup and efficiency of the parallel code. The performance of the parallel code on the iPSC/2 is analyzed, and a model for the CPU time as a function of the problem size (order of angular quadrature) and the number of participating processors is developed and validated against measured CPU times. The performance model is used to speculate on the potential of massively parallel computers for significantly speeding up real-life transport calculations at acceptable efficiencies. We conclude that parallel computers with a few hundred processors are capable of producing large speedups at very high efficiencies in very large three-dimensional problems. 10 refs., 8 figs
Solchenbach, K.; Thole, C.A.; Trottenberg, U.
1987-01-01
For a wide class of problems in scientific computing, in particular for partial differential equations, the multigrid principle has proved to yield highly efficient numerical methods. However, the principle has to be applied carefully: if the multigrid components are not chosen adequately with respect to the given problem, the efficiency may be much smaller than possible. This has been demonstrated for many practical problems. Unfortunately, the general theories on multigrid convergence do not give much help in constructing really efficient multigrid algorithms. Although some progress has been made in bridging the gap between theory and practice during the last few years, there are still several theoretical approaches which are misleading rather than helpful with respect to the objective of real efficiency. The research in finding highly efficient algorithms for non-model applications therefore is still a sophisticated mixture of theoretical considerations, a transfer of experiences from model to real life problems and systematical experimental work. The emphasis of the practical research activity today lies - among others - in the following fields: - finding efficient multigrid components for really complex problems, - combining the multigrid approach with advanced discretizative techniques: - constructing highly parallel multigrid algorithms. In this paper, we want to deal mainly with the last topic
Li, J; Guo, L-X; Zeng, H; Han, X-B
2009-06-01
A message-passing-interface (MPI)-based parallel finite-difference time-domain (FDTD) algorithm for the electromagnetic scattering from a 1-D randomly rough sea surface is presented. The uniaxial perfectly matched layer (UPML) medium is adopted for truncation of FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different processors is illustrated for one sea surface realization, and the computation time of the parallel FDTD algorithm is dramatically reduced compared to a single-process implementation. Finally, some numerical results are shown, including the backscattering characteristics of sea surface for different polarization and the bistatic scattering from a sea surface with large incident angle and large wind speed.
A real-time MPEG software decoder using a portable message-passing library
Kwong, Man Kam; Tang, P.T. Peter; Lin, Biquan
1995-12-31
We present a real-time MPEG software decoder that uses message-passing libraries such as MPL, p4 and MPI. The parallel MPEG decoder currently runs on the IBM SP system but can be easil ported to other parallel machines. This paper discusses our parallel MPEG decoding algorithm as well as the parallel programming environment under which it uses. Several technical issues are discussed, including balancing of decoding speed, memory limitation, 1/0 capacities, and optimization of MPEG decoding components. This project shows that a real-time portable software MPEG decoder is feasible in a general-purpose parallel machine.
Message passing for quantified Boolean formulas
Zhang, Pan; Ramezanpour, Abolfazl; Zecchina, Riccardo; Zdeborová, Lenka
2012-01-01
We introduce two types of message passing algorithms for quantified Boolean formulas (QBF). The first type is a message passing based heuristics that can prove unsatisfiability of the QBF by assigning the universal variables in such a way that the remaining formula is unsatisfiable. In the second type, we use message passing to guide branching heuristics of a Davis–Putnam–Logemann–Loveland (DPLL) complete solver. Numerical experiments show that on random QBFs our branching heuristics give robust exponential efficiency gain with respect to state-of-the-art solvers. We also manage to solve some previously unsolved benchmarks from the QBFLIB library. Apart from this, our study sheds light on using message passing in small systems and as subroutines in complete solvers
Blind sensor calibration using approximate message passing
Schülke, Christophe; Caltagirone, Francesco; Zdeborová, Lenka
2015-01-01
The ubiquity of approximately sparse data has led a variety of communities to take great interest in compressed sensing algorithms. Although these are very successful and well understood for linear measurements with additive noise, applying them to real data can be problematic if imperfect sensing devices introduce deviations from this ideal signal acquisition process, caused by sensor decalibration or failure. We propose a message passing algorithm called calibration approximate message passing (Cal-AMP) that can treat a variety of such sensor-induced imperfections. In addition to deriving the general form of the algorithm, we numerically investigate two particular settings. In the first, a fraction of the sensors is faulty, giving readings unrelated to the signal. In the second, sensors are decalibrated and each one introduces a different multiplicative gain to the measurements. Cal-AMP shares the scalability of approximate message passing, allowing us to treat large sized instances of these problems, and experimentally exhibits a phase transition between domains of success and failure. (paper)
Message-Passing Receivers for Single Carrier Systems with Frequency-Domain Equalization
Zhang, Chuanzong; Manchón, Carles Navarro; Wang, Zhongyong
2015-01-01
In this letter, we design iterative receiver algorithms for joint frequency-domain equalization and decoding in a single carrier system assuming perfect channel state information. Based on an approximate inference framework that combines belief propagation (BP) and the mean field (MF) approximation......, we propose two receiver algorithms with, respectively, parallel and sequential message-passing schedules in the MF part. A recently proposed receiver based on generalized approximate message passing (GAMP) is used as a benchmarking reference. The simulation results show that the BP-MF receiver...
High performance message passing for the ATLAS DAQ/EF-1 project
Mornacchi, Giuseppe
1999-01-01
Summary form only. A message passing library has been developed in the context of the ATLAS DAQ/EF-1 project. It is used for time critical applications within the front-end part of the DAQ system, mainly to exchange data control messages between I/O processors. Key objectives of the design were low message overheads, efficient use of the data transfer buses, provision of broadcast functionality and a hardware and operating system independent implementation of the application interface. The design and implementation of the message passing library are presented. As required by the project, the implementation is based on commercial components, namely VMEbus, PCI, the Lynx-OS real-time operating system and an additional inter- processor link, PVIC. The latter offers broadcast functionality identified as being important to the overall performance of the message passing. In addition, performance benchmarks for all implementing buses are presented for both simple test programs and the full DAQ applications. (0 refs)...
Message passing vs. shared address space on a cluster of SMPs
Shan, Hongzhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak
2001-01-01
The emergence of scalable computer architectures using clusters of PCs or PC-SMPs with commodity networking has made them attractive platforms for high-end scientific computing. Currently, message passing (MP) and shared address space (SAS) are the two leading programming paradigms for these systems. MP has been standardized with MPI, and is the most common and mature parallel programming approach. However, MP code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, they compare the performance of and programming effort required for six applications under both programming models on a 32-CPU PC-SMP cluster. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of the applications; however, on certain classes of problems, SAS performance is competitive with MPI
The serial message-passing schedule for LDPC decoding algorithms
Liu, Mingshan; Liu, Shanshan; Zhou, Yuan; Jiang, Xue
2015-12-01
The conventional message-passing schedule for LDPC decoding algorithms is the so-called flooding schedule. It has the disadvantage that the updated messages cannot be used until next iteration, thus reducing the convergence speed . In this case, the Layered Decoding algorithm (LBP) based on serial message-passing schedule is proposed. In this paper the decoding principle of LBP algorithm is briefly introduced, and then proposed its two improved algorithms, the grouped serial decoding algorithm (Grouped LBP) and the semi-serial decoding algorithm .They can improve LBP algorithm's decoding speed while maintaining a good decoding performance.
EXIT Chart Analysis of Binary Message-Passing Decoders
Lechner, Gottfried; Pedersen, Troels; Kramer, Gerhard
2007-01-01
Binary message-passing decoders for LDPC codes are analyzed using EXIT charts. For the analysis, the variable node decoder performs all computations in the L-value domain. For the special case of a hard decision channel, this leads to the well know Gallager B algorithm, while the analysis can...... be extended to channels with larger output alphabets. By increasing the output alphabet from hard decisions to four symbols, a gain of more than 1.0 dB is achieved using optimized codes. For this code optimization, the mixing property of EXIT functions has to be modified to the case of binary message......-passing decoders....
Track-stitching using graphical models and message passing
Van der Merwe, LJ
2013-07-01
Full Text Available In order to stitch tracks together, two tasks are required, namely tracking and track stitching. In this study track stitching is performed using a graphical model and message passing (belief propagation) approach. Tracks are modelled as nodes in a...
A message passing algorithm for the evaluation of social influence
Vassio, Luca; Fagnani, Fabio; Frasca, Paolo; Ozdaglar, Asuman
2014-01-01
In this paper, we define a new measure of node centrality in social networks, the Harmonic Influence Centrality, which emerges naturally in the study of social influence over networks. Next, we introduce a distributed message passing algorithm to compute the Harmonic Influence Centrality of each
Portable parallel programming in a Fortran environment
May, E.N.
1989-01-01
Experience using the Argonne-developed PARMACs macro package to implement a portable parallel programming environment is described. Fortran programs with intrinsic parallelism of coarse and medium granularity are easily converted to parallel programs which are portable among a number of commercially available parallel processors in the class of shared-memory bus-based and local-memory network based MIMD processors. The parallelism is implemented using standard UNIX (tm) tools and a small number of easily understood synchronization concepts (monitors and message-passing techniques) to construct and coordinate multiple cooperating processes on one or many processors. Benchmark results are presented for parallel computers such as the Alliant FX/8, the Encore MultiMax, the Sequent Balance, the Intel iPSC/2 Hypercube and a network of Sun 3 workstations. These parallel machines are typical MIMD types with from 8 to 30 processors, each rated at from 1 to 10 MIPS processing power. The demonstration code used for this work is a Monte Carlo simulation of the response to photons of a ''nearly realistic'' lead, iron and plastic electromagnetic and hadronic calorimeter, using the EGS4 code system. 6 refs., 2 figs., 2 tabs
Analysis and Design of Binary Message-Passing Decoders
Lechner, Gottfried; Pedersen, Troels; Kramer, Gerhard
2012-01-01
Binary message-passing decoders for low-density parity-check (LDPC) codes are studied by using extrinsic information transfer (EXIT) charts. The channel delivers hard or soft decisions and the variable node decoder performs all computations in the L-value domain. A hard decision channel results...... message-passing decoders. Finally, it is shown that errors on cycles consisting only of degree two and three variable nodes cannot be corrected and a necessary and sufficient condition for the existence of a cycle-free subgraph is derived....... in the well-know Gallager B algorithm, and increasing the output alphabet from hard decisions to two bits yields a gain of more than 1.0 dB in the required signal to noise ratio when using optimized codes. The code optimization requires adapting the mixing property of EXIT functions to the case of binary...
A Model for Speedup of Parallel Programs
1997-01-01
Sanjeev. K Setia . The interaction between mem- ory allocation and adaptive partitioning in message- passing multicomputers. In IPPS Workshop on Job...Scheduling Strategies for Parallel Processing, pages 89{99, 1995. [15] Sanjeev K. Setia and Satish K. Tripathi. A compar- ative analysis of static
MPI_XSTAR: MPI-based parallelization of XSTAR program
Danehkar, A.
2017-12-01
MPI_XSTAR parallelizes execution of multiple XSTAR runs using Message Passing Interface (MPI). XSTAR (ascl:9910.008), part of the HEASARC's HEAsoft (ascl:1408.004) package, calculates the physical conditions and emission spectra of ionized gases. MPI_XSTAR invokes XSTINITABLE from HEASoft to generate a job list of XSTAR commands for given physical parameters. The job list is used to make directories in ascending order, where each individual XSTAR is spawned on each processor and outputs are saved. HEASoft's XSTAR2TABLE program is invoked upon the contents of each directory in order to produce table model FITS files for spectroscopy analysis tools.
S-AMP: Approximate Message Passing for General Matrix Ensembles
Cakmak, Burak; Winther, Ole; Fleury, Bernard H.
2014-01-01
the approximate message-passing (AMP) algorithm to general matrix ensembles with a well-defined large system size limit. The generalization is based on the S-transform (in free probability) of the spectrum of the measurement matrix. Furthermore, we show that the optimality of S-AMP follows directly from its......We propose a novel iterative estimation algorithm for linear observation models called S-AMP. The fixed points of S-AMP are the stationary points of the exact Gibbs free energy under a set of (first- and second-) moment consistency constraints in the large system limit. S-AMP extends...
Fault-tolerant Agreement in Synchronous Message-passing Systems
Raynal, Michel
2010-01-01
The present book focuses on the way to cope with the uncertainty created by process failures (crash, omission failures and Byzantine behavior) in synchronous message-passing systems (i.e., systems whose progress is governed by the passage of time). To that end, the book considers fundamental problems that distributed synchronous processes have to solve. These fundamental problems concern agreement among processes (if processes are unable to agree in one way or another in presence of failures, no non-trivial problem can be solved). They are consensus, interactive consistency, k-set agreement an
Design of a Message Passing Model for Use in a Heterogeneous CPU-NFP Framework for Network Analytics
Pennefather, S
2017-09-01
Full Text Available of applications written in the Go programming language to be executed on a Network Flow Processor (NFP) for enhanced performance. This paper explores the need and feasibility of implementing a message passing model for data transmission between the NFP and CPU...
Introduction to parallel programming
Brawer, Steven
1989-01-01
Introduction to Parallel Programming focuses on the techniques, processes, methodologies, and approaches involved in parallel programming. The book first offers information on Fortran, hardware and operating system models, and processes, shared memory, and simple parallel programs. Discussions focus on processes and processors, joining processes, shared memory, time-sharing with multiple processors, hardware, loops, passing arguments in function/subroutine calls, program structure, and arithmetic expressions. The text then elaborates on basic parallel programming techniques, barriers and race
Weighted community detection and data clustering using message passing
Shi, Cheng; Liu, Yanchen; Zhang, Pan
2018-03-01
Grouping objects into clusters based on the similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message-passing algorithms and spectral algorithms proposed for an unweighted community detection problem, we develop a non-parametric method based on statistical physics, by mapping the problem to the Potts model at the critical temperature of spin-glass transition and applying belief propagation to solve the marginals corresponding to the Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks. In the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. In the clustering problem, where the data were generated by mixture models in the sparse regime, we show that our method works all the way down to the theoretical limit of detectability and gives accuracy very close to that of the optimal Bayesian inference. In the semi-supervised clustering problem, our method only needs several labels to work perfectly in classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations which heavily reduce the computation complexity in dense networks but give almost the same performance as belief propagation.
Azmy, Y.Y.
1997-01-01
The effect of three communication schemes for solving Arbitrarily High Order Transport (AHOT) methods of the Nodal type on parallel performance is examined via direct measurements and performance models. The target architecture in this study is Oak Ridge National Laboratory's 128 node Paragon XP/S 5 computer and the parallelization is based on the Parallel Virtual Machine (PVM) library. However, the conclusions reached can be easily generalized to a large class of message passing platforms and communication software. The three schemes considered here are: (1) PVM's global operations (broadcast and reduce) which utilizes the Paragon's native corresponding operations based on a spanning tree routing; (2) the Bucket algorithm wherein the angular domain decomposition of the mesh sweep is complemented with a spatial domain decomposition of the accumulation process of the scalar flux from the angular flux and the convergence test; (3) a distributed memory version of the Bucket algorithm that pushes the spatial domain decomposition one step farther by actually distributing the fixed source and flux iterates over the memories of the participating processes. Their conclusion is that the Bucket algorithm is the most efficient of the three if all participating processes have sufficient memories to hold the entire problem arrays. Otherwise, the third scheme becomes necessary at an additional cost to speedup and parallel efficiency that is quantifiable via the parallel performance model
MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program
Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.
2018-02-01
We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution, however, the efficiency in terms of the computing resource usage decreases with increasing the number of processors used in the parallel computing.
A Message-Passing Hardware/Software Cosimulation Environment for Reconfigurable Computing Systems
Manuel Saldaña
2009-01-01
Full Text Available High-performance reconfigurable computers (HPRCs provide a mix of standard processors and FPGAs to collectively accelerate applications. This introduces new design challenges, such as the need for portable programming models across HPRCs and system-level verification tools. To address the need for cosimulating a complete heterogeneous application using both software and hardware in an HPRC, we have created a tool called the Message-passing Simulation Framework (MSF. We have used it to simulate and develop an interface enabling an MPI-based approach to exchange data between X86 processors and hardware engines inside FPGAs. The MSF can also be used as an application development tool that enables multiple FPGAs in simulation to exchange messages amongst themselves and with X86 processors. As an example, we simulate a LINPACK benchmark hardware core using an Intel-FSB-Xilinx-FPGA platform to quickly prototype the hardware, to test the communications. and to verify the benchmark results.
Testing New Programming Paradigms with NAS Parallel Benchmarks
Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.
2000-01-01
Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. Recent years new effort has been made in defining new parallel programming paradigms. The best examples are: HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplify the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy, however, due to the immaturity of compiler technology its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promisses portability in a heterogeneous environment and offers possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage
Ha, Jeongmok; Jeong, Hong
2016-07-01
This study investigates the directed acyclic subgraph (DAS) algorithm, which is used to solve discrete labeling problems much more rapidly than other Markov-random-field-based inference methods but at a competitive accuracy. However, the mechanism by which the DAS algorithm simultaneously achieves competitive accuracy and fast execution speed, has not been elucidated by a theoretical derivation. We analyze the DAS algorithm by comparing it with a message passing algorithm. Graphical models, inference methods, and energy-minimization frameworks are compared between DAS and message passing algorithms. Moreover, the performances of DAS and other message passing methods [sum-product belief propagation (BP), max-product BP, and tree-reweighted message passing] are experimentally compared.
Parallel Programming with Intel Parallel Studio XE
Blair-Chappell , Stephen
2012-01-01
Optimize code for multi-core processors with Intel's Parallel Studio Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the
Application Portable Parallel Library
Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott
1995-01-01
Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.
Baron, E.; Hauschildt, Peter H.
1998-01-01
We describe an important addition to the parallel implementation of our generalized nonlocal thermodynamic equilibrium (NLTE) stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data and task parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divided the work spatially or by spectral lines, that is, distributing the radial zones, individual spectral lines, or characteristic rays among different processors and employ, in addition, task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000 - 300,000) and hence parallelization over wavelength can lead both to considerable speedup in calculation time and the ability to make use of the aggregate memory available on massively parallel supercomputers. Here, we describe an implementation of a pipelined design for the wavelength parallelization of PHOENIX, where the necessary data from the processor working on a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known. Our implementation uses a MIMD design based on a relatively small number of standard message passing interface (MPI) library calls and is fully portable between serial and parallel computers. copyright 1998 The American Astronomical Society
McMPI – a managed-code message passing interface library for high performance communication in C#
Holmes, Daniel John
2012-01-01
This work endeavours to achieve technology transfer between established best-practice in academic high-performance computing and current techniques in commercial high-productivity computing. It shows that a credible high-performance message-passing communication library, with semantics and syntax following the Message-Passing Interface (MPI) Standard, can be built in pure C# (one of the .Net suite of computer languages). Message-passing has been the dominant paradigm in high-pe...
Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts
NONE
1997-12-31
This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.
Neighbourhood-consensus message passing and its potentials in image processing applications
Ružic, Tijana; Pižurica, Aleksandra; Philips, Wilfried
2011-03-01
In this paper, a novel algorithm for inference in Markov Random Fields (MRFs) is presented. Its goal is to find approximate maximum a posteriori estimates in a simple manner by combining neighbourhood influence of iterated conditional modes (ICM) and message passing of loopy belief propagation (LBP). We call the proposed method neighbourhood-consensus message passing because a single joint message is sent from the specified neighbourhood to the central node. The message, as a function of beliefs, represents the agreement of all nodes within the neighbourhood regarding the labels of the central node. This way we are able to overcome the disadvantages of reference algorithms, ICM and LBP. On one hand, more information is propagated in comparison with ICM, while on the other hand, the huge amount of pairwise interactions is avoided in comparison with LBP by working with neighbourhoods. The idea is related to the previously developed iterated conditional expectations algorithm. Here we revisit it and redefine it in a message passing framework in a more general form. The results on three different benchmarks demonstrate that the proposed technique can perform well both for binary and multi-label MRFs without any limitations on the model definition. Furthermore, it manifests improved performance over related techniques either in terms of quality and/or speed.
An efficient communication scheme for solving Sn equations on message-passing multiprocessors
Azmy, Y.Y.
1993-01-01
Early models of Intel's hypercube multiprocessors, e.g., the iPSC/1 and iPSC/2, were characterized by the high latency of message passing. This relatively weak dependence of the communication penalty on the size of messages, in contrast to its strong dependence on the number of messages, justified using the Fan-in Fan-out algorithm (which implements a minimum spanning tree path) to perform global operations, such as global sums, etc. Recent models of message-passing computers, such as the iPSC/860 and the Paragon, have been found to possess much smaller latency, thus forcing a reexamination of the issue of performance optimization with respect to communication schemes. Essentially, the Fan-in Fan-out scheme minimizes the number of nonsimultaneous messages sent but not the volume of data traffic across the network. Furthermore, if a global operation is performed in conjunction with the message passing, a large fraction of the attached nodes remains idle as the number of utilized processors is halved in each step of the process. On the other hand, the Recursive Halving scheme offers the smallest communication cost for global operations but has some drawbacks
Parallel programming with Python
Palach, Jan
2014-01-01
A fast, easy-to-follow and clear tutorial to help you develop Parallel computing systems using Python. Along with explaining the fundamentals, the book will also introduce you to slightly advanced concepts and will help you in implementing these techniques in the real world. If you are an experienced Python programmer and are willing to utilize the available computing resources by parallelizing applications in a simple way, then this book is for you. You are required to have a basic knowledge of Python development to get the most of this book.
Abstract Level Parallelization of Finite Difference Methods
Edwin Vollebregt
1997-01-01
Full Text Available A formalism is proposed for describing finite difference calculations in an abstract way. The formalism consists of index sets and stencils, for characterizing the structure of sets of data items and interactions between data items (“neighbouring relations”. The formalism provides a means for lifting programming to a more abstract level. This simplifies the tasks of performance analysis and verification of correctness, and opens the way for automaticcode generation. The notation is particularly useful in parallelization, for the systematic construction of parallel programs in a process/channel programming paradigm (e.g., message passing. This is important because message passing, unfortunately, still is the only approach that leads to acceptable performance for many more unstructured or irregular problems on parallel computers that have non-uniform memory access times. It will be shown that the use of index sets and stencils greatly simplifies the determination of which data must be exchanged between different computing processes.
Practical parallel programming
Bauer, Barr E
2014-01-01
This is the book that will teach programmers to write faster, more efficient code for parallel processors. The reader is introduced to a vast array of procedures and paradigms on which actual coding may be based. Examples and real-life simulations using these devices are presented in C and FORTRAN.
Oliker, Leonid; Heber, Gerd; Biswas, Rupak
2000-01-01
The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
A message-passing approach to random constraint satisfaction problems with growing domains
Zhao, Chunyan; Zheng, Zhiming; Zhou, Haijun; Xu, Ke
2011-01-01
Message-passing algorithms based on belief propagation (BP) are implemented on a random constraint satisfaction problem (CSP) referred to as model RB, which is a prototype of hard random CSPs with growing domain size. In model RB, the number of candidate discrete values (the domain size) of each variable increases polynomially with the variable number N of the problem formula. Although the satisfiability threshold of model RB is exactly known, finding solutions for a single problem formula is quite challenging and attempts have been limited to cases of N ∼ 10 2 . In this paper, we propose two different kinds of message-passing algorithms guided by BP for this problem. Numerical simulations demonstrate that these algorithms allow us to find a solution for random formulas of model RB with constraint tightness slightly less than p cr , the threshold value for the satisfiability phase transition. To evaluate the performance of these algorithms, we also provide a local search algorithm (random walk) as a comparison. Besides this, the simulated time dependence of the problem size N and the entropy of the variables for growing domain size are discussed
Writing parallel programs that work
CERN. Geneva
2012-01-01
Serial algorithms typically run inefficiently on parallel machines. This may sound like an obvious statement, but it is the root cause of why parallel programming is considered to be difficult. The current state of the computer industry is still that almost all programs in existence are serial. This talk will describe the techniques used in the Intel Parallel Studio to provide a developer with the tools necessary to understand the behaviors and limitations of the existing serial programs. Once the limitations are known the developer can refactor the algorithms and reanalyze the resulting programs with the tools in the Intel Parallel Studio to create parallel programs that work. About the speaker Paul Petersen is a Sr. Principal Engineer in the Software and Solutions Group (SSG) at Intel. He received a Ph.D. degree in Computer Science from the University of Illinois in 1993. After UIUC, he was employed at Kuck and Associates, Inc. (KAI) working on auto-parallelizing compiler (KAP), and was involved in th...
Ohshima, Hiroyuki
2001-10-01
A whole core thermal-hydraulic analysis program ACT is being developed for the purpose of evaluating detailed in-core thermal hydraulic phenomena of fast reactors including the effect of the flow between wrapper-tube walls (inter-wrapper flow) under various reactor operation conditions. As appropriate boundary conditions in addition to a detailed modeling of the core are essential for accurate simulations of in-core thermal hydraulics, ACT consists of not only fuel assembly and inter-wrapper flow analysis modules but also a heat transport system analysis module that gives response of the plant dynamics to the core model. This report describes incorporation of a simplified model to the fuel assembly analysis module and program parallelization by a message passing method toward large-scale simulations. ACT has a fuel assembly analysis module which can simulate a whole fuel pin bundle in each fuel assembly of the core and, however, it may take much CPU time for a large-scale core simulation. Therefore, a simplified fuel assembly model that is thermal-hydraulically equivalent to the detailed one has been incorporated in order to save the simulation time and resources. This simplified model is applied to several parts of fuel assemblies in a core where the detailed simulation results are not required. With regard to the program parallelization, the calculation load and the data flow of ACT were analyzed and the optimum parallelization has been done including the improvement of the numerical simulation algorithm of ACT. Message Passing Interface (MPI) is applied to data communication between processes and synchronization in parallel calculations. Parallelized ACT was verified through a comparison simulation with the original one. In addition to the above works, input manuals of the core analysis module and the heat transport system analysis module have been prepared. (author)
Teaching Scientific Computing: A Model-Centered Approach to Pipeline and Parallel Programming with C
Vladimiras Dolgopolovas
2015-01-01
Full Text Available The aim of this study is to present an approach to the introduction into pipeline and parallel computing, using a model of the multiphase queueing system. Pipeline computing, including software pipelines, is among the key concepts in modern computing and electronics engineering. The modern computer science and engineering education requires a comprehensive curriculum, so the introduction to pipeline and parallel computing is the essential topic to be included in the curriculum. At the same time, the topic is among the most motivating tasks due to the comprehensive multidisciplinary and technical requirements. To enhance the educational process, the paper proposes a novel model-centered framework and develops the relevant learning objects. It allows implementing an educational platform of constructivist learning process, thus enabling learners’ experimentation with the provided programming models, obtaining learners’ competences of the modern scientific research and computational thinking, and capturing the relevant technical knowledge. It also provides an integral platform that allows a simultaneous and comparative introduction to pipelining and parallel computing. The programming language C for developing programming models and message passing interface (MPI and OpenMP parallelization tools have been chosen for implementation.
Partition function expansion on region graphs and message-passing equations
Zhou, Haijun; Wang, Chuang; Xiao, Jing-Qing; Bi, Zedong
2011-01-01
Disordered and frustrated graphical systems are ubiquitous in physics, biology, and information science. For models on complete graphs or random graphs, deep understanding has been achieved through the mean-field replica and cavity methods. But finite-dimensional 'real' systems remain very challenging because of the abundance of short loops and strong local correlations. A statistical mechanics theory is constructed in this paper for finite-dimensional models based on the mathematical framework of the partition function expansion and the concept of region graphs. Rigorous expressions for the free energy and grand free energy are derived. Message-passing equations on the region graph, such as belief propagation and survey propagation, are also derived rigorously. (letter)
P3T+: A Performance Estimator for Distributed and Parallel Programs
T. Fahringer
2000-01-01
Full Text Available Developing distributed and parallel programs on today's multiprocessor architectures is still a challenging task. Particular distressing is the lack of effective performance tools that support the programmer in evaluating changes in code, problem and machine sizes, and target architectures. In this paper we introduce P3T+ which is a performance estimator for mostly regular HPF (High Performance Fortran programs but partially covers also message passing programs (MPI. P3T+ is unique by modeling programs, compiler code transformations, and parallel and distributed architectures. It computes at compile-time a variety of performance parameters including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. Several novel technologies are employed to compute these parameters: loop iteration spaces, array access patterns, and data distributions are modeled by employing highly effective symbolic analysis. Communication is estimated by simulating the behavior of a communication library used by the underlying compiler. Computation times are predicted through pre-measured kernels on every target architecture of interest. We carefully model most critical architecture specific factors such as cache lines sizes, number of cache lines available, startup times, message transfer time per byte, etc. P3T+ has been implemented and is closely integrated with the Vienna High Performance Compiler (VFC to support programmers develop parallel and distributed applications. Experimental results for realistic kernel codes taken from real-world applications are presented to demonstrate both accuracy and usefulness of P3T+.
Methodologies and Tools for Tuning Parallel Programs: 80% Art, 20% Science, and 10% Luck
Yan, Jerry C.; Bailey, David (Technical Monitor)
1996-01-01
The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. In the past few years, the ubiquitous introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance instrumentation/monitor/tuning technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.
A new shared-memory programming paradigm for molecular dynamics simulations on the Intel Paragon
D'Azevedo, E.F.; Romine, C.H.
1994-12-01
This report describes the use of shared memory emulation with DOLIB (Distributed Object Library) to simplify parallel programming on the Intel Paragon. A molecular dynamics application is used as an example to illustrate the use of the DOLIB shared memory library. SOTON-PAR, a parallel molecular dynamics code with explicit message-passing using a Lennard-Jones 6-12 potential, is rewritten using DOLIB primitives. The resulting code has no explicit message primitives and resembles a serial code. The new code can perform dynamic load balancing and achieves better performance than the original parallel code with explicit message-passing
Quealy, Angela; Cole, Gary L.; Blech, Richard A.
1993-01-01
The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.
Khoshfetrat Pakazad, Sina; Hansson, Anders; Andersen, Martin S.
2017-01-01
In this paper, we propose a distributed algorithm for solving coupled problems with chordal sparsity or an inherent tree structure which relies on primal–dual interior-point methods. We achieve this by distributing the computations at each iteration, using message-passing. In comparison to existi...
Sakata, Ayaka; Xu, Yingying
2018-03-01
We analyse a linear regression problem with nonconvex regularization called smoothly clipped absolute deviation (SCAD) under an overcomplete Gaussian basis for Gaussian random data. We propose an approximate message passing (AMP) algorithm considering nonconvex regularization, namely SCAD-AMP, and analytically show that the stability condition corresponds to the de Almeida-Thouless condition in spin glass literature. Through asymptotic analysis, we show the correspondence between the density evolution of SCAD-AMP and the replica symmetric (RS) solution. Numerical experiments confirm that for a sufficiently large system size, SCAD-AMP achieves the optimal performance predicted by the replica method. Through replica analysis, a phase transition between replica symmetric and replica symmetry breaking (RSB) region is found in the parameter space of SCAD. The appearance of the RS region for a nonconvex penalty is a significant advantage that indicates the region of smooth landscape of the optimization problem. Furthermore, we analytically show that the statistical representation performance of the SCAD penalty is better than that of \
Using Partial Reconfiguration and Message Passing to Enable FPGA-Based Generic Computing Platforms
Manuel Saldaña
2012-01-01
Full Text Available Partial reconfiguration (PR is an FPGA feature that allows the modification of certain parts of an FPGA while the rest of the system continues to operate without disruption. This distinctive characteristic of FPGAs has many potential benefits but also challenges. The lack of good CAD tools and the deep hardware knowledge requirement result in a hard-to-use feature. In this paper, the new partition-based Xilinx PR flow is used to incorporate PR within our MPI-based message-passing framework to allow hardware designers to create template bitstreams, which are predesigned, prerouted, generic bitstreams that can be reused for multiple applications. As an example of the generality of this approach, four different applications that use the same template bitstream are run consecutively, with a PR operation performed at the beginning of each application to instantiate the desired application engine. We demonstrate a simplified, reusable, high-level, and portable PR interface for X86-FPGA hybrid machines. PR issues such as local resets of reconfigurable modules and context saving and restoring are addressed in this paper followed by some examples and preliminary PR overhead measurements.
Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems. Ph.D. Thesis
Wang, Yi-Min
1993-01-01
Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called recovery line. If the processes are allowed to take uncoordinated checkpoints, the above rollback propagation may result in the domino effect which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated.
Refinement of Parallel and Reactive Programs
Back, R. J. R.
1992-01-01
We show how to apply the refinement calculus to stepwise refinement of parallel and reactive programs. We use action systems as our basic program model. Action systems are sequential programs which can be implemented in a parallel fashion. Hence refinement calculus methods, originally developed for sequential programs, carry over to the derivation of parallel programs. Refinement of reactive programs is handled by data refinement techniques originally developed for the sequential refinement c...
About Parallel Programming: Paradigms, Parallel Execution and Collaborative Systems
Loredana MOCEAN
2009-01-01
Full Text Available In the last years, there were made efforts for delineation of a stabile and unitary frame, where the problems of logical parallel processing must find solutions at least at the level of imperative languages. The results obtained by now are not at the level of the made efforts. This paper wants to be a little contribution at these efforts. We propose an overview in parallel programming, parallel execution and collaborative systems.
Structured Parallel Programming Patterns for Efficient Computation
McCool, Michael; Robison, Arch
2012-01-01
Programming is now parallel programming. Much as structured programming revolutionized traditional serial programming decades ago, a new kind of structured programming, based on patterns, is relevant to parallel programming today. Parallel computing experts and industry insiders Michael McCool, Arch Robison, and James Reinders describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give detailed concrete examples using multiple programming models. Examples are primarily given using two of th
Lawson, Gary; Sosonkina, Masha; Baurle, Robert; Hammond, Dana
2017-01-01
In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 11 was measured for MPI+OpenMP.
Parallel programming with Easy Java Simulations
Esquembre, F.; Christian, W.; Belloni, M.
2018-01-01
Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we decrease the barrier to parallel programming by using a java-based programming environment to treat problems in the usual undergraduate curriculum. We use the easy java simulations programming and authoring tool to create the program's graphical user interface together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.
Parallel community climate model: Description and user`s guide
Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H. [and others
1996-07-15
This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
Pthreads vs MPI Parallel Performance of Angular-Domain Decomposed S
Azmy, Y.Y.; Barnett, D.A.
2000-01-01
Two programming models for parallelizing the Angular Domain Decomposition (ADD) of the discrete ordinates (S n ) approximation of the neutron transport equation are examined. These are the shared memory model based on the POSIX threads (Pthreads) standard, and the message passing model based on the Message Passing Interface (MPI) standard. These standard libraries are available on most multiprocessor platforms thus making the resulting parallel codes widely portable. The question is: on a fixed platform, and for a particular code solving a given test problem, which of the two programming models delivers better parallel performance? Such comparison is possible on Symmetric Multi-Processors (SMP) architectures in which several CPUs physically share a common memory, and in addition are capable of emulating message passing functionality. Implementation of the two-dimensional,(S n ), Arbitrarily High Order Transport (AHOT) code for solving neutron transport problems using these two parallelization models is described. Measured parallel performance of each model on the COMPAQ AlphaServer 8400 and the SGI Origin 2000 platforms is described, and comparison of the observed speedup for the two programming models is reported. For the case presented in this paper it appears that the MPI implementation scales better than the Pthreads implementation on both platforms
Compiling Scientific Programs for Scalable Parallel Systems
Kennedy, Ken
2001-01-01
...). The research performed in this project included new techniques for recognizing implicit parallelism in sequential programs, a powerful and precise set-based framework for analysis and transformation...
Towards deductive verification of MPI programs against session types
Eduardo R. B. Marques
2013-12-01
Full Text Available The Message Passing Interface (MPI is the de facto standard message-passing infrastructure for developing parallel applications. Two decades after the first version of the library specification, MPI-based applications are nowadays routinely deployed on super and cluster computers. These applications, written in C or Fortran, exhibit intricate message passing behaviours, making it hard to statically verify important properties such as the absence of deadlocks. Our work builds on session types, a theory for describing protocols that provides for correct-by-construction guarantees in this regard. We annotate MPI primitives and C code with session type contracts, written in the language of a software verifier for C. Annotated code is then checked for correctness with the software verifier. We present preliminary results and discuss the challenges that lie ahead for verifying realistic MPI program compliance against session types.
PDDP, A Data Parallel Programming Model
Karen H. Warren
1996-01-01
Full Text Available PDDP, the parallel data distribution preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements high-performance Fortran-compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.
Massively Parallel Finite Element Programming
Heister, Timo
2010-01-01
Today\\'s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
Massively Parallel Finite Element Programming
Heister, Timo; Kronbichler, Martin; Bangerth, Wolfgang
2010-01-01
Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
Language constructs for modular parallel programs
Foster, I.
1996-03-01
We describe programming language constructs that facilitate the application of modular design techniques in parallel programming. These constructs allow us to isolate resource management and processor scheduling decisions from the specification of individual modules, which can themselves encapsulate design decisions concerned with concurrence, communication, process mapping, and data distribution. This approach permits development of libraries of reusable parallel program components and the reuse of these components in different contexts. In particular, alternative mapping strategies can be explored without modifying other aspects of program logic. We describe how these constructs are incorporated in two practical parallel programming languages, PCN and Fortran M. Compilers have been developed for both languages, allowing experimentation in substantial applications.
Shrimankar, D. D.; Sathe, S. R.
2016-01-01
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868
Shrimankar, D D; Sathe, S R
2016-01-01
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures.
Experiences in Data-Parallel Programming
Terry W. Clark
1997-01-01
Full Text Available To efficiently parallelize a scientific application with a data-parallel compiler requires certain structural properties in the source program, and conversely, the absence of others. A recent parallelization effort of ours reinforced this observation and motivated this correspondence. Specifically, we have transformed a Fortran 77 version of GROMOS, a popular dusty-deck program for molecular dynamics, into Fortran D, a data-parallel dialect of Fortran. During this transformation we have encountered a number of difficulties that probably are neither limited to this particular application nor do they seem likely to be addressed by improved compiler technology in the near future. Our experience with GROMOS suggests a number of points to keep in mind when developing software that may at some time in its life cycle be parallelized with a data-parallel compiler. This note presents some guidelines for engineering data-parallel applications that are compatible with Fortran D or High Performance Fortran compilers.
Productive Parallel Programming: The PCN Approach
Ian Foster
1992-01-01
Full Text Available We describe the PCN programming system, focusing on those features designed to improve the productivity of scientists and engineers using parallel supercomputers. These features include a simple notation for the concise specification of concurrent algorithms, the ability to incorporate existing Fortran and C code into parallel applications, facilities for reusing parallel program components, a portable toolkit that allows applications to be developed on a workstation or small parallel computer and run unchanged on supercomputers, and integrated debugging and performance analysis tools. We survey representative scientific applications and identify problem classes for which PCN has proved particularly useful.
High-energy physics software parallelization using database techniques
Argante, E.; Van der Stok, P.D.V.; Willers, I.
1997-01-01
A programming model for software parallelization, called CoCa, is introduced that copes with problems caused by typical features of high-energy physics software. By basing CoCa on the database transaction paradigm, the complexity induced by the parallelization is for a large part transparent to the programmer, resulting in a higher level of abstraction than the native message passing software. CoCa is implemented on a Meiko CS-2 and on a SUN SPARCcenter 2000 parallel computer. On the CS-2, the performance is comparable with the performance of native PVM and MPI. (orig.)
Parallel Monte Carlo simulation of aerosol dynamics
Zhou, K.; He, Z.; Xiao, M.; Zhang, Z.
2014-01-01
is simulated with a stochastic method (Marcus-Lushnikov stochastic process). Operator splitting techniques are used to synthesize the deterministic and stochastic parts in the algorithm. The algorithm is parallelized using the Message Passing Interface (MPI
Design strategies for irregularly adapting parallel applications
Oliker, Leonid; Biswas, Rupak; Shan, Hongzhang; Sing, Jaswinder Pal
2000-01-01
Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance of dynamically adapting computations. In this work, we examine two major classes of adaptive applications, under five competing programming methodologies and four leading parallel architectures. Results indicate that it is possible to achieve message-passing performance using shared-memory programming techniques by carefully following the same high level strategies. Adaptive applications have computational work loads and communication patterns which change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementations of such adaptive applications are therefore a challenging task. This work examines the implementation of two typical adaptive applications, Dynamic Remeshing and N-Body, across various programming paradigms and architectural platforms. We compare several critical factors of the parallel code development, including performance, programmability, scalability, algorithmic development, and portability
Parallel processor programs in the Federal Government
Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.
1985-01-01
In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.
The kpx, a program analyzer for parallelization
Matsuyama, Yuji; Orii, Shigeo; Ota, Toshiro; Kume, Etsuo; Aikawa, Hiroshi.
1997-03-01
The kpx is a program analyzer, developed as a common technological basis for promoting parallel processing. The kpx consists of three tools. The first is ktool, that shows how much execution time is spent in program segments. The second is ptool, that shows parallelization overhead on the Paragon system. The last is xtool, that shows parallelization overhead on the VPP system. The kpx, designed to work for any FORTRAN cord on any UNIX computer, is confirmed to work well after testing on Paragon, SP2, SR2201, VPP500, VPP300, Monte-4, SX-4 and T90. (author)
Speedup predictions on large scientific parallel programs
Williams, E.; Bobrowicz, F.
1985-01-01
How much speedup can we expect for large scientific parallel programs running on supercomputers. For insight into this problem we extend the parallel processing environment currently existing on the Cray X-MP (a shared memory multiprocessor with at most four processors) to a simulated N-processor environment, where N greater than or equal to 1. Several large scientific parallel programs from Los Alamos National Laboratory were run in this simulated environment, and speedups were predicted. A speedup of 14.4 on 16 processors was measured for one of the three most used codes at the Laboratory
Automatic Parallelization Tool: Classification of Program Code for Parallel Computing
Mustafa Basthikodi
2016-04-01
Full Text Available Performance growth of single-core processors has come to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks along with Graphical Processing Units empowered to enhance parallelism broadly. Couples of compilers are updated to developing challenges forsynchronization and threading issues. Appropriate program and algorithm classifications will have advantage to a great extent to the group of software engineers to get opportunities for effective parallelization. In present work we investigated current species for classification of algorithms, in that related work on classification is discussed along with the comparison of issues that challenges the classification. The set of algorithms are chosen which matches the structure with different issues and perform given task. We have tested these algorithms utilizing existing automatic species extraction toolsalong with Bones compiler. We have added functionalities to existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by original speciesof algorithms. We executed new theories into the device, empowering automatic characterization of program code.
HPC parallel programming model for gyrokinetic MHD simulation
Naitou, Hiroshi; Yamada, Yusuke; Tokuda, Shinji; Ishii, Yasutomo; Yagi, Masatoshi
2011-01-01
The 3-dimensional gyrokinetic PIC (particle-in-cell) code for MHD simulation, Gpic-MHD, was installed on SR16000 (“Plasma Simulator”), which is a scalar cluster system consisting of 8,192 logical cores. The Gpic-MHD code advances particle and field quantities in time. In order to distribute calculations over large number of logical cores, the total simulation domain in cylindrical geometry was broken up into N DD-r × N DD-z (number of radial decomposition times number of axial decomposition) small domains including approximately the same number of particles. The axial direction was uniformly decomposed, while the radial direction was non-uniformly decomposed. N RP replicas (copies) of each decomposed domain were used (“particle decomposition”). The hybrid parallelization model of multi-threads and multi-processes was employed: threads were parallelized by the auto-parallelization and N DD-r × N DD-z × N RP processes were parallelized by MPI (message-passing interface). The parallelization performance of Gpic-MHD was investigated for the medium size system of N r × N θ × N z = 1025 × 128 × 128 mesh with 4.196 or 8.192 billion particles. The highest speed for the fixed number of logical cores was obtained for two threads, the maximum number of N DD-z , and optimum combination of N DD-r and N RP . The observed optimum speeds demonstrated good scaling up to 8,192 logical cores. (author)
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.
2003-01-01
Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
Shared memory and message passing revisited in the many-core era
CERN. Geneva
2016-01-01
In the 70s, Edsgar Dijkstra, Per Brinch Hansen and C.A.R Hoare introduced the fundamental concepts for concurrent computing. It was clear that concrete communication mechanisms were required in order to achieve effective concurrency. Whether you're developing a multithreaded program running on a single node, or a distributed system spanning over hundreds of thousands cores, the choice of the communication mechanism for your system must be done intelligently because of the implicit programmability, performance and scalability trade-offs. With the emergence of many-core computing architectures many assumptions may not be true anymore. In this talk we will try to provide insight on the characteristics of these communication models by providing basic theoretical background and then focus on concrete practical examples based on indicative use case scenarios. The case studies of this presentation cover popular programming models, operating systems and concurrency frameworks in the context of many-core processors.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
1999-06-01
and M.A. Castillo. "Checkpointing through garbage collection." Technical report. Departamento de Ciencia de la Computation, Escuela de Ingenieria ...between consecutive checkpoints. It can be implemented by using the dirty-bit of the memory protection hardware or by emulating a dirty-bit in software [4...compare the program’s state with the previous checkpoint in software , and writing the difference in a new checkpoint [46]. The required storage and
From sequential to parallel programming with patterns
CERN. Geneva
2018-01-01
To increase in both performance and efficiency, our programming models need to adapt to better exploit modern processors. The classic idioms and patterns for programming such as loops, branches or recursion are the pillars of almost every code and are well known among all programmers. These patterns all have in common that they are sequential in nature. Embracing parallel programming patterns, which allow us to program for multi- and many-core hardware in a natural way, greatly simplifies the task of designing a program that scales and performs on modern hardware, independently of the used programming language, and in a generic way.
Kamatani Naoyuki
2011-05-01
Full Text Available Abstract Background Use of missing genotype imputations and haplotype reconstructions are valuable in genome-wide association studies (GWASs. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan and Han Chinese in Beijing, and Chinese in the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.
Bellerby, Tim
2015-04-01
PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks number of processors
Parallel Volunteer Learning during Youth Programs
Lesmeister, Marilyn K.; Green, Jeremy; Derby, Amy; Bothum, Candi
2012-01-01
Lack of time is a hindrance for volunteers to participate in educational opportunities, yet volunteer success in an organization is tied to the orientation and education they receive. Meeting diverse educational needs of volunteers can be a challenge for program managers. Scheduling a Volunteer Learning Track for chaperones that is parallel to a…
Contributions to computational stereology and parallel programming
Rasmusson, Allan
rotator, even without the need for isotropic sections. To meet the need for computational power to perform image restoration of virtual tissue sections, parallel programming on GPUs has also been part of the project. This has lead to a significant change in paradigm for a previously developed surgical...
Program For Parallel Discrete-Event Simulation
Beckman, Brian C.; Blume, Leo R.; Geiselman, John S.; Presley, Matthew T.; Wedel, John J., Jr.; Bellenot, Steven F.; Diloreto, Michael; Hontalas, Philip J.; Reiher, Peter L.; Weiland, Frederick P.
1991-01-01
User does not have to add any special logic to aid in synchronization. Time Warp Operating System (TWOS) computer program is special-purpose operating system designed to support parallel discrete-event simulation. Complete implementation of Time Warp mechanism. Supports only simulations and other computations designed for virtual time. Time Warp Simulator (TWSIM) subdirectory contains sequential simulation engine interface-compatible with TWOS. TWOS and TWSIM written in, and support simulations in, C programming language.
Moriakov, A.; Vasyukhno, V.; Netecha, M.; Khacheresov, G.
2003-01-01
Powerful supercomputers are available today. MBC-1000M is one of Russian supercomputers that may be used by distant way access. Programs LUCKY and LUCKY C were created to work for multi-processors systems. These programs have algorithms created especially for these computers and used MPI (message passing interface) service for exchanges between processors. LUCKY may resolved shielding tasks by multigroup discreet ordinate method. LUCKY C may resolve critical tasks by same method. Only XYZ orthogonal geometry is available. Under little space steps to approximate discreet operator this geometry may be used as universal one to describe complex geometrical structures. Cross section libraries are used up to P8 approximation by Legendre polynomials for nuclear data in GIT format. Programming language is Fortran-90. 'Vector' processors may be used that lets get a time profit up to 30 times. But unfortunately MBC-1000M has not these processors. Nevertheless sufficient value for efficiency of parallel calculations was obtained under 'space' (LUCKY) and 'space and energy' (LUCKY C ) paralleling. AUTOCAD program is used to control geometry after a treatment of input data. Programs have powerful geometry module, it is a beautiful tool to achieve any geometry. Output results may be processed by graphic programs on personal computer. (authors)
Programming massively parallel processors a hands-on approach
Kirk, David B
2010-01-01
Programming Massively Parallel Processors discusses basic concepts about parallel programming and GPU architecture. ""Massively parallel"" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVI- DIA GPUs. Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA. The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who ...
PSHED: a simplified approach to developing parallel programs
Mahajan, S.M.; Ramesh, K.; Rajesh, K.; Somani, A.; Goel, M.
1992-01-01
This paper presents a simplified approach in the forms of a tree structured computational model for parallel application programs. An attempt is made to provide a standard user interface to execute programs on BARC Parallel Processing System (BPPS), a scalable distributed memory multiprocessor. The interface package called PSHED provides a basic framework for representing and executing parallel programs on different parallel architectures. The PSHED package incorporates concepts from a broad range of previous research in programming environments and parallel computations. (author). 6 refs
Study on MPI/OpenMP hybrid parallelism for Monte Carlo neutron transport code
Liang Jingang; Xu Qi; Wang Kan; Liu Shiwen
2013-01-01
Parallel programming with mixed mode of messages-passing and shared-memory has several advantages when used in Monte Carlo neutron transport code, such as fitting hardware of distributed-shared clusters, economizing memory demand of Monte Carlo transport, improving parallel performance, and so on. MPI/OpenMP hybrid parallelism was implemented based on a one dimension Monte Carlo neutron transport code. Some critical factors affecting the parallel performance were analyzed and solutions were proposed for several problems such as contention access, lock contention and false sharing. After optimization the code was tested finally. It is shown that the hybrid parallel code can reach good performance just as pure MPI parallel program, while it saves a lot of memory usage at the same time. Therefore hybrid parallel is efficient for achieving large-scale parallel of Monte Carlo neutron transport. (authors)
Ricci-Tersenghi, Federico; Zdeborova, Lenka; Zecchina, Riccardo; Tramel, Eric W; Cugliandolo, Leticia F
2015-01-01
This book contains a collection of the presentations that were given in October 2013 at the Les Houches Autumn School on statistical physics, optimization, inference, and message-passing algorithms. In the last decade, there has been increasing convergence of interest and methods between theoretical physics and fields as diverse as probability, machine learning, optimization, and inference problems. In particular, much theoretical and applied work in statistical physics and computer science has relied on the use of message-passing algorithms and their connection to the statistical physics of glasses and spin glasses. For example, both the replica and cavity methods have led to recent advances in compressed sensing, sparse estimation, and random constraint satisfaction, to name a few. This book’s detailed pedagogical lectures on statistical inference, computational complexity, the replica and cavity methods, and belief propagation are aimed particularly at PhD students, post-docs, and young researchers desir...
Altarelli, Fabrizio; Monasson, Remi; Zamponi, Francesco
2007-01-01
For large clause-to-variable ratios, typical K-SAT instances drawn from the uniform distribution have no solution. We argue, based on statistical mechanics calculations using the replica and cavity methods, that rare satisfiable instances from the uniform distribution are very similar to typical instances drawn from the so-called planted distribution, where instances are chosen uniformly between the ones that admit a given solution. It then follows, from a recent article by Feige, Mossel and Vilenchik (2006 Complete convergence of message passing algorithms for some satisfiability problems Proc. Random 2006 pp 339-50), that these rare instances can be easily recognized (in O(log N) time and with probability close to 1) by a simple message-passing algorithm
Evolution of a minimal parallel programming model
Lusk, Ewing; Butler, Ralph; Pieper, Steven C.
2017-01-01
Here, we take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generality and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.
Development of parallel/serial program analyzing tool
Watanabe, Hiroshi; Nagao, Saichi; Takigawa, Yoshio; Kumakura, Toshimasa
1999-03-01
Japan Atomic Energy Research Institute has been developing 'KMtool', a parallel/serial program analyzing tool, in order to promote the parallelization of the science and engineering computation program. KMtool analyzes the performance of program written by FORTRAN77 and MPI, and it reduces the effort for parallelization. This paper describes development purpose, design, utilization and evaluation of KMtool. (author)
A Tutorial on Parallel and Concurrent Programming in Haskell
Peyton Jones, Simon; Singh, Satnam
This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
A portable implementation of ARPACK for distributed memory parallel architectures
Maschhoff, K.J.; Sorensen, D.C.
1996-12-31
ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and Message Passing Interface(MPI).
Pattern-Driven Automatic Parallelization
Christoph W. Kessler
1996-01-01
Full Text Available This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.
Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications
Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid
1999-01-01
The ability to dynamically adapt an unstructured -rid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the view point of portability on various multiprocessor platforms We address this problem by developing PLUM, tin automatic anti architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with a goal to providing a global view of system loads across processors. Experiments on, an SP2 and an Origin2000 demonstrate the portability of our approach which achieves superb load balance at the cost of minimal extra overhead.
Hisley, Dixie
1999-01-01
.... The goals of this report are: (1) to investigate the performance of message passing and loop level parallelization techniques, as they were implemented in the computational fluid dynamics (CFD...
Building Blocks for the Rapid Development of Parallel Simulations, Phase I
National Aeronautics and Space Administration — Scientists need to be able to quickly develop and run parallel simulations without paying the high price of writing low-level message passing codes using compiled...
Parallelization for first principles electronic state calculation program
Watanabe, Hiroshi; Oguchi, Tamio.
1997-03-01
In this report we study the parallelization for First principles electronic state calculation program. The target machines are NEC SX-4 for shared memory type parallelization and FUJITSU VPP300 for distributed memory type parallelization. The features of each parallel machine are surveyed, and the parallelization methods suitable for each are proposed. It is shown that 1.60 times acceleration is achieved with 2 CPU parallelization by SX-4 and 4.97 times acceleration is achieved with 12 PE parallelization by VPP 300. (author)
Step by step parallel programming method for molecular dynamics code
Orii, Shigeo; Ohta, Toshio
1996-07-01
Parallel programming for a numerical simulation program of molecular dynamics is carried out with a step-by-step programming technique using the two phase method. As a result, within the range of a certain computing parameters, it is found to obtain parallel performance by using the level of parallel programming which decomposes the calculation according to indices of do-loops into each processor on the vector parallel computer VPP500 and the scalar parallel computer Paragon. It is also found that VPP500 shows parallel performance in wider range computing parameters. The reason is that the time cost of the program parts, which can not be reduced by the do-loop level of the parallel programming, can be reduced to the negligible level by the vectorization. After that, the time consuming parts of the program are concentrated on less parts that can be accelerated by the do-loop level of the parallel programming. This report shows the step-by-step parallel programming method and the parallel performance of the molecular dynamics code on VPP500 and Paragon. (author)
Parallel phase model : a programming model for high-end parallel machines with manycores.
Wu, Junfeng (Syracuse University, Syracuse, NY); Wen, Zhaofang; Heroux, Michael Allen; Brightwell, Ronald Brian
2009-04-01
This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementation of parallel algorithms to exploit both the parallelism of the many cores and the parallelism at the cluster level. The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism. It includes a few high-level parallel programming language constructs that can be added as an extension to an existing (sequential or parallel) programming language such as C; and the implementation of PPM also includes a light-weight runtime library that runs on top of an existing network communication software layer (e.g. MPI). Design philosophy of PPM and details of the programming abstraction are also presented. Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.
Three dimensional Burn-up program parallelization using socket programming
Haliyati R, Evi; Su'ud, Zaki
2002-01-01
A computer parallelization process was built with a purpose to decrease execution time of a physics program. In this case, a multi computer system was built to be used to analyze burn-up process of a nuclear reactor. This multi computer system was design need using a protocol communication among sockets, i.e. TCP/IP. This system consists of computer as a server and the rest as clients. The server has a main control to all its clients. The server also divides the reactor core geometrically to in parts in accordance with the number of clients, each computer including the server has a task to conduct burn-up analysis of 1/n part of the total reactor core measure. This burn-up analysis was conducted simultaneously and in a parallel way by all computers, so a faster program execution time was achieved close to 1/n times that of one computer. Then an analysis was carried out and states that in order to calculate the density of atoms in a reactor of 91 cm x 91 cm x 116 cm, the usage of a parallel system of 2 computers has the highest efficiency
A. Averbuch
1994-01-01
Full Text Available Parallel elliptic single/multigrid solutions around an aligned and nonaligned body are presented and implemented on two multi-user and single-user shared memory multiprocessors (Sequent Symmetry and MOS and on a distributed memory multiprocessor (a Transputer network. Our parallel implementation uses the Virtual Machine for Muli-Processors (VMMP, a software package that provides a coherent set of services for explicitly parallel application programs running on diverse multiple instruction multiple data (MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. The performance of our algorithm is investigated in detail. It is seen to fit well the above architectures when the number of processors is less than the maximal number of grid points along the axes. In general, the efficiency in the nonaligned case is higher than in the aligned case. Alignment overhead is observed to be up to 200% in the shared-memory case and up to 65% in the message-passing case. We have demonstrated that when using VMMP, the portability of the algorithms is straightforward and efficient.
Ultrascalable petaflop parallel supercomputer
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY
2010-07-20
A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.
Programming parallel architectures - The BLAZE family of languages
Mehrotra, Piyush
1989-01-01
This paper gives an overview of the various approaches to programming multiprocessor architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive, since they remove much of the burden of exploiting parallel architectures from the user. This paper also describes recent work in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described.
Shipman, Galen M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2016-06-13
These are the slides for a presentation on programming models in HPC, at the Los Alamos National Laboratory's Parallel Computing Summer School. The following topics are covered: Flynn's Taxonomy of computer architectures; single instruction single data; single instruction multiple data; multiple instruction multiple data; address space organization; definition of Trinity (Intel Xeon-Phi is a MIMD architecture); single program multiple data; multiple program multiple data; ExMatEx workflow overview; definition of a programming model, programming languages, runtime systems; programming model and environments; MPI (Message Passing Interface); OpenMP; Kokkos (Performance Portable Thread-Parallel Programming Model); Kokkos abstractions, patterns, policies, and spaces; RAJA, a systematic approach to node-level portability and tuning; overview of the Legion Programming Model; mapping tasks and data to hardware resources; interoperability: supporting task-level models; Legion S3D execution and performance details; workflow, integration of external resources into the programming model.
Parallel programming practical aspects, models and current limitations
Tarkov, Mikhail S
2014-01-01
Parallel programming is designed for the use of parallel computer systems for solving time-consuming problems that cannot be solved on a sequential computer in a reasonable time. These problems can be divided into two classes: 1. Processing large data arrays (including processing images and signals in real time)2. Simulation of complex physical processes and chemical reactions For each of these classes, prospective methods are designed for solving problems. For data processing, one of the most promising technologies is the use of artificial neural networks. Particles-in-cell method and cellular automata are very useful for simulation. Problems of scalability of parallel algorithms and the transfer of existing parallel programs to future parallel computers are very acute now. An important task is to optimize the use of the equipment (including the CPU cache) of parallel computers. Along with parallelizing information processing, it is essential to ensure the processing reliability by the relevant organization ...
High Performance Programming Using Explicit Shared Memory Model on Cray T3D1
Simon, Horst D.; Saini, Subhash; Grassi, Charles
1994-01-01
The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under Cray Research adaptive Fortran (CRAFT) model four programming methods (data parallel, work sharing, message-passing using PVM, and explicit shared memory model) are available to the users. However, at this time data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that the performance of neither standard PVM nor CRI s PVM exploits the hardware capabilities of the T3D. The reasons for the bad performance of PVM as a native message-passing library are presented. This is illustrated by the performance of NAS Parallel Benchmarks (NPB) programmed in explicit shared memory model on Cray T3D. In general, the performance of standard PVM is about 4 to 5 times less than obtained by using explicit shared memory model. This degradation in performance is also seen on CM-5 where the performance of applications using native message-passing library CMMD on CM-5 is also about 4 to 5 times less than using data parallel methods. The issues involved (such as barriers, synchronization, invalidating data cache, aligning data cache etc.) while programming in explicit shared memory model are discussed. Comparative performance of NPB using explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, IBM-SP1, etc. is presented.
Static and dynamic load-balancing strategies for parallel reservoir simulation
Anguille, L.; Killough, J.E.; Li, T.M.C.; Toepfer, J.L.
1995-01-01
Accurate simulation of the complex phenomena that occur in flow in porous media can tax even the most powerful serial computers. Emergence of new parallel computer architectures as a future efficient tool in reservoir simulation may overcome this difficulty. Unfortunately, major problems remain to be solved before using parallel computers commercially: production serial programs must be rewritten to be efficient in parallel environments and load balancing methods must be explored to evenly distribute the workload on each processor during the simulation. This study implements both a static load-balancing algorithm and a receiver-initiated dynamic load-sharing algorithm to achieve high parallel efficiencies on both the IBM SP2 and Intel IPSC/860 parallel computers. Significant speedup improvement was recorded for both methods. Further optimization of these algorithms yielded a technique with efficiencies as high as 90% and 70% on 8 and 32 nodes, respectively. The increased performance was the result of the minimization of message-passing overhead
On the Automatic Parallelization of Sparse and Irregular Fortran Programs
Yuan Lin
1999-01-01
Full Text Available Automatic parallelization is usually believed to be less effective at exploiting implicit parallelism in sparse/irregular programs than in their dense/regular counterparts. However, not much is really known because there have been few research reports on this topic. In this work, we have studied the possibility of using an automatic parallelizing compiler to detect the parallelism in sparse/irregular programs. The study with a collection of sparse/irregular programs led us to some common loop patterns. Based on these patterns new techniques were derived that produced good speedups when manually applied to our benchmark codes. More importantly, these parallelization methods can be implemented in a parallelizing compiler and can be applied automatically.
Survey on present status and trend of parallel programming environments
Takemiya, Hiroshi; Higuchi, Kenji; Honma, Ichiro; Ohta, Hirofumi; Kawasaki, Takuji; Imamura, Toshiyuki; Koide, Hiroshi; Akimoto, Masayuki.
1997-03-01
This report intends to provide useful information on software tools for parallel programming through the survey on parallel programming environments of the following six parallel computers, Fujitsu VPP300/500, NEC SX-4, Hitachi SR2201, Cray T94, IBM SP, and Intel Paragon, all of which are installed at Japan Atomic Energy Research Institute (JAERI), moreover, the present status of R and D's on parallel softwares of parallel languages, compilers, debuggers, performance evaluation tools, and integrated tools is reported. This survey has been made as a part of our project of developing a basic software for parallel programming environment, which is designed on the concept of STA (Seamless Thinking Aid to programmers). (author)
The BLAZE language - A parallel language for scientific programming
Mehrotra, Piyush; Van Rosendale, John
1987-01-01
A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.
The BLAZE language: A parallel language for scientific programming
Mehrotra, P.; Vanrosendale, J.
1985-01-01
A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with onceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described and shows how this language would be used in typical scientific programming.
Professional Parallel Programming with C# Master Parallel Extensions with NET 4
Hillar, Gastón
2010-01-01
Expert guidance for those programming today's dual-core processors PCs As PC processors explode from one or two to now eight processors, there is an urgent need for programmers to master concurrent programming. This book dives deep into the latest technologies available to programmers for creating professional parallel applications using C#, .NET 4, and Visual Studio 2010. The book covers task-based programming, coordination data structures, PLINQ, thread pools, asynchronous programming model, and more. It also teaches other parallel programming techniques, such as SIMD and vectorization.Teach
An environment for parallel structuring of Fortran programs
Sridharan, K.; McShea, M.; Denton, C.; Eventoff, B.; Browne, J.C.; Newton, P.; Ellis, M.; Grossbard, D.; Wise, T.; Clemmer, D.
1990-01-01
The paper describes and illustrates an environment for interactive support of the detection and implementation of macro-level parallelism in Fortran programs. The approach couples algorithms for dependence analysis with both innovative techniques for complexity management and capabilities for the measurement and analysis of the parallel computation structures generated through use of the environment. The resulting environment is complementary to the more common approach of seeking local parallelism by loop unrolling, either by an automatic compiler or manually. (orig.)
Programming parallel architectures: The BLAZE family of languages
Mehrotra, Piyush
1988-01-01
Programming multiprocessor architectures is a critical research issue. An overview is given of the various approaches to programming these architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive since they remove much of the burden of exploiting parallel architectures from the user. Also described is recent work by the author in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described, as well as the relations of this work to other current language research projects.
Integrated Task And Data Parallel Programming: Language Design
Grimshaw, Andrew S.; West, Emily A.
1998-01-01
his research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program m. Additional 1995 Activities During the fall I collaborated
Parallel object-oriented specification language
Florescu, O.; Voeten, J.P.M.; Theelen, B.D.; Geilen, M.C.W.; Corporaal, H.; Burns, Alan
2008-01-01
The Parallel Object-Oriented Specification Language (POOSL) is an expressive modelling language for hardware/software systems [10]. It was originally defined in [7] as an object-oriented extension of process algebra CCS [6], supporting (conditional) synchronous message passing between
The FORCE: A highly portable parallel programming language
Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger
1989-01-01
Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory microprocessors, and how a two-level macro preprocessor makes it possible to hide low level machine dependencies and to build machine-independent high level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared memory multiprocessor executing them.
The FORCE - A highly portable parallel programming language
Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger
1989-01-01
This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.
A Programming Environment for Parallel Vision Algorithms
1990-04-11
industrial arm on the market , while the unique head was designed by Rochester’s Computer Science and Mechanical Engineering Departments. 9a 4.1 Introduction...R. Constraining-Unification and the Programming Language Unicorn . In Logic Programming, Functions, Relations, and Equations, Degroot and Lind- strom
Characterizing and Mitigating Work Time Inflation in Task Parallel Programs
Stephen L. Olivier
2013-01-01
Full Text Available Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.
Declarative Parallel Programming in Spreadsheet End-User Development
Biermann, Florian
2016-01-01
Spreadsheets are first-order functional languages and are widely used in research and industry as a tool to conveniently perform all kinds of computations. Because cells on a spreadsheet are immutable, there are possibilities for implicit parallelization of spreadsheet computations. In this liter...... can directly apply results from functional array programming to a spreadsheet model of computations.......Spreadsheets are first-order functional languages and are widely used in research and industry as a tool to conveniently perform all kinds of computations. Because cells on a spreadsheet are immutable, there are possibilities for implicit parallelization of spreadsheet computations....... In this literature study, we provide an overview of the publications on spreadsheet end-user programming and declarative array programming to inform further research on parallel programming in spreadsheets. Our results show that there is a clear overlap between spreadsheet programming and array programming and we...
Parallel adaptation of a vectorised quantumchemical program system
Van Corler, L.C.H.; Van Lenthe, J.H.
1987-01-01
Supercomputers, like the CRAY 1 or the Cyber 205, have had, and still have, a marked influence on Quantum Chemistry. Vectorization has led to a considerable increase in the performance of Quantum Chemistry programs. However, clockcycle times more than a factor 10 smaller than those of the present supercomputers are not to be expected. Therefore future supercomputers will have to depend on parallel structures. Recently, the first examples of such supercomputers have been installed. To be prepared for this new generation of (parallel) supercomputers one should consider the concepts one wants to use and the kind of problems one will encounter during implementation of existing vectorized programs on those parallel systems. The authors implemented four important parts of a large quantumchemical program system (ATMOL), i.e. integrals, SCF, 4-index and Direct-CI in the parallel environment at ECSEC (Rome, Italy). This system offers simulated parallellism on the host computer (IBM 4381) and real parallellism on at most 10 attached processors (FPS-164). Quantumchemical programs usually handle large amounts of data and very large, often sparse matrices. The transfer of that many data can cause problems concerning communication and overhead, in view of which shared memory and shared disks must be considered. The strategy and the tools that were used to parallellise the programs are shown. Also, some examples are presented to illustrate effectiveness and performance of the system in Rome for these type of calculations
Program Transformation to Identify List-Based Parallel Skeletons
Venkatesh Kannan
2016-07-01
Full Text Available Algorithmic skeletons are used as building-blocks to ease the task of parallel programming by abstracting the details of parallel implementation from the developer. Most existing libraries provide implementations of skeletons that are defined over flat data types such as lists or arrays. However, skeleton-based parallel programming is still very challenging as it requires intricate analysis of the underlying algorithm and often uses inefficient intermediate data structures. Further, the algorithmic structure of a given program may not match those of list-based skeletons. In this paper, we present a method to automatically transform any given program to one that is defined over a list and is more likely to contain instances of list-based skeletons. This facilitates the parallel execution of a transformed program using existing implementations of list-based parallel skeletons. Further, by using an existing transformation called distillation in conjunction with our method, we produce transformed programs that contain fewer inefficient intermediate data structures.
Development of massively parallel quantum chemistry program SMASH
Ishimura, Kazuya [Department of Theoretical and Computational Molecular Science, Institute for Molecular Science 38 Nishigo-Naka, Myodaiji, Okazaki, Aichi 444-8585 (Japan)
2015-12-31
A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C{sub 150}H{sub 30}){sub 2} with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.
Development of massively parallel quantum chemistry program SMASH
Ishimura, Kazuya
2015-01-01
A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C 150 H 30 ) 2 with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer
Parallelization for X-ray crystal structural analysis program
Watanabe, Hiroshi [Japan Atomic Energy Research Inst., Tokyo (Japan); Minami, Masayuki; Yamamoto, Akiji
1997-10-01
In this report we study vectorization and parallelization for X-ray crystal structural analysis program. The target machine is NEC SX-4 which is a distributed/shared memory type vector parallel supercomputer. X-ray crystal structural analysis is surveyed, and a new multi-dimensional discrete Fourier transform method is proposed. The new method is designed to have a very long vector length, so that it enables to obtain the 12.0 times higher performance result that the original code. Besides the above-mentioned vectorization, the parallelization by micro-task functions on SX-4 reaches 13.7 times acceleration in the part of multi-dimensional discrete Fourier transform with 14 CPUs, and 3.0 times acceleration in the whole program. Totally 35.9 times acceleration to the original 1CPU scalar version is achieved with vectorization and parallelization on SX-4. (author)
On program restructuring, scheduling, and communication for parallel processor systems
Polychronopoulos, Constantine D. [Univ. of Illinois, Urbana, IL (United States)
1986-08-01
This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler was used to transform programs in a parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented.
Basic design of parallel computational program for probabilistic structural analysis
Kaji, Yoshiyuki; Arai, Taketoshi; Gu, Wenwei; Nakamura, Hitoshi
1999-06-01
In our laboratory, for 'development of damage evaluation method of structural brittle materials by microscopic fracture mechanics and probabilistic theory' (nuclear computational science cross-over research) we examine computational method related to super parallel computation system which is coupled with material strength theory based on microscopic fracture mechanics for latent cracks and continuum structural model to develop new structural reliability evaluation methods for ceramic structures. This technical report is the review results regarding probabilistic structural mechanics theory, basic terms of formula and program methods of parallel computation which are related to principal terms in basic design of computational mechanics program. (author)
Basic design of parallel computational program for probabilistic structural analysis
Kaji, Yoshiyuki; Arai, Taketoshi [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment; Gu, Wenwei; Nakamura, Hitoshi
1999-06-01
In our laboratory, for `development of damage evaluation method of structural brittle materials by microscopic fracture mechanics and probabilistic theory` (nuclear computational science cross-over research) we examine computational method related to super parallel computation system which is coupled with material strength theory based on microscopic fracture mechanics for latent cracks and continuum structural model to develop new structural reliability evaluation methods for ceramic structures. This technical report is the review results regarding probabilistic structural mechanics theory, basic terms of formula and program methods of parallel computation which are related to principal terms in basic design of computational mechanics program. (author)
A Programming Model for Massive Data Parallelism with Data Dependencies
Cui, Xiaohui; Mueller, Frank; Potok, Thomas E.; Zhang, Yongpeng
2009-01-01
Accelerating processors can often be more cost and energy effective for a wide range of data-parallel computing problems than general-purpose processors. For graphics processor units (GPUs), this is particularly the case when program development is aided by environments such as NVIDIA s Compute Unified Device Architecture (CUDA), which dramatically reduces the gap between domain-specific architectures and general purpose programming. Nonetheless, general-purpose GPU (GPGPU) programming remains subject to several restrictions. Most significantly, the separation of host (CPU) and accelerator (GPU) address spaces requires explicit management of GPU memory resources, especially for massive data parallelism that well exceeds the memory capacity of GPUs. One solution to this problem is to transfer data between the GPU and host memories frequently. In this work, we investigate another approach. We run massively data-parallel applications on GPU clusters. We further propose a programming model for massive data parallelism with data dependencies for this scenario. Experience from micro benchmarks and real-world applications shows that our model provides not only ease of programming but also significant performance gains
Optimisation of a parallel ocean general circulation model
M. I. Beare; D. P. Stevens
1997-01-01
International audience; This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by...
Heterogeneous Multicore Parallel Programming for Graphics Processing Units
Francois Bodin
2009-01-01
Full Text Available Hybrid parallel multicore architectures based on graphics processing units (GPUs can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware display a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in a unintrusive manner while preserving the legacy code.
Overview of Parallel Platforms for Common High Performance Computing
T. Fryza
2012-04-01
Full Text Available The paper deals with various parallel platforms used for high performance computing in the signal processing domain. More precisely, the methods exploiting the multicores central processing units such as message passing interface and OpenMP are taken into account. The properties of the programming methods are experimentally proved in the application of a fast Fourier transform and a discrete cosine transform and they are compared with the possibilities of MATLAB's built-in functions and Texas Instruments digital signal processors with very long instruction word architectures. New FFT and DCT implementations were proposed and tested. The implementation phase was compared with CPU based computing methods and with possibilities of the Texas Instruments digital signal processing library on C6747 floating-point DSPs. The optimal combination of computing methods in the signal processing domain and new, fast routines' implementation is proposed as well.
Performance studies of the parallel VIM code
Shi, B.; Blomquist, R.N.
1996-01-01
In this paper, the authors evaluate the performance of the parallel version of the VIM Monte Carlo code on the IBM SPx at the High Performance Computing Research Facility at ANL. Three test problems with contrasting computational characteristics were used to assess effects in performance. A statistical method for estimating the inefficiencies due to load imbalance and communication is also introduced. VIM is a large scale continuous energy Monte Carlo radiation transport program and was parallelized using history partitioning, the master/worker approach, and p4 message passing library. Dynamic load balancing is accomplished when the master processor assigns chunks of histories to workers that have completed a previously assigned task, accommodating variations in the lengths of histories, processor speeds, and worker loads. At the end of each batch (generation), the fission sites and tallies are sent from each worker to the master process, contributing to the parallel inefficiency. All communications are between master and workers, and are serial. The SPx is a scalable 128-node parallel supercomputer with high-performance Omega switches of 63 microsec latency and 35 MBytes/sec bandwidth. For uniform and reproducible performance, they used only the 120 identical regular processors (IBM RS/6000) and excluded the remaining eight planet nodes, which may be loaded by other's jobs
Scientific programming on massively parallel processor CP-PACS
Boku, Taisuke
1998-01-01
The massively parallel processor CP-PACS takes various problems of calculation physics as the object, and it has been designed so that its architecture has been devised to do various numerical processings. In this report, the outline of the CP-PACS and the example of programming in the Kernel CG benchmark in NAS Parallel Benchmarks, version 1, are shown, and the pseudo vector processing mechanism and the parallel processing tuning of scientific and technical computation utilizing the three-dimensional hyper crossbar net, which are two great features of the architecture of the CP-PACS are described. As for the CP-PACS, the PUs based on RISC processor and added with pseudo vector processor are used. Pseudo vector processing is realized as the loop processing by scalar command. The features of the connection net of PUs are explained. The algorithm of the NPB version 1 Kernel CG is shown. The part that takes the time for processing most in the main loop is the product of matrix and vector (matvec), and the parallel processing of the matvec is explained. The time for the computation by the CPU is determined. As the evaluation of the performance, the evaluation of the time for execution, the short vector processing of pseudo vector processor based on slide window, and the comparison with other parallel computers are reported. (K.I.)
Parallel sparse direct solvers for Poisson's equation in streamer discharges
M. Nool (Margreet); M. Genseberger (Menno); U. M. Ebert (Ute)
2017-01-01
textabstractThe aim of this paper is to examine whether a hybrid approach of parallel computing, a combination of the message passing model (MPI) with the threads model (OpenMP) can deliver good performance in streamer discharge simulations. Since one of the bottlenecks of almost all streamer
Performance modeling of parallel algorithms for solving neutron diffusion problems
Azmy, Y.Y.; Kirk, B.L.
1995-01-01
Neutron diffusion calculations are the most common computational methods used in the design, analysis, and operation of nuclear reactors and related activities. Here, mathematical performance models are developed for the parallel algorithm used to solve the neutron diffusion equation on message passing and shared memory multiprocessors represented by the Intel iPSC/860 and the Sequent Balance 8000, respectively. The performance models are validated through several test problems, and these models are used to estimate the performance of each of the two considered architectures in situations typical of practical applications, such as fine meshes and a large number of participating processors. While message passing computers are capable of producing speedup, the parallel efficiency deteriorates rapidly as the number of processors increases. Furthermore, the speedup fails to improve appreciably for massively parallel computers so that only small- to medium-sized message passing multiprocessors offer a reasonable platform for this algorithm. In contrast, the performance model for the shared memory architecture predicts very high efficiency over a wide range of number of processors reasonable for this architecture. Furthermore, the model efficiency of the Sequent remains superior to that of the hypercube if its model parameters are adjusted to make its processors as fast as those of the iPSC/860. It is concluded that shared memory computers are better suited for this parallel algorithm than message passing computers
Parallelism at Cern: real-time and off-line applications in the GP-MIMD2 project
Calafiura, P.
1997-01-01
A wide range of general purpose high-energy physics applications, ranging from Monte Carlo simulation to data acquisition, from interactive data analysis to on-line filtering, have been ported, or developed, and run in parallel on IBM SP-2 and Meiko CS-2 CERN large multi-processor machines. The ESPRIT project GP-MIMD2 has been a catalyst for the interest in parallel computing at CERN. The project provided the 128 processor Meiko CS-2 system that is now succesfully integrated in the CERN computing environment. The CERN experiment NA48 was involved in the GP-MIMD2 project since the beginning. NA48 physicists run, as part of their day-to-day work, simulation and analysis programs parallelized using the message passing interface MPI. The CS-2 is also a vital component of the experiment data acquisition system and will be used to calibrate in real-time the 13000 channels liquid krypton calorimeter. (orig.)
CUBESIM, Hypercube and Denelcor Hep Parallel Computer Simulation
Dunigan, T.H.
1988-01-01
1 - Description of program or function: CUBESIM is a set of subroutine libraries and programs for the simulation of message-passing parallel computers and shared-memory parallel computers. Subroutines are supplied to simulate the Intel hypercube and the Denelcor HEP parallel computers. The system permits a user to develop and test parallel programs written in C or FORTRAN on a single processor. The user may alter such hypercube parameters as message startup times, packet size, and the computation-to-communication ratio. The simulation generates a trace file that can be used for debugging, performance analysis, or graphical display. 2 - Method of solution: The CUBESIM simulator is linked with the user's parallel application routines to run as a single UNIX process. The simulator library provides a small operating system to perform process and message management. 3 - Restrictions on the complexity of the problem: Up to 128 processors can be simulated with a virtual memory limit of 6 million bytes. Up to 1000 processes can be simulated
Final Report: Center for Programming Models for Scalable Parallel Computing
Mellor-Crummey, John [William Marsh Rice University
2011-09-13
As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.
Kemari: A Portable High Performance Fortran System for Distributed Memory Parallel Processors
T. Kamachi
1997-01-01
Full Text Available We have developed a compilation system which extends High Performance Fortran (HPF in various aspects. We support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives give both additional control to the user and simplify the compilation process. For the support of unstructured problems, we provide directives for dynamic data distribution through user-defined mappings. The compiler also allows integration of message-passing interface (MPI primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, we describe the language extensions and related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes.
Object-Oriented Parallel Particle-in-Cell Code for Beam Dynamics Simulation in Linear Accelerators
Qiang, J.; Ryne, R.D.; Habib, S.; Decky, V.
1999-01-01
In this paper, we present an object-oriented three-dimensional parallel particle-in-cell code for beam dynamics simulation in linear accelerators. A two-dimensional parallel domain decomposition approach is employed within a message passing programming paradigm along with a dynamic load balancing. Implementing object-oriented software design provides the code with better maintainability, reusability, and extensibility compared with conventional structure based code. This also helps to encapsulate the details of communications syntax. Performance tests on SGI/Cray T3E-900 and SGI Origin 2000 machines show good scalability of the object-oriented code. Some important features of this code also include employing symplectic integration with linear maps of external focusing elements and using z as the independent variable, typical in accelerators. A successful application was done to simulate beam transport through three superconducting sections in the APT linac design
An efficient implementation of parallel molecular dynamics method on SMP cluster architecture
Suzuki, Masaaki; Okuda, Hiroshi; Yagawa, Genki
2003-01-01
The authors have applied MPI/OpenMP hybrid parallel programming model to parallelize a molecular dynamics (MD) method on a symmetric multiprocessor (SMP) cluster architecture. In that architecture, it can be expected that the hybrid parallel programming model, which uses the message passing library such as MPI for inter-SMP node communication and the loop directive such as OpenMP for intra-SNP node parallelization, is the most effective one. In this study, the parallel performance of the hybrid style has been compared with that of conventional flat parallel programming style, which uses only MPI, both in cases the fast multipole method (FMM) is employed for computing long-distance interactions and that is not employed. The computer environments used here are Hitachi SR8000/MPP placed at the University of Tokyo. The results of calculation are as follows. Without FMM, the parallel efficiency using 16 SMP nodes (128 PEs) is: 90% with the hybrid style, 75% with the flat-MPI style for MD simulation with 33,402 atoms. With FMM, the parallel efficiency using 16 SMP nodes (128 PEs) is: 60% with the hybrid style, 48% with the flat-MPI style for MD simulation with 117,649 atoms. (author)
Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms
Hasanov, Khalid; Quintin, Jean-Noë l; Lastovetsky, Alexey
2014-01-01
-scale parallelism in mind. Indeed, while in 1990s a system with few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to optimization of message-passing parallel
Feedback Driven Annotation and Refactoring of Parallel Programs
Larsen, Per
and communication in embedded programs. Runtime checks are developed to ensure that annotations correctly describe observable program behavior. The performance impact of runtime checking is evaluated on several benchmark kernels and is negligible in all cases. The second aspect is compilation feedback. Annotations...... are not effective unless programmers are told how and when they are benecial. A prototype compilation feedback system was developed in collaboration with IBM Haifa Research Labs. It reports issues that prevent further analysis to the programmer. Performance evaluation shows that three programs performes signicantly......This thesis combines programmer knowledge and feedback to improve modeling and optimization of software. The research is motivated by two observations. First, there is a great need for automatic analysis of software for embedded systems - to expose and model parallelism inherent in programs. Second...
Parallelization of a Monte Carlo particle transport simulation code
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
Parallel-Processing Test Bed For Simulation Software
Blech, Richard; Cole, Gary; Townsend, Scott
1996-01-01
Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
A multitransputer parallel processing system (MTPPS)
Jethra, A.K.; Pande, S.S.; Borkar, S.P.; Khare, A.N.; Ghodgaonkar, M.D.; Bairi, B.R.
1993-01-01
This report describes the design and implementation of a 16 node Multi Transputer Parallel Processing System(MTPPS) which is a platform for parallel program development. It is a MIMD machine based on message passing paradigm. The basic compute engine is an Inmos Transputer Ims T800-20. Transputer with local memory constitutes the processing element (NODE) of this MIMD architecture. Multiple NODES can be connected to each other in an identifiable network topology through the high speed serial links of the transputer. A Network Configuration Unit (NCU) incorporates the necessary hardware to provide software controlled network configuration. System is modularly expandable and more NODES can be added to the system to achieve the required processing power. The system is backend to the IBM-PC which has been integrated into the system to provide user I/O interface. PC resources are available to the programmer. Interface hardware between the PC and the network of transputers is INMOS compatible. Therefore, all the commercially available development software compatible to INMOS products can run on this system. While giving the details of design and implementation, this report briefly summarises MIMD Architectures, Transputer Architecture and Parallel Processing Software Development issues. LINPACK performance evaluation of the system and solutions of neutron physics and plasma physics problem have been discussed along with results. (author). 12 refs., 22 figs., 3 tabs., 3 appendixes
Automatic Migration from PARMACS to MPI in Parallel Fortran Applications
Rolf Hempel
1999-01-01
Full Text Available The PARMACS message passing interface has been in widespread use by application projects, especially in Europe. With the new MPI standard for message passing, many projects face the problem of replacing PARMACS with MPI. An automatic translation tool has been developed which replaces all PARMACS 6.0 calls in an application program with their corresponding MPI calls. In this paper we describe the mapping of the PARMACS programming model onto MPI. We then present some implementation details of the converter tool.
A scalable parallel algorithm for multiple objective linear programs
Wiecek, Malgorzata M.; Zhang, Hong
1994-01-01
This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives specially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.
Parallelization and checkpointing of GPU applications through program transformation
Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
Parallelization of applications for networks with homogeneous and heterogeneous processors
Colombet, L.
1994-01-01
The aim of this thesis is to study and develop efficient methods for parallelization of scientific applications on parallel computers with distributed memory. The first part presents two libraries of PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) communication tools. They allow implementation of programs on most parallel machines, but also on heterogeneous computer networks. This chapter illustrates the problems faced when trying to evaluate performances of networks with heterogeneous processors. To evaluate such performances, the concepts of speed-up and efficiency have been modified and adapted to account for heterogeneity. The second part deals with a study of parallel application libraries such as ScaLAPACK and with the development of communication masking techniques. The general concept is based on communication anticipation, in particular by pipelining message sending operations. Experimental results on Cray T3D and IBM SP1 machines validates the theoretical studies performed on basic algorithms of the libraries discussed above. Two examples of scientific applications are given: the first is a model of young stars for astrophysics and the other is a model of photon trajectories in the Compton effect. (J.S.). 83 refs., 65 figs., 24 tabs
Ashkan Tousimojarad
2013-12-01
Full Text Available We present the Glasgow Parallel Reduction Machine (GPRM, a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.
An object-oriented programming paradigm for parallelization of computational fluid dynamics
Ohta, Takashi.
1997-03-01
We propose an object-oriented programming paradigm for parallelization of scientific computing programs, and show that the approach can be a very useful strategy. Generally, parallelization of scientific programs tends to be complicated and unportable due to the specific requirements of each parallel computer or compiler. In this paper, we show that the object-oriented programming design, which separates the parallel processing parts from the solver of the applications, can achieve the large improvement in the maintenance of the codes, as well as the high portability. We design the program for the two-dimensional Euler equations according to the paradigm, and evaluate the parallel performance on IBM SP2. (author)
Parallel Object-Oriented Computation Applied to a Finite Element Problem
Jon B. Weissman
1993-01-01
Full Text Available The conventional wisdom in the scientific computing community is that the best way to solve large-scale numerically intensive scientific problems on today's parallel MIMD computers is to use Fortran or C programmed in a data-parallel style using low-level message-passing primitives. This approach inevitably leads to nonportable codes and extensive development time, and restricts parallel programming to the domain of the expert programmer. We believe that these problems are not inherent to parallel computing but are the result of the programming tools used. We will show that comparable performance can be achieved with little effort if better tools that present higher level abstractions are used. The vehicle for our demonstration is a 2D electromagnetic finite element scattering code we have implemented in Mentat, an object-oriented parallel processing system. We briefly describe the application. Mentat, the implementation, and present performance results for both a Mentat and a hand-coded parallel Fortran version.
Hybrid programming model for implicit PDE simulations on multicore architectures
Kaushik, Dinesh; Keyes, David E.; Balay, Satish; Smith, Barry F.
2011-01-01
The complexity of programming modern multicore processor based clusters is rapidly rising, with GPUs adding further demand for fine-grained parallelism. This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured mesh CFD code. At the implementation level, the effects of cache locality, update management, work division, and synchronization frequency are studied. The hybrid model presents interesting algorithmic opportunities as well: the convergence of linear system solver is quicker than the pure MPI case since the parallel preconditioner stays stronger when hybrid model is used. This implies significant savings in the cost of communication and synchronization (explicit and implicit). Even though OpenMP based parallelism is easier to implement (with in a subdomain assigned to one MPI process for simplicity), getting good performance needs attention to data partitioning issues similar to those in the message-passing case. © 2011 Springer-Verlag.
Parallel Numerical Simulations of Water Reservoirs
Torres, Pedro; Mangiavacchi, Norberto
2010-11-01
The study of the water flow and scalar transport in water reservoirs is important for the determination of the water quality during the initial stages of the reservoir filling and during the life of the reservoir. For this scope, a parallel 2D finite element code for solving the incompressible Navier-Stokes equations coupled with scalar transport was implemented using the message-passing programming model, in order to perform simulations of hidropower water reservoirs in a computer cluster environment. The spatial discretization is based on the MINI element that satisfies the Babuska-Brezzi (BB) condition, which provides sufficient conditions for a stable mixed formulation. All the distributed data structures needed in the different stages of the code, such as preprocessing, solving and post processing, were implemented using the PETSc library. The resulting linear systems for the velocity and the pressure fields were solved using the projection method, implemented by an approximate block LU factorization. In order to increase the parallel performance in the solution of the linear systems, we employ the static condensation method for solving the intermediate velocity at vertex and centroid nodes separately. We compare performance results of the static condensation method with the approach of solving the complete system. In our tests the static condensation method shows better performance for large problems, at the cost of an increased memory usage. Performance results for other intensive parts of the code in a computer cluster are also presented.
Parallel Framework for Cooperative Processes
Mitică Craus
2005-01-01
Full Text Available This paper describes the work of an object oriented framework designed to be used in the parallelization of a set of related algorithms. The idea behind the system we are describing is to have a re-usable framework for running several sequential algorithms in a parallel environment. The algorithms that the framework can be used with have several things in common: they have to run in cycles and the work should be possible to be split between several "processing units". The parallel framework uses the message-passing communication paradigm and is organized as a master-slave system. Two applications are presented: an Ant Colony Optimization (ACO parallel algorithm for the Travelling Salesman Problem (TSP and an Image Processing (IP parallel algorithm for the Symmetrical Neighborhood Filter (SNF. The implementations of these applications by means of the parallel framework prove to have good performances: approximatively linear speedup and low communication cost.
Hoisie, A.; Lubeck, O.; Wasserman, H.
1998-01-01
The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. They validate the model on three important supercomputer systems, on up to 500 processors. They use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. They also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100-TFLOPS computer systems expected to be in existence within the next decade as part of the ASCI program and elsewhere. In this context, they analyze two problem sizes. The model shows that on the largest such problem (1 billion cells), inter-processor communication performance is not the bottleneck. Single-node efficiency is the dominant factor
User-Defined Data Distributions in High-Level Programming Languages
Diaconescu, Roxana E.; Zima, Hans P.
2006-01-01
One of the characteristic features of today s high performance computing systems is a physically distributed memory. Efficient management of locality is essential for meeting key performance requirements for these architectures. The standard technique for dealing with this issue has involved the extension of traditional sequential programming languages with explicit message passing, in the context of a processor-centric view of parallel computation. This has resulted in complex and error-prone assembly-style codes in which algorithms and communication are inextricably interwoven. This paper presents a high-level approach to the design and implementation of data distributions. Our work is motivated by the need to improve the current parallel programming methodology by introducing a paradigm supporting the development of efficient and reusable parallel code. This approach is currently being implemented in the context of a new programming language called Chapel, which is designed in the HPCS project Cascade.
The CRAFT Fortran Programming Model
Douglas M. Pase
1994-01-01
Full Text Available Many programming models for massively parallel machines exist, and each has its advantages and disadvantages. In this article we present a programming model that combines features from other programming models that (1 can be efficiently implemented on present and future Cray Research massively parallel processor (MPP systems and (2 are useful in constructing highly parallel programs. The model supports several styles of programming: message-passing, data parallel, global address (shared data, and work-sharing. These styles may be combined within the same program. The model includes features that allow a user to define a program in terms of the behavior of the system as a whole, where the behavior of individual tasks is implicit from this systemic definition. (In general, features marked as shared are designed to support this perspective. It also supports an opposite perspective, where a program may be defined in terms of the behaviors of individual tasks, and a program is implicitly the sum of the behaviors of all tasks. (Features marked as private are designed to support this perspective. Users can exploit any combination of either set of features without ambiguity and thus are free to define a program from whatever perspective is most appropriate to the problem at hand.
Real-time trajectory optimization on parallel processors
Psiaki, Mark L.
1993-01-01
A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.
Li Hanyu; Zhou Haijing; Dong Zhiwei; Liao Cheng; Chang Lei; Cao Xiaolin; Xiao Li
2010-01-01
A large-scale parallel electromagnetic field simulation program JEMS-FDTD(J Electromagnetic Solver-Finite Difference Time Domain) is designed and implemented on JASMIN (J parallel Adaptive Structured Mesh applications INfrastructure). This program can simulate propagation, radiation, couple of electromagnetic field by solving Maxwell equations on structured mesh explicitly with FDTD method. JEMS-FDTD is able to simulate billion-mesh-scale problems on thousands of processors. In this article, the program is verified by simulating the radiation of an electric dipole. A beam waveguide is simulated to demonstrate the capability of large scale parallel computation. A parallel performance test indicates that a high parallel efficiency is obtained. (authors)
Ding Yu; Qi Yujin; Zhang Xuezhu; Zhao Cuilan
2011-01-01
In this paper, we report the development of a high-performance image processing platform, which is based on CPU-GPU heterogeneous cluster. Currently, it consists of a Dell Precision T7500 and HP XW8600 workstations with parallel programming and runtime environment, using the message-passing interface (MPI) and CUDA (Compute Unified Device Architecture). We succeeded in developing parallel image processing techniques for 3D image reconstruction of X-ray micro-CT imaging. The results show that a GPU provides a computing efficiency of about 194 times faster than a single CPU, and the CPU-GPU clusters provides a computing efficiency of about 46 times faster than the CPU clusters. These meet the requirements of rapid 3D image reconstruction and real time image display. In conclusion, the use of CPU-GPU heterogeneous cluster is an effective way to build high-performance image processing platform. (authors)
Exploiting variability for energy optimization of parallel programs
Lavrijsen, Wim [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Iancu, Costin [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); de Jong, Wibe [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Chen, Xin [Georgia Inst. of Technology, Atlanta, GA (United States); Schwan, Karsten [Georgia Inst. of Technology, Atlanta, GA (United States)
2016-04-18
Here in this paper we present optimizations that use DVFS mechanisms to reduce the total energy usage in scientific applications. Our main insight is that noise is intrinsic to large scale parallel executions and it appears whenever shared resources are contended. The presence of noise allows us to identify and manipulate any program regions amenable to DVFS. When compared to previous energy optimizations that make per core decisions using predictions of the running time, our scheme uses a qualitative approach to recognize the signature of executions amenable to DVFS. By recognizing the "shape of variability" we can optimize codes with highly dynamic behavior, which pose challenges to all existing DVFS techniques. We validate our approach using offline and online analyses for one-sided and two-sided communication paradigms. We have applied our methods to NWChem, and we show best case improvements in energy use of 12% at no loss in performance when using online optimizations running on 720 Haswell cores with one-sided communication. With NWChem on MPI two-sided and offline analysis, capturing the initialization, we find energy savings of up to 20%, with less than 1% performance cost.
MulticoreBSP for C : A high-performance library for shared-memory parallel programming
Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K.
2014-01-01
The bulk synchronous parallel (BSP) model, as well as parallel programming interfaces based on BSP, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed a MulticoreBSP for Java library specifically for shared-memory architectures. In the
Parallel computing in cluster of GPU applied to a problem of nuclear engineering
Moraes, Sergio Ricardo S.; Heimlich, Adino; Resende, Pedro
2013-01-01
Cluster computing has been widely used as a low cost alternative for parallel processing in scientific applications. With the use of Message-Passing Interface (MPI) protocol development became even more accessible and widespread in the scientific community. A more recent trend is the use of Graphic Processing Unit (GPU), which is a powerful co-processor able to perform hundreds of instructions in parallel, reaching a capacity of hundreds of times the processing of a CPU. However, a standard PC does not allow, in general, more than two GPUs. Hence, it is proposed in this work development and evaluation of a hybrid low cost parallel approach to the solution to a nuclear engineering typical problem. The idea is to use clusters parallelism technology (MPI) together with GPU programming techniques (CUDA - Compute Unified Device Architecture) to simulate neutron transport through a slab using Monte Carlo method. By using a cluster comprised by four quad-core computers with 2 GPU each, it has been developed programs using MPI and CUDA technologies. Experiments, applying different configurations, from 1 to 8 GPUs has been performed and results were compared with the sequential (non-parallel) version. A speed up of about 2.000 times has been observed when comparing the 8-GPU with the sequential version. Results here presented are discussed and analyzed with the objective of outlining gains and possible limitations of the proposed approach. (author)
Development of an MPI benchmark program library
Uehara, Hitoshi
2001-03-01
Distributed parallel simulation software with message passing interfaces has been developed to realize large-scale and high performance numerical simulations. The most popular API for message communication is an MPI. The MPI will be provided on the Earth Simulator. It is known that performance of message communication using the MPI libraries gives a significant influence on a whole performance of simulation programs. We developed an MPI benchmark program library named MBL in order to measure the performance of message communication precisely. The MBL measures the performance of major MPI functions such as point-to-point communications and collective communications and the performance of major communication patterns which are often found in application programs. In this report, the description of the MBL and the performance analysis of the MPI/SX measured on the SX-4 are presented. (author)
Parallel integer sorting with medium and fine-scale parallelism
Dagum, Leonardo
1993-01-01
Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.
Parallel algorithms for numerical linear algebra
van der Vorst, H
1990-01-01
This is the first in a new series of books presenting research results and developments concerning the theory and applications of parallel computers, including vector, pipeline, array, fifth/future generation computers, and neural computers.All aspects of high-speed computing fall within the scope of the series, e.g. algorithm design, applications, software engineering, networking, taxonomy, models and architectural trends, performance, peripheral devices.Papers in Volume One cover the main streams of parallel linear algebra: systolic array algorithms, message-passing systems, algorithms for p
A PARALLEL EXTENSION OF THE UAL ENVIRONMENT
MALITSKY, N.; SHISHLO, A.
2001-01-01
The deployment of the Unified Accelerator Library (UAL) environment on the parallel cluster is presented. The approach is based on the Message-Passing Interface (MPI) library and the Perl adapter that allows one to control and mix together the existing conventional UAL components with the new MPI-based parallel extensions. In the paper, we provide timing results and describe the application of the new environment to the SNS Ring complex beam dynamics studies, particularly, simulations of several physical effects, such as space charge, field errors, fringe fields, and others
A Parallel Workload Model and its Implications for Processor Allocation
1996-11-01
with SEV or AVG, both of which can tolerate c = 0.4 { 0.6 before their performance deteriorates signi cantly. On the other hand, Setia [10] has...Sanjeev. K Setia . The interaction between memory allocation and adaptive partitioning in message-passing multicomputers. In IPPS Workshop on Job...Scheduling Strategies for Parallel Processing, pages 89{99, 1995. [11] Sanjeev K. Setia and Satish K. Tripathi. An analysis of several processor
Application of parallel computing to seismic damage process simulation of an arch dam
Zhong Hong; Lin Gao; Li Jianbo
2010-01-01
The simulation of damage process of high arch dam subjected to strong earthquake shocks is significant to the evaluation of its performance and seismic safety, considering the catastrophic effect of dam failure. However, such numerical simulation requires rigorous computational capacity. Conventional serial computing falls short of that and parallel computing is a fairly promising solution to this problem. The parallel finite element code PDPAD was developed for the damage prediction of arch dams utilizing the damage model with inheterogeneity of concrete considered. Developed with programming language Fortran, the code uses a master/slave mode for programming, domain decomposition method for allocation of tasks, MPI (Message Passing Interface) for communication and solvers from AZTEC library for solution of large-scale equations. Speedup test showed that the performance of PDPAD was quite satisfactory. The code was employed to study the damage process of a being-built arch dam on a 4-node PC Cluster, with more than one million degrees of freedom considered. The obtained damage mode was quite similar to that of shaking table test, indicating that the proposed procedure and parallel code PDPAD has a good potential in simulating seismic damage mode of arch dams. With the rapidly growing need for massive computation emerged from engineering problems, parallel computing will find more and more applications in pertinent areas.
Optimisation of a parallel ocean general circulation model
Beare, M. I.; Stevens, D. P.
1997-10-01
This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
Parallel MCNP Monte Carlo transport calculations with MPI
Wagner, J.C.; Haghighat, A.
1996-01-01
The steady increase in computational performance has made Monte Carlo calculations for large/complex systems possible. However, in order to make these calculations practical, order of magnitude increases in performance are necessary. The Monte Carlo method is inherently parallel (particles are simulated independently) and thus has the potential for near-linear speedup with respect to the number of processors. Further, the ever-increasing accessibility of parallel computers, such as workstation clusters, facilitates the practical use of parallel Monte Carlo. Recognizing the nature of the Monte Carlo method and the trends in available computing, the code developers at Los Alamos National Laboratory implemented the message-passing general-purpose Monte Carlo radiation transport code MCNP (version 4A). The PVM package was chosen by the MCNP code developers because it supports a variety of communication networks, several UNIX platforms, and heterogeneous computer systems. This PVM version of MCNP has been shown to produce speedups that approach the number of processors and thus, is a very useful tool for transport analysis. Due to software incompatibilities on the local IBM SP2, PVM has not been available, and thus it is not possible to take advantage of this useful tool. Hence, it became necessary to implement an alternative message-passing library package into MCNP. Because the message-passing interface (MPI) is supported on the local system, takes advantage of the high-speed communication switches in the SP2, and is considered to be the emerging standard, it was selected
Parallel Libraries to support High-Level Programming
Larsen, Morten Nørgaard
and the Microsoft .NET iv framework. Normally, one would not directly think of the .NET framework when talking scientific applications, but Microsoft has in the last couple of versions of .NET introduce a number of tools for writing parallel and high performance code. The first section examines how programmers can...
On the effective parallel programming of multi-core processors
Varbanescu, A.L.
2010-01-01
Multi-core processors are considered now the only feasible alternative to the large single-core processors which have become limited by technological aspects such as power consumption and heat dissipation. However, due to their inherent parallel structure and their diversity, multi-cores are
Parallelizing AT with MatlabMPI
2011-01-01
The Accelerator Toolbox (AT) is a high-level collection of tools and scripts specifically oriented toward solving problems dealing with computational accelerator physics. It is integrated into the MATLAB environment, which provides an accessible, intuitive interface for accelerator physicists, allowing researchers to focus the majority of their efforts on simulations and calculations, rather than programming and debugging difficulties. Efforts toward parallelization of AT have been put in place to upgrade its performance to modern standards of computing. We utilized the packages MatlabMPI and pMatlab, which were developed by MIT Lincoln Laboratory, to set up a message-passing environment that could be called within MATLAB, which set up the necessary pre-requisites for multithread processing capabilities. On local quad-core CPUs, we were able to demonstrate processor efficiencies of roughly 95% and speed increases of nearly 380%. By exploiting the efficacy of modern-day parallel computing, we were able to demonstrate incredibly efficient speed increments per processor in AT's beam-tracking functions. Extrapolating from prediction, we can expect to reduce week-long computation runtimes to less than 15 minutes. This is a huge performance improvement and has enormous implications for the future computing power of the accelerator physics group at SSRL. However, one of the downfalls of parringpass is its current lack of transparency; the pMatlab and MatlabMPI packages must first be well-understood by the user before the system can be configured to run the scripts. In addition, the instantiation of argument parameters requires internal modification of the source code. Thus, parringpass, cannot be directly run from the MATLAB command line, which detracts from its flexibility and user-friendliness. Future work in AT's parallelization will focus on development of external functions and scripts that can be called from within MATLAB and configured on multiple nodes, while
Adapting high-level language programs for parallel processing using data flow
Standley, Hilda M.
1988-01-01
EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.
Communications oriented programming of parallel iterative solutions of sparse linear systems
Patrick, M. L.; Pratt, T. W.
1986-01-01
Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.
Compiler and Runtime Support for Programming in Adaptive Parallel Environments
1998-10-15
noother job is waiting for resources, and use a smaller number of processors when other jobs needresources. Setia et al. [15, 20] have shown that such...15] Vijay K. Naik, Sanjeev Setia , and Mark Squillante. Performance analysis of job scheduling policiesin parallel supercomputing environments. In...on networks ofheterogeneous workstations. Technical Report CSE-94-012, Oregon Graduate Institute of Scienceand Technology, 1994.[20] Sanjeev Setia
Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model
Hamam, Alwaleed A.
2017-03-13
Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it\\'s time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.
Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model
Hamam, Alwaleed A.; Khan, Ayaz H.
2017-01-01
Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it's time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.
User's guide of parallel program development environment (PPDE). The 2nd edition
Ueno, Hirokazu; Takemiya, Hiroshi; Imamura, Toshiyuki; Koide, Hiroshi; Matsuda, Katsuyuki; Higuchi, Kenji; Hirayama, Toshio; Ohta, Hirofumi
2000-03-01
The STA basic system has been enhanced to accelerate support for parallel programming on heterogeneous parallel computers, through a series of R and D on the technology of parallel processing. The enhancement has been made through extending the function of the PPDF, Parallel Program Development Environment in the STA basic system. The extended PPDE has the function to make: 1) the automatic creation of a 'makefile' and a shell script file for its execution, 2) the multi-tools execution which makes the tools on heterogeneous computers to execute with one operation a task on a computer, and 3) the mirror composition to reflect editing results of a file on a computer into all related files on other computers. These additional functions will enhance the work efficiency for program development on some computers. More functions have been added to the PPDE to provide help for parallel program development. New functions were also designed to complement a HPF translator and a parallelizing support tool when working together so that a sequential program is efficiently converted to a parallel program. This report describes the use of extended PPDE. (author)
Feasibility studies for a high energy physics MC program on massive parallel platforms
Bertolotto, L.M.; Peach, K.J.; Apostolakis, J.; Bruschini, C.E.; Calafiura, P.; Gagliardi, F.; Metcalf, M.; Norton, A.; Panzer-Steindel, B.
1994-01-01
The parallelization of a Monte Carlo program for the NA48 experiment is presented. As a first step, a task farming structure was realized. Based on this, a further step, making use of a distributed database for showers in the electro-magnetic calorimeter, was implemented. Further possibilities for using parallel processing for a quasi-real time calibration of the calorimeter are described
Design considerations for parallel graphics libraries
Crockett, Thomas W.
1994-01-01
Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.
Cell verification of parallel burnup calculation program MCBMPI based on MPI
Yang Wankui; Liu Yaoguang; Ma Jimin; Wang Guanbo; Yang Xin; She Ding
2014-01-01
The parallel burnup calculation program MCBMPI was developed. The program was modularized. The parallel MCNP5 program MCNP5MPI was employed as neutron transport calculation module. And a composite of three solution methods was used to solve burnup equation, i.e. matrix exponential technique, TTA analytical solution, and Gauss Seidel iteration. MPI parallel zone decomposition strategy was concluded in the program. The program system only consists of MCNP5MPI and burnup subroutine. The latter achieves three main functions, i.e. zone decomposition, nuclide transferring and decaying, and data exchanging with MCNP5MPI. Also, the program was verified with the pressurized water reactor (PWR) cell burnup benchmark. The results show that it,s capable to apply the program to burnup calculation of multiple zones, and the computation efficiency could be significantly improved with the development of computer hardware. (authors)
Rizki, Permata Nur Miftahur; Lee, Heezin; Lee, Minsu; Oh, Sangyoon
2017-01-01
With the rapid advance of remote sensing technology, the amount of three-dimensional point-cloud data has increased extraordinarily, requiring faster processing in the construction of digital elevation models. There have been several attempts to accelerate the computation using parallel methods; however, little attention has been given to investigating different approaches for selecting the most suited parallel programming model for a given computing environment. We present our findings and insights identified by implementing three popular high-performance parallel approaches (message passing interface, MapReduce, and GPGPU) on time demanding but accurate kriging interpolation. The performances of the approaches are compared by varying the size of the grid and input data. In our empirical experiment, we demonstrate the significant acceleration by all three approaches compared to a C-implemented sequential-processing method. In addition, we also discuss the pros and cons of each method in terms of usability, complexity infrastructure, and platform limitation to give readers a better understanding of utilizing those parallel approaches for gridding purposes.
76 FR 62808 - Pilot Program for Parallel Review of Medical Products
2011-10-11
... voluntary participation in the pilot program, as well as the guiding principles the Agencies intend to... 57045), parallel review is intended to reduce the time between FDA marketing approval and CMS national...
F-Nets and Software Cabling: Deriving a Formal Model and Language for Portable Parallel Programming
DiNucci, David C.; Saini, Subhash (Technical Monitor)
1998-01-01
Parallel programming is still being based upon antiquated sequence-based definitions of the terms "algorithm" and "computation", resulting in programs which are architecture dependent and difficult to design and analyze. By focusing on obstacles inherent in existing practice, a more portable model is derived here, which is then formalized into a model called Soviets which utilizes a combination of imperative and functional styles. This formalization suggests more general notions of algorithm and computation, as well as insights into the meaning of structured programming in a parallel setting. To illustrate how these principles can be applied, a very-high-level graphical architecture-independent parallel language, called Software Cabling, is described, with many of the features normally expected from today's computer languages (e.g. data abstraction, data parallelism, and object-based programming constructs).
Parallelization of simulation code for liquid-gas model of lattice-gas fluid
Kawai, Wataru; Ebihara, Kenichi; Kume, Etsuo; Watanabe, Tadashi
2000-03-01
A simulation code for hydrodynamical phenomena which is based on the liquid-gas model of lattice-gas fluid is parallelized by using MPI (Message Passing Interface) library. The parallelized code can be applied to the larger size of the simulations than the non-parallelized code. The calculation times of the parallelized code on VPP500 (Vector-Parallel super computer with dispersed memory units), AP3000 (Scalar-parallel server with dispersed memory units), and a workstation cluster decreased in inverse proportion to the number of processors. (author)
PUMA: An Operating System for Massively Parallel Systems
Stephen R. Wheat
1994-01-01
Full Text Available This article presents an overview of PUMA (Performance-oriented, User-managed Messaging Architecture, a message-passing kernel for massively parallel systems. Message passing in PUMA is based on portals – an opening in the address space of an application process. Once an application process has established a portal, other processes can write values into the portal using a simple send operation. Because messages are written directly into the address space of the receiving process, there is no need to buffer messages in the PUMA kernel and later copy them into the applications address space. PUMA consists of two components: the quintessential kernel (Q-Kernel and the process control thread (PCT. Although the PCT provides management decisions, the Q-Kernel controls access and implements the policies specified by the PCT.
Rosa, L.P.
1990-01-01
It is discussed the hole of the Parallel Nuclear Program is Brazil and the feasibility of uranium enrichment for nuclear bomb construction. This program involves two research centers, one belonging to the brazilian navy and another to the aeronautics. Some other brazilian institutes like CTA, IPEN, COPESP and CETEX and also taking part in the program. (A.C.A.S.)
Yang Wankui; Liu Yaoguang; Ma Jimin; Yang Xin; Wang Guanbo
2014-01-01
MCBMPI, a parallelized burnup calculation program, was developed. The program is modularized. Neutron transport calculation module employs the parallelized MCNP5 program MCNP5MPI, and burnup calculation module employs ORIGEN2, with the MPI parallel zone decomposition strategy. The program system only consists of MCNP5MPI and an interface subroutine. The interface subroutine achieves three main functions, i.e. zone decomposition, nuclide transferring and decaying, data exchanging with MCNP5MPI. Also, the program was verified with the Pressurized Water Reactor (PWR) cell burnup benchmark, the results showed that it's capable to apply the program to burnup calculation of multiple zones, and the computation efficiency could be significantly improved with the development of computer hardware. (authors)
Vdebug: debugging tool for parallel scientific programs. Design report on vdebug
Matsuda, Katsuyuki; Takemiya, Hiroshi
2000-02-01
We report on a debugging tool called vdebug which supports debugging work for parallel scientific simulation programs. It is difficult to debug scientific programs with an existing debugger, because the volume of data generated by the programs is too large for users to check data in characters. Usually, the existing debugger shows data values in characters. To alleviate it, we have developed vdebug which enables to check the validity of large amounts of data by showing these data values visually. Although targets of vdebug have been restricted to sequential programs, we have made it applicable to parallel programs by realizing the function of merging and visualizing data distributed on programs on each computer node. Now, vdebug works on seven kinds of parallel computers. In this report, we describe the design of vdebug. (author)
Hoertnagl, Ch.
1997-12-01
Experiments at the future Large Hadron Collider (LHC) at CERN will be faced with an extraordinary challenge of event selection in real time. The primary event rate, equal to the bunch crossing frequency of 40 MHz, will have to be reduced by a factor of almost one-in-a-million in order to reveal traces of rare physics processes from an abundant background. This work presents various contributions to ongoing feasibility studies concerning the possible use of commercial technologies from the proximities of parallel computers and their communication networks for the second trigger stage, which faces an average data input rate of 100 kHz. Studies in this thesis apply a combination of methodologies, namely the build-up of lab-scale prototype implementations (including their exposition to test beam runs), algorithm development, technology tracking and benchmarking, as well as discrete event simulation. The main contribution consists of several technology case studies, which are based on the exploration of a set of standard benchmark programs for revealing simple parameters for characterizing delays during communication. Studied technologies include the communication sub-system of the Meiko CS-2, Asynchronous Transfer Mode (ATM), MEMORY CHANNEL, and Scalable Coherent Interface (SCI); all could be considered typical for candidate technologies. The discussion sheds light on the relative benefits and costs associated with different parallel programming models, in general, and with the use of message-passing libraries, such as Message Passing Interface (MPI), in particular. Best observed end-user-to-end-user latencies were ∼ 10 μs, best asymptotic bandwidths were ∼ 70 MByte/s. Typical sub-patterns of communication that have to be applied in the second trigger stage were sustained at ∼ 13 kHz, using today's technologies in realistic embeddings. (author)
User's guide of parallel program development environment (PPDE). The 2nd edition
Ueno, Hirokazu; Takemiya, Hiroshi; Imamura, Toshiyuki; Koide, Hiroshi; Matsuda, Katsuyuki; Higuchi, Kenji; Hirayama, Toshio [Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Tokyo (Japan); Ohta, Hirofumi [Hitachi Ltd., Tokyo (Japan)
2000-03-01
The STA basic system has been enhanced to accelerate support for parallel programming on heterogeneous parallel computers, through a series of R and D on the technology of parallel processing. The enhancement has been made through extending the function of the PPDF, Parallel Program Development Environment in the STA basic system. The extended PPDE has the function to make: 1) the automatic creation of a 'makefile' and a shell script file for its execution, 2) the multi-tools execution which makes the tools on heterogeneous computers to execute with one operation a task on a computer, and 3) the mirror composition to reflect editing results of a file on a computer into all related files on other computers. These additional functions will enhance the work efficiency for program development on some computers. More functions have been added to the PPDE to provide help for parallel program development. New functions were also designed to complement a HPF translator and a paralleilizing support tool when working together so that a sequential program is efficiently converted to a parallel program. This report describes the use of extended PPDE. (author)
User's guide of parallel program development environment (PPDE). The 2nd edition
Ueno, Hirokazu; Takemiya, Hiroshi; Imamura, Toshiyuki; Koide, Hiroshi; Matsuda, Katsuyuki; Higuchi, Kenji; Hirayama, Toshio [Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Tokyo (Japan); Ohta, Hirofumi [Hitachi Ltd., Tokyo (Japan)
2000-03-01
The STA basic system has been enhanced to accelerate support for parallel programming on heterogeneous parallel computers, through a series of R and D on the technology of parallel processing. The enhancement has been made through extending the function of the PPDF, Parallel Program Development Environment in the STA basic system. The extended PPDE has the function to make: 1) the automatic creation of a 'makefile' and a shell script file for its execution, 2) the multi-tools execution which makes the tools on heterogeneous computers to execute with one operation a task on a computer, and 3) the mirror composition to reflect editing results of a file on a computer into all related files on other computers. These additional functions will enhance the work efficiency for program development on some computers. More functions have been added to the PPDE to provide help for parallel program development. New functions were also designed to complement a HPF translator and a paralleilizing support tool when working together so that a sequential program is efficiently converted to a parallel program. This report describes the use of extended PPDE. (author)
A Parallel Encryption Algorithm Based on Piecewise Linear Chaotic Map
Xizhong Wang
2013-01-01
Full Text Available We introduce a parallel chaos-based encryption algorithm for taking advantage of multicore processors. The chaotic cryptosystem is generated by the piecewise linear chaotic map (PWLCM. The parallel algorithm is designed with a master/slave communication model with the Message Passing Interface (MPI. The algorithm is suitable not only for multicore processors but also for the single-processor architecture. The experimental results show that the chaos-based cryptosystem possesses good statistical properties. The parallel algorithm provides much better performance than the serial ones and would be useful to apply in encryption/decryption file with large size or multimedia.
CALTRANS: A parallel, deterministic, 3D neutronics code
Carson, L.; Ferguson, J.; Rogers, J.
1994-04-01
Our efforts to parallelize the deterministic solution of the neutron transport equation has culminated in a new neutronics code CALTRANS, which has full 3D capability. In this article, we describe the layout and algorithms of CALTRANS and present performance measurements of the code on a variety of platforms. Explicit implementation of the parallel algorithms of CALTRANS using both the function calls of the Parallel Virtual Machine software package (PVM 3.2) and the Meiko CS-2 tagged message passing library (based on the Intel NX/2 interface) are provided in appendices.
Stankovski, Z.
1995-01-01
The collision probability method in neutron transport, as applied to 2D geometries, consume a great amount of computer time, for a typical 2D assembly calculation evaluations. Consequently RZ or 3D calculations became prohibitive. In this paper we present a simple but efficient parallel algorithm based on the message passing host/node programing model. Parallelization was applied to the energy group treatment. Such approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, witch is a necessary condition for a industrial code. Sequential performances are also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SP1 and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SP1. Because of heterogeneity of the workstation network, we did ask high performances for this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors. (author). 5 refs., 6 figs., 2 tabs
Xing Cai
2005-01-01
Full Text Available This article addresses the performance of scientific applications that use the Python programming language. First, we investigate several techniques for improving the computational efficiency of serial Python codes. Then, we discuss the basic programming techniques in Python for parallelizing serial scientific applications. It is shown that an efficient implementation of the array-related operations is essential for achieving good parallel performance, as for the serial case. Once the array-related operations are efficiently implemented, probably using a mixed-language implementation, good serial and parallel performance become achievable. This is confirmed by a set of numerical experiments. Python is also shown to be well suited for writing high-level parallel programs.
Differences Between Distributed and Parallel Systems
Brightwell, R.; Maccabe, A.B.; Rissen, R.
1998-10-01
Distributed systems have been studied for twenty years and are now coming into wider use as fast networks and powerful workstations become more readily available. In many respects a massively parallel computer resembles a network of workstations and it is tempting to port a distributed operating system to such a machine. However, there are significant differences between these two environments and a parallel operating system is needed to get the best performance out of a massively parallel system. This report characterizes the differences between distributed systems, networks of workstations, and massively parallel systems and analyzes the impact of these differences on operating system design. In the second part of the report, we introduce Puma, an operating system specifically developed for massively parallel systems. We describe Puma portals, the basic building blocks for message passing paradigms implemented on top of Puma, and show how the differences observed in the first part of the report have influenced the design and implementation of Puma.
Process-Oriented Parallel Programming with an Application to Data-Intensive Computing
Givelberg, Edward
2014-01-01
We introduce process-oriented programming as a natural extension of object-oriented programming for parallel computing. It is based on the observation that every class of an object-oriented language can be instantiated as a process, accessible via a remote pointer. The introduction of process pointers requires no syntax extension, identifies processes with programming objects, and enables processes to exchange information simply by executing remote methods. Process-oriented programming is a h...
Programming a massively parallel, computation universal system: static behavior
Lapedes, A.; Farber, R.
1986-01-01
In previous work by the authors, the ''optimum finding'' properties of Hopfield neural nets were applied to the nets themselves to create a ''neural compiler.'' This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called ''hidden neurons''), and thus deals with nets capable of universal computation. 22 refs.
Concurrent Programming Using Actors: Exploiting Large-Scale Parallelism,
1985-10-07
ORGANIZATION NAME AND ADDRESS 10. PROGRAM ELEMENT. PROJECT. TASK* Artificial Inteligence Laboratory AREA Is WORK UNIT NUMBERS 545 Technology Square...D-R162 422 CONCURRENT PROGRMMIZNG USING f"OS XL?ITP TEH l’ LARGE-SCALE PARALLELISH(U) NASI AC E Al CAMBRIDGE ARTIFICIAL INTELLIGENCE L. G AGHA ET AL...RESOLUTION TEST CHART N~ATIONAL BUREAU OF STANDA.RDS - -96 A -E. __ _ __ __’ .,*- - -- •. - MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL
COMPSs-Mobile: parallel programming for mobile-cloud computing
Lordan Gomis, Francesc-Josep; Badia Sala, Rosa Maria
2016-01-01
The advent of Cloud and the popularization of mobile devices have led us to a shift in computing access. Computing users will have an interaction display while the real computation will be performed remotely, in the Cloud. COMPSs-Mobile is a framework that aims to ease the development of energy-efficient and high-performing applications for this environment. The framework provides an infrastructure-unaware programming model that allows developers to code regular Android applications that, ...
Compiling the parallel programming language NestStep to the CELL processor
Holm, Magnus
2010-01-01
The goal of this project is to create a source-to-source compiler which will translate NestStep code to C code. The compiler's job is to replace NestStep constructs with a series of function calls to the NestStep runtime system. NestStep is a parallel programming language extension based on the BSP model. It adds constructs for parallel programming on top of an imperative programming language. For this project, only constructs extending the C language are relevant. The output code will compil...
76 FR 66309 - Pilot Program for Parallel Review of Medical Products; Correction
2011-10-26
... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Medicare and Medicaid Services [CMS-3180-N2] Food and Drug Administration [Docket No. FDA-2010-N-0308] Pilot Program for Parallel Review of Medical... 11, 2011 (76 FR 62808). The document announced a pilot program for sponsors of innovative device...
A parallel implementation of 3-d CT image reconstruction on a hypercube multiprocessor
Chen, C.M.; Lee, S.Y.; Cho, Z.H.
1990-01-01
In this paper, the authors describe how image reconstruction in computerized tomography (CT) can be parallelized on a message-passing multiprocessor. In particular, the results obtained from parallel implementation of 3-D CT image reconstruction for parallel beam geometries on the Intel hypercube, iPSC/2, are presented. A two stage pipelining approach is employed for filtering (convolution) and backprojection. The conventional sequential convolution algorithm is modified such that the symmetry of the filter kernel is fully utilized for parallelization. In the backprojection stage, the 3-D incremental algorithm, the authors' recently developed backprojection scheme which is shown to be faster than conventional algorithm, is parallelized
A Tool for Performance Modeling of Parallel Programs
J.A. González
2003-01-01
Full Text Available Current performance prediction analytical models try to characterize the performance behavior of actual machines through a small set of parameters. In practice, substantial deviations are observed. These differences are due to factors as memory hierarchies or network latency. A natural approach is to associate a different proportionality constant with each basic block, and analogously, to associate different latencies and bandwidths with each "communication block". Unfortunately, to use this approach implies that the evaluation of parameters must be done for each algorithm. This is a heavy task, implying experiment design, timing, statistics, pattern recognition and multi-parameter fitting algorithms. Software support is required. We present a compiler that takes as source a C program annotated with complexity formulas and produces as output an instrumented code. The trace files obtained from the execution of the resulting code are analyzed with an interactive interpreter, giving us, among other information, the values of those parameters.
A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison.
Martins, W S; Del Cuvillo, J B; Useche, F J; Theobald, K B; Gao, G R
2001-01-01
This paper discusses the issues involved in implementing a dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform based on a fine-grain event-driven multithreaded program execution model. Fine-grain multithreading permits efficient parallelism exploitation in this application both by taking advantage of asynchronous point-to-point synchronizations and communication with low overheads and by effectively tolerating latency through the overlapping of computation and communication. We have implemented our scheme on EARTH, a fine-grain event-driven multithreaded execution and architecture model which has been ported to a number of parallel machines with off-the-shelf processors. Our experimental results show that the dynamic programming algorithm can be efficiently implemented on EARTH systems with high performance (e.g., speedup of 90 on 120 nodes), good programmability and reasonable cost.
High performance parallelism pearls 2 multicore and many-core programming approaches
Jeffers, Jim
2015-01-01
High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Similar to Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming - illustrating the most effective ways to combine Xeon Phi coprocessors with Xeon and other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as biomed, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work includes detailed explanations of t
Tsuji, Masashi; Chiba, Gou
2000-01-01
A hierarchical domain decomposition boundary element method (HDD-BEM) for solving the multiregion neutron diffusion equation (NDE) has been fully parallelized, both for numerical computations and for data communications, to accomplish a high parallel efficiency on distributed memory message passing parallel computers. Data exchanges between node processors that are repeated during iteration processes of HDD-BEM are implemented, without any intervention of the host processor that was used to supervise parallel processing in the conventional parallelized HDD-BEM (P-HDD-BEM). Thus, the parallel processing can be executed with only cooperative operations of node processors. The communication overhead was even the dominant time consuming part in the conventional P-HDD-BEM, and the parallelization efficiency decreased steeply with the increase of the number of processors. With the parallel data communication, the efficiency is affected only by the number of boundary elements assigned to decomposed subregions, and the communication overhead can be drastically reduced. This feature can be particularly advantageous in the analysis of three-dimensional problems where a large number of processors are required. The proposed P-HDD-BEM offers a promising solution to the deterioration problem of parallel efficiency and opens a new path to parallel computations of NDEs on distributed memory message passing parallel computers. (author)
Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P
2014-10-30
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work stealing runtime, and report performance results for long-range HF energy calculation of large molecule/high quality basis running on up to 1024 cores of a high performance cluster machine. Copyright © 2014 Wiley Periodicals, Inc.
Nash, T.; Areti, H.; Atac, R.
1988-08-01
Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs
Bellucci, Michael A; Coker, David F
2011-07-28
We describe a new method for constructing empirical valence bond potential energy surfaces using a parallel multilevel genetic program (PMLGP). Genetic programs can be used to perform an efficient search through function space and parameter space to find the best functions and sets of parameters that fit energies obtained by ab initio electronic structure calculations. Building on the traditional genetic program approach, the PMLGP utilizes a hierarchy of genetic programming on two different levels. The lower level genetic programs are used to optimize coevolving populations in parallel while the higher level genetic program (HLGP) is used to optimize the genetic operator probabilities of the lower level genetic programs. The HLGP allows the algorithm to dynamically learn the mutation or combination of mutations that most effectively increase the fitness of the populations, causing a significant increase in the algorithm's accuracy and efficiency. The algorithm's accuracy and efficiency is tested against a standard parallel genetic program with a variety of one-dimensional test cases. Subsequently, the PMLGP is utilized to obtain an accurate empirical valence bond model for proton transfer in 3-hydroxy-gamma-pyrone in gas phase and protic solvent. © 2011 American Institute of Physics
Colombet, L
1994-10-07
The aim of this thesis is to study and develop efficient methods for parallelization of scientific applications on parallel computers with distributed memory. The first part presents two libraries of PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) communication tools. They allow implementation of programs on most parallel machines, but also on heterogeneous computer networks. This chapter illustrates the problems faced when trying to evaluate performances of networks with heterogeneous processors. To evaluate such performances, the concepts of speed-up and efficiency have been modified and adapted to account for heterogeneity. The second part deals with a study of parallel application libraries such as ScaLAPACK and with the development of communication masking techniques. The general concept is based on communication anticipation, in particular by pipelining message sending operations. Experimental results on Cray T3D and IBM SP1 machines validates the theoretical studies performed on basic algorithms of the libraries discussed above. Two examples of scientific applications are given: the first is a model of young stars for astrophysics and the other is a model of photon trajectories in the Compton effect. (J.S.). 83 refs., 65 figs., 24 tabs.
Goldstein, David
1991-01-01
Extensions to an architecture for real-time, distributed (parallel) knowledge-based systems called the Parallel Real-time Artificial Intelligence System (PRAIS) are discussed. PRAIS strives for transparently parallelizing production (rule-based) systems, even under real-time constraints. PRAIS accomplished these goals (presented at the first annual C Language Integrated Production System (CLIPS) conference) by incorporating a dynamic task scheduler, operating system extensions for fact handling, and message-passing among multiple copies of CLIPS executing on a virtual blackboard. This distributed knowledge-based system tool uses the portability of CLIPS and common message-passing protocols to operate over a heterogeneous network of processors. Results using the original PRAIS architecture over a network of Sun 3's, Sun 4's and VAX's are presented. Mechanisms using the producer-consumer model to extend the architecture for fault-tolerance and distributed truth maintenance initiation are also discussed.
Parallel programming of saccades during natural scene viewing: evidence from eye movement positions.
Wu, Esther X W; Gilani, Syed Omer; van Boxtel, Jeroen J A; Amihai, Ido; Chua, Fook Kee; Yen, Shih-Cheng
2013-10-24
Previous studies have shown that saccade plans during natural scene viewing can be programmed in parallel. This evidence comes mainly from temporal indicators, i.e., fixation durations and latencies. In the current study, we asked whether eye movement positions recorded during scene viewing also reflect parallel programming of saccades. As participants viewed scenes in preparation for a memory task, their inspection of the scene was suddenly disrupted by a transition to another scene. We examined whether saccades after the transition were invariably directed immediately toward the center or were contingent on saccade onset times relative to the transition. The results, which showed a dissociation in eye movement behavior between two groups of saccades after the scene transition, supported the parallel programming account. Saccades with relatively long onset times (>100 ms) after the transition were directed immediately toward the center of the scene, probably to restart scene exploration. Saccades with short onset times (programming of saccades during scene viewing. Additionally, results from the analyses of intersaccadic intervals were also consistent with the parallel programming hypothesis.
Pic, Marc Michel
1995-01-01
Parallel programming covers task-parallelism and data-parallelism. Many problems need both parallelisms. Multi-SIMD computers allow hierarchical approach of these parallelisms. The T++ language, based on C++, is dedicated to exploit Multi-SIMD computers using a programming paradigm which is an extension of array-programming to tasks managing. Our language introduced array of independent tasks to achieve separately (MIMD), on subsets of processors of identical behaviour (SIMD), in order to translate the hierarchical inclusion of data-parallelism in task-parallelism. To manipulate in a symmetrical way tasks and data we propose meta-operations which have the same behaviour on tasks arrays and on data arrays. We explain how to implement this language on our parallel computer SYMPHONIE in order to profit by the locally-shared memory, by the hardware virtualization, and by the multiplicity of communications networks. We analyse simultaneously a typical application of such architecture. Finite elements scheme for Fluid mechanic needs powerful parallel computers and requires large floating points abilities. Lattice gases is an alternative to such simulations. Boolean lattice bases are simple, stable, modular, need to floating point computation, but include numerical noise. Boltzmann lattice gases present large precision of computation, but needs floating points and are only locally stable. We propose a new scheme, called multi-bit, who keeps the advantages of each boolean model to which it is applied, with large numerical precision and reduced noise. Experiments on viscosity, physical behaviour, noise reduction and spurious invariants are shown and implementation techniques for parallel Multi-SIMD computers detailed. (author) [fr
How to Build an AppleSeed: A Parallel Macintosh Cluster for Numerically Intensive Computing
Decyk, V. K.; Dauger, D. E.
We have constructed a parallel cluster consisting of a mixture of Apple Macintosh G3 and G4 computers running the Mac OS, and have achieved very good performance on numerically intensive, parallel plasma particle-incell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. This enables us to move parallel computing from the realm of experts to the main stream of computing.
Parallel simulated annealing algorithms for cell placement on hypercube multiprocessors
Banerjee, Prithviraj; Jones, Mark Howard; Sargent, Jeff S.
1990-01-01
Two parallel algorithms for standard cell placement using simulated annealing are developed to run on distributed-memory message-passing hypercube multiprocessors. The cells can be mapped in a two-dimensional area of a chip onto processors in an n-dimensional hypercube in two ways, such that both small and large cell exchange and displacement moves can be applied. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support the parallel cost evaluation. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. A dynamic parallel annealing schedule estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control.
Parallelization of ITOUGH2 using PVM
Finsterle, Stefan
1998-01-01
ITOUGH2 inversions are computationally intensive because the forward problem must be solved many times to evaluate the objective function for different parameter combinations or to numerically calculate sensitivity coefficients. Most of these forward runs are independent from each other and can therefore be performed in parallel. Message passing based on the Parallel Virtual Machine (PVM) system has been implemented into ITOUGH2 to enable parallel processing of ITOUGH2 jobs on a heterogeneous network of Unix workstations. This report describes the PVM system and its implementation into ITOUGH2. Instructions are given for installing PVM, compiling ITOUGH2-PVM for use on a workstation cluster, the preparation of an 1.TOUGH2 input file under PVM, and the execution of an ITOUGH2-PVM application. Examples are discussed, demonstrating the use of ITOUGH2-PVM
A 3D gyrokinetic particle-in-cell simulation of fusion plasma microturbulence on parallel computers
Williams, T. J.
1992-12-01
One of the grand challenge problems now supported by HPCC is the Numerical Tokamak Project. A goal of this project is the study of low-frequency micro-instabilities in tokamak plasmas, which are believed to cause energy loss via turbulent thermal transport across the magnetic field lines. An important tool in this study is gyrokinetic particle-in-cell (PIC) simulation. Gyrokinetic, as opposed to fully-kinetic, methods are particularly well suited to the task because they are optimized to study the frequency and wavelength domain of the microinstabilities. Furthermore, many researchers now employ low-noise delta(f) methods to greatly reduce statistical noise by modelling only the perturbation of the gyrokinetic distribution function from a fixed background, not the entire distribution function. In spite of the increased efficiency of these improved algorithms over conventional PIC algorithms, gyrokinetic PIC simulations of tokamak micro-turbulence are still highly demanding of computer power--even fully-vectorized codes on vector supercomputers. For this reason, we have worked for several years to redevelop these codes on massively parallel computers. We have developed 3D gyrokinetic PIC simulation codes for SIMD and MIMD parallel processors, using control-parallel, data-parallel, and domain-decomposition message-passing (DDMP) programming paradigms. This poster summarizes our earlier work on codes for the Connection Machine and BBN TC2000 and our development of a generic DDMP code for distributed-memory parallel machines. We discuss the memory-access issues which are of key importance in writing parallel PIC codes, with special emphasis on issues peculiar to gyrokinetic PIC. We outline the domain decompositions in our new DDMP code and discuss the interplay of different domain decompositions suited for the particle-pushing and field-solution components of the PIC algorithm.
A Parallel, Finite-Volume Algorithm for Large-Eddy Simulation of Turbulent Flows
Bui, Trong T.
1999-01-01
A parallel, finite-volume algorithm has been developed for large-eddy simulation (LES) of compressible turbulent flows. This algorithm includes piecewise linear least-square reconstruction, trilinear finite-element interpolation, Roe flux-difference splitting, and second-order MacCormack time marching. Parallel implementation is done using the message-passing programming model. In this paper, the numerical algorithm is described. To validate the numerical method for turbulence simulation, LES of fully developed turbulent flow in a square duct is performed for a Reynolds number of 320 based on the average friction velocity and the hydraulic diameter of the duct. Direct numerical simulation (DNS) results are available for this test case, and the accuracy of this algorithm for turbulence simulations can be ascertained by comparing the LES solutions with the DNS results. The effects of grid resolution, upwind numerical dissipation, and subgrid-scale dissipation on the accuracy of the LES are examined. Comparison with DNS results shows that the standard Roe flux-difference splitting dissipation adversely affects the accuracy of the turbulence simulation. For accurate turbulence simulations, only 3-5 percent of the standard Roe flux-difference splitting dissipation is needed.
Nakanishi, Sen-ichiro; Ishida, Hideaki; Himei, Toyoji
1981-01-01
The systematic analytical method is reqUired for the ac phase control circuit by means of an inverse parallel thyristor pair which has a series and parallel L-C resonant load, because the phase control action causes abnormal and interesting phenomena, such as an extreme increase of voltage and current, an unique increase and decrease of contained higher harmonics, and a wide variation of power factor, etc. In this paper, the program for the analysis of the thyristor phase control circuit with...
Academic training: From Evolution Theory to Parallel and Distributed Genetic Programming
2007-01-01
2006-2007 ACADEMIC TRAINING PROGRAMME LECTURE SERIES 15, 16 March From 11:00 to 12:00 - Main Auditorium, bldg. 500 From Evolution Theory to Parallel and Distributed Genetic Programming F. FERNANDEZ DE VEGA / Univ. of Extremadura, SP Lecture No. 1: From Evolution Theory to Evolutionary Computation Evolutionary computation is a subfield of artificial intelligence (more particularly computational intelligence) involving combinatorial optimization problems, which are based to some degree on the evolution of biological life in the natural world. In this tutorial we will review the source of inspiration for this metaheuristic and its capability for solving problems. We will show the main flavours within the field, and different problems that have been successfully solved employing this kind of techniques. Lecture No. 2: Parallel and Distributed Genetic Programming The successful application of Genetic Programming (GP, one of the available Evolutionary Algorithms) to optimization problems has encouraged an ...
Parallel computing and networking; Heiretsu keisanki to network
Asakawa, E; Tsuru, T [Japan National Oil Corp., Tokyo (Japan); Matsuoka, T [Japan Petroleum Exploration Co. Ltd., Tokyo (Japan)
1996-05-01
This paper describes the trend of parallel computers used in geophysical exploration. Around 1993 was the early days when the parallel computers began to be used for geophysical exploration. Classification of these computers those days was mainly MIMD (multiple instruction stream, multiple data stream), SIMD (single instruction stream, multiple data stream) and the like. Parallel computers were publicized in the 1994 meeting of the Geophysical Exploration Society as a `high precision imaging technology`. Concerning the library of parallel computers, there was a shift to PVM (parallel virtual machine) in 1993 and to MPI (message passing interface) in 1995. In addition, the compiler of FORTRAN90 was released with support implemented for data parallel and vector computers. In 1993, networks used were Ethernet, FDDI, CDDI and HIPPI. In 1995, the OC-3 products under ATM began to propagate. However, ATM remains to be an interoffice high speed network because the ATM service has not spread yet for the public network. 1 ref.
Förster, Michael
2014-01-01
Numerical programs often use parallel programming techniques such as OpenMP to compute the program's output values as efficient as possible. In addition, derivative values of these output values with respect to certain input values play a crucial role. To achieve code that computes not only the output values simultaneously but also the derivative values, this work introduces several source-to-source transformation rules. These rules are based on a technique called algorithmic differentiation. The main focus of this work lies on the important reverse mode of algorithmic differentiation. The inh
Run-Time and Compiler Support for Programming in Adaptive Parallel Environments
Guy Edjlali
1997-01-01
Full Text Available For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at run-time. In this article, we discuss run-time support for data-parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a run-time library to provide this support. We discuss how the run-time library can be used by compilers of high-performance Fortran (HPF-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of nondedicated workstations, which are likely to be an important resource for parallel programming in the future.
Towards Interactive Visual Exploration of Parallel Programs using a Domain-Specific Language
Klein, Tobias
2016-04-19
The use of GPUs and the massively parallel computing paradigm have become wide-spread. We describe a framework for the interactive visualization and visual analysis of the run-time behavior of massively parallel programs, especially OpenCL kernels. This facilitates understanding a program\\'s function and structure, finding the causes of possible slowdowns, locating program bugs, and interactively exploring and visually comparing different code variants in order to improve performance and correctness. Our approach enables very specific, user-centered analysis, both in terms of the recording of the run-time behavior and the visualization itself. Instead of having to manually write instrumented code to record data, simple code annotations tell the source-to-source compiler which code instrumentation to generate automatically. The visualization part of our framework then enables the interactive analysis of kernel run-time behavior in a way that can be very specific to a particular problem or optimization goal, such as analyzing the causes of memory bank conflicts or understanding an entire parallel algorithm.
Lee, Soon-Hwan; Chino, Masamichi
2000-01-01
The coupling between atmosphere and ocean model has physical and computational difficulties for short-term forecasting of weather and ocean current. In this research, a combination system between high-resolution meso-scale atmospheric model and ocean model has been constructed using a new message-passing library, called Stampi (Seamless Thinking Aid Message Passing Interface), for prediction of particle dispersion at emergency nuclear accident. Stampi, which is based on the MPI (Message Passing Interface) 2 specification, makes us carry out parallel calculations of combination system without parallelization skill to model code. And it realizes dynamic process creation on different machines and communication between spawned one within the scope of MPI semantics. The models included in this combination system are PHYSIC as an atmosphere model, and POM (Princeton Ocean Model) as an ocean model. We applied this combination system to predict sea surface current at Sea of Japan in winter season. Simulation results indicate that the wind stress near the sea surface tends to be a predominant factor to determine surface ocean currents and dispersion of radioactive contamination in the ocean. The surface ocean current is well correspondent with wind direction, induced by high mountains at North Korea. The satellite data of NSCAT (NASA-SCATterometer), which is an image of sea surface current, also agrees well with the results of this system. (author)
Hinchey, Michael G.; Rash, James L.; Rouff, Christopher A.
2005-01-01
The manual application of formal methods in system specification has produced successes, but in the end, despite any claims and assertions by practitioners, there is no provable relationship between a manually derived system specification or formal model and the customer's original requirements. Complex parallel and distributed system present the worst case implications for today s dearth of viable approaches for achieving system dependability. No avenue other than formal methods constitutes a serious contender for resolving the problem, and so recognition of requirements-based programming has come at a critical juncture. We describe a new, NASA-developed automated requirement-based programming method that can be applied to certain classes of systems, including complex parallel and distributed systems, to achieve a high degree of dependability.
Dung, Phan Anh; Hansen, Michael Reichhardt
2015-01-01
In this paper we investigate multicore parallelism in the context of functional programming by means of two quantifier-elimination procedures for Presburger Arithmetic: one is based on Cooper’s algorithm and the other is based on the Omega Test. We first develop correct-by-construction prototype...... platform executing on an 8-core machine. A speedup of approximately 4 was obtained for Cooper’s algorithm and a speedup of approximately 6 was obtained for the exact-shadow part of the Omega Test. The considered procedures are complex, memory-intense algorithms on huge formula trees and the case study...... reveals more general applicable techniques and guideline for deriving parallel algorithms from sequential ones in the context of data-intensive tree algorithms. The obtained insights should apply for any strict and impure functional programming language. Furthermore, the results obtained for the exact...
Parallelization and implementation of approximate root isolation for nonlinear system by Monte Carlo
Khosravi, Ebrahim
1998-12-01
This dissertation solves a fundamental problem of isolating the real roots of nonlinear systems of equations by Monte-Carlo that were published by Bush Jones. This algorithm requires only function values and can be applied readily to complicated systems of transcendental functions. The implementation of this sequential algorithm provides scientists with the means to utilize function analysis in mathematics or other fields of science. The algorithm, however, is so computationally intensive that the system is limited to a very small set of variables, and this will make it unfeasible for large systems of equations. Also a computational technique was needed for investigating a metrology of preventing the algorithm structure from converging to the same root along different paths of computation. The research provides techniques for improving the efficiency and correctness of the algorithm. The sequential algorithm for this technique was corrected and a parallel algorithm is presented. This parallel method has been formally analyzed and is compared with other known methods of root isolation. The effectiveness, efficiency, enhanced overall performance of the parallel processing of the program in comparison to sequential processing is discussed. The message passing model was used for this parallel processing, and it is presented and implemented on Intel/860 MIMD architecture. The parallel processing proposed in this research has been implemented in an ongoing high energy physics experiment: this algorithm has been used to track neutrinoes in a super K detector. This experiment is located in Japan, and data can be processed on-line or off-line locally or remotely.
Zhou, Yuan; Cheng, Xinyao; Xu, Xiangyang; Song, Enmin
2013-12-01
Segmentation of carotid artery intima-media in longitudinal ultrasound images for measuring its thickness to predict cardiovascular diseases can be simplified as detecting two nearly parallel boundaries within a certain distance range, when plaque with irregular shapes is not considered. In this paper, we improve the implementation of two dynamic programming (DP) based approaches to parallel boundary detection, dual dynamic programming (DDP) and piecewise linear dual dynamic programming (PL-DDP). Then, a novel DP based approach, dual line detection (DLD), which translates the original 2-D curve position to a 4-D parameter space representing two line segments in a local image segment, is proposed to solve the problem while maintaining efficiency and rotation invariance. To apply the DLD to ultrasound intima-media segmentation, it is imbedded in a framework that employs an edge map obtained from multiplication of the responses of two edge detectors with different scales and a coupled snake model that simultaneously deforms the two contours for maintaining parallelism. The experimental results on synthetic images and carotid arteries of clinical ultrasound images indicate improved performance of the proposed DLD compared to DDP and PL-DDP, with respect to accuracy and efficiency. Copyright © 2013 Elsevier B.V. All rights reserved.
Towards Interactive Visual Exploration of Parallel Programs using a Domain-Specific Language
Klein, Tobias; Bruckner, Stefan; Grö ller, M. Eduard; Hadwiger, Markus; Rautek, Peter
2016-01-01
The use of GPUs and the massively parallel computing paradigm have become wide-spread. We describe a framework for the interactive visualization and visual analysis of the run-time behavior of massively parallel programs, especially OpenCL kernels. This facilitates understanding a program's function and structure, finding the causes of possible slowdowns, locating program bugs, and interactively exploring and visually comparing different code variants in order to improve performance and correctness. Our approach enables very specific, user-centered analysis, both in terms of the recording of the run-time behavior and the visualization itself. Instead of having to manually write instrumented code to record data, simple code annotations tell the source-to-source compiler which code instrumentation to generate automatically. The visualization part of our framework then enables the interactive analysis of kernel run-time behavior in a way that can be very specific to a particular problem or optimization goal, such as analyzing the causes of memory bank conflicts or understanding an entire parallel algorithm.
A backtracking algorithm for the stream AND-parallel execution of logic programs
Somogyi, Z.; Ramamohanarao, K.; Vaghani, J. (Univ. of Melbourne, Parkville (Australia))
1988-06-01
The authors present the first backtracking algorithm for stream AND-parallel logic programs. It relies on compile-time knowledge of the data flow graph of each clause to let it figure out efficiently which goals to kill or restart when a goal fails. This crucial information, which they derive from mode declarations, was not available at compile-time in any previous stream AND-parallel system. They show that modes can increase the precision of the backtracking algorithm, though their algorithm allows this precision to be traded off against overhead on a procedure-by-procedure and call-by-call basis. The modes also allow their algorithm to handle efficiently programs that manipulate partially instantiated data structures and an important class of programs with circular dependency graphs. On code that does not need backtracking, the efficiency of their algorithm approaches that of the committed-choice languages; on code that does need backtracking its overhead is comparable to that of the independent AND-parallel backtracking algorithms.
Peformance Tuning and Evaluation of a Parallel Community Climate Model
Drake, J.B.; Worley, P.H.; Hammond, S.
1999-11-13
The Parallel Community Climate Model (PCCM) is a message-passing parallelization of version 2.1 of the Community Climate Model (CCM) developed by researchers at Argonne and Oak Ridge National Laboratories and at the National Center for Atmospheric Research in the early to mid 1990s. In preparation for use in the Department of Energy's Parallel Climate Model (PCM), PCCM has recently been updated with new physics routines from version 3.2 of the CCM, improvements to the parallel implementation, and ports to the SGIKray Research T3E and Origin 2000. We describe our experience in porting and tuning PCCM on these new platforms, evaluating the performance of different parallel algorithm options and comparing performance between the T3E and Origin 2000.
Parallel ray tracing for one-dimensional discrete ordinate computations
Jarvis, R.D.; Nelson, P.
1996-01-01
The ray-tracing sweep in discrete-ordinates, spatially discrete numerical approximation methods applied to the linear, steady-state, plane-parallel, mono-energetic, azimuthally symmetric, neutral-particle transport equation can be reduced to a parallel prefix computation. In so doing, the often severe penalty in convergence rate of the source iteration, suffered by most current parallel algorithms using spatial domain decomposition, can be avoided while attaining parallelism in the spatial domain to whatever extent desired. In addition, the reduction implies parallel algorithm complexity limits for the ray-tracing sweep. The reduction applies to all closed, linear, one-cell functional (CLOF) spatial approximation methods, which encompasses most in current popular use. Scalability test results of an implementation of the algorithm on a 64-node nCube-2S hypercube-connected, message-passing, multi-computer are described. (author)
Implementation of a parallel version of a regional climate model
Gerstengarbe, F.W. [ed.; Kuecken, M. [Potsdam-Institut fuer Klimafolgenforschung (PIK), Potsdam (Germany); Schaettler, U. [Deutscher Wetterdienst, Offenbach am Main (Germany). Geschaeftsbereich Forschung und Entwicklung
1997-10-01
A regional climate model developed by the Max Planck Institute for Meterology and the German Climate Computing Centre in Hamburg based on the `Europa` and `Deutschland` models of the German Weather Service has been parallelized and implemented on the IBM RS/6000 SP computer system of the Potsdam Institute for Climate Impact Research including parallel input/output processing, the explicit Eulerian time-step, the semi-implicit corrections, the normal-mode initialization and the physical parameterizations of the German Weather Service. The implementation utilizes Fortran 90 and the Message Passing Interface. The parallelization strategy used is a 2D domain decomposition. This report describes the parallelization strategy, the parallel I/O organization, the influence of different domain decomposition approaches for static and dynamic load imbalances and first numerical results. (orig.)
Numerical discrepancy between serial and MPI parallel computations
Sang Bong Lee
2016-09-01
Full Text Available Numerical simulations of 1D Burgers equation and 2D sloshing problem were carried out to study numerical discrepancy between serial and parallel computations. The numerical domain was decomposed into 2 and 4 subdomains for parallel computations with message passing interface. The numerical solution of Burgers equation disclosed that fully explicit boundary conditions used on subdomains of parallel computation was responsible for the numerical discrepancy of transient solution between serial and parallel computations. Two dimensional sloshing problems in a rectangular domain were solved using OpenFOAM. After a lapse of initial transient time sloshing patterns of water were significantly different in serial and parallel computations although the same numerical conditions were given. Based on the histograms of pressure measured at two points near the wall the statistical characteristics of numerical solution was not affected by the number of subdomains as much as the transient solution was dependent on the number of subdomains.
West, Jeff; Yang, H. Q.
2014-01-01
There are many instances involving liquid/gas interfaces and their dynamics in the design of liquid engine powered rockets such as the Space Launch System (SLS). Some examples of these applications are: Propellant tank draining and slosh, subcritical condition injector analysis for gas generators, preburners and thrust chambers, water deluge mitigation for launch induced environments and even solid rocket motor liquid slag dynamics. Commercially available CFD programs simulating gas/liquid interfaces using the Volume of Fluid approach are currently limited in their parallel scalability. In 2010 for instance, an internal NASA/MSFC review of three commercial tools revealed that parallel scalability was seriously compromised at 8 cpus and no additional speedup was possible after 32 cpus. Other non-interface CFD applications at the time were demonstrating useful parallel scalability up to 4,096 processors or more. Based on this review, NASA/MSFC initiated an effort to implement a Volume of Fluid implementation within the unstructured mesh, pressure-based algorithm CFD program, Loci-STREAM. After verification was achieved by comparing results to the commercial CFD program CFD-Ace+, and validation by direct comparison with data, Loci-STREAM-VoF is now the production CFD tool for propellant slosh force and slosh damping rate simulations at NASA/MSFC. On these applications, good parallel scalability has been demonstrated for problems sizes of tens of millions of cells and thousands of cpu cores. Ongoing efforts are focused on the application of Loci-STREAM-VoF to predict the transient flow patterns of water on the SLS Mobile Launch Platform in order to support the phasing of water for launch environment mitigation so that vehicle determinantal effects are not realized.
The FORCE: A portable parallel programming language supporting computational structural mechanics
Jordan, Harry F.; Benten, Muhammad S.; Brehm, Juergen; Ramanan, Aruna
1989-01-01
This project supports the conversion of codes in Computational Structural Mechanics (CSM) to a parallel form which will efficiently exploit the computational power available from multiprocessors. The work is a part of a comprehensive, FORTRAN-based system to form a basis for a parallel version of the NICE/SPAR combination which will form the CSM Testbed. The software is macro-based and rests on the force methodology developed by the principal investigator in connection with an early scientific multiprocessor. Machine independence is an important characteristic of the system so that retargeting it to the Flex/32, or any other multiprocessor on which NICE/SPAR might be imnplemented, is well supported. The principal investigator has experience in producing parallel software for both full and sparse systems of linear equations using the force macros. Other researchers have used the Force in finite element programs. It has been possible to rapidly develop software which performs at maximum efficiency on a multiprocessor. The inherent machine independence of the system also means that the parallelization will not be limited to a specific multiprocessor.
Multi-petascale highly efficient parallel supercomputer
Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.; Blumrich, Matthias A.; Boyle, Peter; Brunheroto, Jose R.; Chen, Dong; Cher, Chen-Yong; Chiu, George L.; Christ, Norman; Coteus, Paul W.; Davis, Kristan D.; Dozsa, Gabor J.; Eichenberger, Alexandre E.; Eisley, Noel A.; Ellavsky, Matthew R.; Evans, Kahn C.; Fleischer, Bruce M.; Fox, Thomas W.; Gara, Alan; Giampapa, Mark E.; Gooding, Thomas M.; Gschwind, Michael K.; Gunnels, John A.; Hall, Shawn A.; Haring, Rudolf A.; Heidelberger, Philip; Inglett, Todd A.; Knudson, Brant L.; Kopcsay, Gerard V.; Kumar, Sameer; Mamidala, Amith R.; Marcella, James A.; Megerian, Mark G.; Miller, Douglas R.; Miller, Samuel J.; Muff, Adam J.; Mundy, Michael B.; O'Brien, John K.; O'Brien, Kathryn M.; Ohmacht, Martin; Parker, Jeffrey J.; Poole, Ruth J.; Ratterman, Joseph D.; Salapura, Valentina; Satterfield, David L.; Senger, Robert M.; Steinmacher-Burow, Burkhard; Stockdell, William M.; Stunkel, Craig B.; Sugavanam, Krishnan; Sugawara, Yutaka; Takken, Todd E.; Trager, Barry M.; Van Oosten, James L.; Wait, Charles D.; Walkup, Robert E.; Watson, Alfred T.; Wisniewski, Robert W.; Wu, Peng
2018-05-15
A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and minimize latency. The network implements collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design include a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and multiversioning cache that improves soft error rate at the same time and supports DMA functionality allowing for parallel processing message-passing.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL
A parallel solver for huge dense linear systems
Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.
2011-11-01
HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system
Fast implementations of 3D PET reconstruction using vector and parallel programming techniques
Guerrero, T.M.; Cherry, S.R.; Dahlbom, M.; Ricci, A.R.; Hoffman, E.J.
1993-01-01
Computationally intensive techniques that offer potential clinical use have arisen in nuclear medicine. Examples include iterative reconstruction, 3D PET data acquisition and reconstruction, and 3D image volume manipulation including image registration. One obstacle in achieving clinical acceptance of these techniques is the computational time required. This study focuses on methods to reduce the computation time for 3D PET reconstruction through the use of fast computer hardware, vector and parallel programming techniques, and algorithm optimization. The strengths and weaknesses of i860 microprocessor based workstation accelerator boards are investigated in implementations of 3D PET reconstruction
A program system for ab initio MO calculations on vector and parallel processing machines. Pt. 1
Ernenwein, R.; Rohmer, M.M.; Benard, M.
1990-01-01
We present a program system for ab initio molecular orbital calculations on vector and parallel computers. The present article is devoted to the computation of one- and two-electron integrals over contracted Gaussian basis sets involving s-, p-, d- and f-type functions. The McMurchie and Davidson (MMD) algorithm has been implemented and parallelized by distributing over a limited number of logical tasks the calculation of the 55 relevant classes of integrals. All sections of the MMD algorithm have been efficiently vectorized, leading to a scalar/vector ratio of 5.8. Different algorithms are proposed and compared for an optimal vectorization of the contraction of the 'intermediate integrals' generated by the MMD formalism. Advantage is taken of the dynamic storage allocation for tuning the length of the vector loops (i.e. the size of the vectorization buffer) as a function of (i) the total memory available for the job, (ii) the number of logical tasks defined by the user (≤13), and (iii) the storage requested by each specific class of integrals. Test calculations carried out on a CRAY-2 computer show that the average number of finite integrals computed over a (s, p, d, f) CGTO basis set is about 1180000 per second and per processor. The combination of vectorization and parallelism on this 4-processor machine reduces the CPU time by a factor larger than 20 with respect to the scalar and sequential performance. (orig.)
Optimisation of a parallel ocean general circulation model
M. I. Beare
1997-10-01
Full Text Available This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
Optimisation of a parallel ocean general circulation model
M. I. Beare
Full Text Available This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
Andersson, M.
1996-09-01
We have introduced heterogeneity to an existing model as a special feature and simultaneously extended the model from 1D to 3D. Briefly, the code generates stochastic fractures in a given geosphere. These fractures are connected in series to form one pathway for radionuclide transport from the repository to the biosphere. Rock heterogeneity is realized by simulating physical and chemical properties for each fracture, i.e. these properties vary along the transport pathway (which is an ensemble of all fractures serially connected). In this case, each Monte Carlo simulation involves a set of many thousands of realizations, one for each pathway. Each pathway can be formed by approx. 100 fractures. This means that for a Monte Carlo simulation of 1000 realizations, we need to perform a total of 100,000 simulations. Therefore the introduction of heterogeneity has increased the CPU demands by two orders of magnitude. To overcome the demand for CPU, the program, MLCRYSTAL, has been implemented in a parallel workstation environment using the MPI, Message Passing Interface, and later on ported to an IBM-SP2 parallel supercomputer. The program is presented here and a preliminary set of results is given with the conclusions that can be drawn. 3 refs, 12 figs.
Andersson, M.
1996-09-01
We have introduced heterogeneity to an existing model as a special feature and simultaneously extended the model from 1D to 3D. Briefly, the code generates stochastic fractures in a given geosphere. These fractures are connected in series to form one pathway for radionuclide transport from the repository to the biosphere. Rock heterogeneity is realized by simulating physical and chemical properties for each fracture, i.e. these properties vary along the transport pathway (which is an ensemble of all fractures serially connected). In this case, each Monte Carlo simulation involves a set of many thousands of realizations, one for each pathway. Each pathway can be formed by approx. 100 fractures. This means that for a Monte Carlo simulation of 1000 realizations, we need to perform a total of 100,000 simulations. Therefore the introduction of heterogeneity has increased the CPU demands by two orders of magnitude. To overcome the demand for CPU, the program, MLCRYSTAL, has been implemented in a parallel workstation environment using the MPI, Message Passing Interface, and later on ported to an IBM-SP2 parallel supercomputer. The program is presented here and a preliminary set of results is given with the conclusions that can be drawn. 3 refs, 12 figs
A program system for ab initio MO calculations on vector and parallel processing machines. Pt. 3
Wiest, R.; Demuynck, J.; Benard, M.; Rohmer, M.M.; Ernenwein, R.
1991-01-01
This series of three papers presents a program system for ab initio molecular orbital calculations on vector and parallel computers. Part III is devoted to the four-index transformation on a molecular orbital basis of size NMO of the file of two-electorn integrals (pqparallelrs) generated by a contracted Gaussian set of size NATO (number of atomic orbitals). A fast Yoshimine algorithm first sorts the (pqparallelrs) integrals with respect to index pq only. This file of half-sorted integrals labelled by their rs-index can be processed without further modification to generate either the transformed integrals or the supermatrix elements. The large memory available on the CRAY-2 hase made possible to implement the transformation algorithm proposed by Bender in 1972, which requires a core-storage allocation varying as (NATO) 3 . Two versions of Bender's algorithm are included in the present program. The first version is an in-core version, where the complete file of accumulated contributions to transformed integrals in stored and updated in central memory. This version has been parallelized by distributing over a limited number of logical tasks the NATO steps corresponding to the scanning of the most external loop. The second version is an out-of-core version, in which twin files are alternatively used as input and output for the accumulated contributions to transformed integrals. This version is not parallel. The choice of one or another version and (for version 1) the determination of the number of tasks depends upon the balance between the available and the requested amounts of storage. The storage management and the choice of the proper version are carried out automatically using dynamic storage allocation. Both versions are vectorized and take advantage of the molecular symmetry. (orig.)
Slepoy, A; Peters, M D; Thompson, A P
2007-11-30
Molecular dynamics and other molecular simulation methods rely on a potential energy function, based only on the relative coordinates of the atomic nuclei. Such a function, called a force field, approximately represents the electronic structure interactions of a condensed matter system. Developing such approximate functions and fitting their parameters remains an arduous, time-consuming process, relying on expert physical intuition. To address this problem, a functional programming methodology was developed that may enable automated discovery of entirely new force-field functional forms, while simultaneously fitting parameter values. The method uses a combination of genetic programming, Metropolis Monte Carlo importance sampling and parallel tempering, to efficiently search a large space of candidate functional forms and parameters. The methodology was tested using a nontrivial problem with a well-defined globally optimal solution: a small set of atomic configurations was generated and the energy of each configuration was calculated using the Lennard-Jones pair potential. Starting with a population of random functions, our fully automated, massively parallel implementation of the method reproducibly discovered the original Lennard-Jones pair potential by searching for several hours on 100 processors, sampling only a minuscule portion of the total search space. This result indicates that, with further improvement, the method may be suitable for unsupervised development of more accurate force fields with completely new functional forms. Copyright (c) 2007 Wiley Periodicals, Inc.
PAREMD: A parallel program for the evaluation of momentum space properties of atoms and molecules
Meena, Deep Raj; Gadre, Shridhar R.; Balanarayan, P.
2018-03-01
The present work describes a code for evaluating the electron momentum density (EMD), its moments and the associated Shannon information entropy for a multi-electron molecular system. The code works specifically for electronic wave functions obtained from traditional electronic structure packages such as GAMESS and GAUSSIAN. For the momentum space orbitals, the general expression for Gaussian basis sets in position space is analytically Fourier transformed to momentum space Gaussian basis functions. The molecular orbital coefficients of the wave function are taken as an input from the output file of the electronic structure calculation. The analytic expressions of EMD are evaluated over a fine grid and the accuracy of the code is verified by a normalization check and a numerical kinetic energy evaluation which is compared with the analytic kinetic energy given by the electronic structure package. Apart from electron momentum density, electron density in position space has also been integrated into this package. The program is written in C++ and is executed through a Shell script. It is also tuned for multicore machines with shared memory through OpenMP. The program has been tested for a variety of molecules and correlated methods such as CISD, Møller-Plesset second order (MP2) theory and density functional methods. For correlated methods, the PAREMD program uses natural spin orbitals as an input. The program has been benchmarked for a variety of Gaussian basis sets for different molecules showing a linear speedup on a parallel architecture.
Rowland, Paula; McMillan, Sarah; McGillicuddy, Patti; Richards, Joy
2017-01-01
Public and patient involvement (PPI) in health care may refer to many different processes, ranging from participating in decision-making about one's own care to participating in health services research, health policy development, or organizational reforms. Across these many forms of public and patient involvement, the conceptual and theoretical underpinnings remain poorly articulated. Instead, most public and patient involvement programs rely on policy initiatives as their conceptual frameworks. This lack of conceptual clarity participates in dilemmas of program design, implementation, and evaluation. This study contributes to the development of theoretical understandings of public and patient involvement. In particular, we focus on the deployment of patient engagement programs within health service organizations. To develop a deeper understanding of the conceptual underpinnings of these programs, we examined the concept of "the patient perspective" as used by patient engagement practitioners and participants. Specifically, we focused on the way this phrase was used in the singular: "the" patient perspective or "the" patient voice. From qualitative analysis of interviews with 20 patient advisers and 6 staff members within a large urban health network in Canada, we argue that "the patient perspective" is referred to as a particular kind of situated knowledge, specifically an embodied knowledge of vulnerability. We draw parallels between this logic of patient perspective and the logic of early feminist theory, including the concepts of standpoint theory and strong objectivity. We suggest that champions of patient engagement may learn much from the way feminist theorists have constructed their arguments and addressed critique.
A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
Cieślik Marcin
2011-02-01
Full Text Available Abstract Background Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. Results To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'. A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, all flowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption. An add-on module ('NuBio' facilitates the creation of bioinformatics workflows by providing domain specific data-containers (e.g., for biomolecular sequences, alignments, structures and functionality (e.g., to parse/write standard file formats. Conclusions PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and
Matthew O'keefe
1995-01-01
Full Text Available Massively parallel processors (MPPs hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how applications codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.
Du, Chao; Ye, Aizhong; Gan, Yanjun; You, Jinjun; Duan, Qinyun; Ma, Feng; Hou, Jingwen
2017-12-01
High-resolution Digital Elevation Models (DEMs) can be used to extract high-accuracy prerequisite drainage networks. A higher resolution represents a larger number of grids. With an increase in the number of grids, the flow direction determination will require substantial computer resources and computing time. Parallel computing is a feasible method with which to resolve this problem. In this paper, we proposed a parallel programming method within the .NET Framework with a C# Compiler in a Windows environment. The basin is divided into sub-basins, and subsequently the different sub-basins operate on multiple threads concurrently to calculate flow directions. The method was applied to calculate the flow direction of the Yellow River basin from 3 arc-second resolution SRTM DEM. Drainage networks were extracted and compared with HydroSHEDS river network to assess their accuracy. The results demonstrate that this method can calculate the flow direction from high-resolution DEMs efficiently and extract high-precision continuous drainage networks.
Mobile and replicated alignment of arrays in data-parallel programs
Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert
1993-01-01
When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.
High-Level Management of Communication Schedules in HPF-like Languages
Benkner, Siegfried
1997-01-01
..., providing the users with a high-level language interface for programming scalable parallel architectures and delegating to the compiler the task of producing an explicitly parallel message-passing program...
Plasma Physics Calculations on a Parallel Macintosh Cluster
Decyk, Viktor; Dauger, Dean; Kokelaar, Pieter
2000-03-01
We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 MFlops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.
Parallelizing Gene Expression Programming Algorithm in Enabling Large-Scale Classification
Lixiong Xu
2017-01-01
Full Text Available As one of the most effective function mining algorithms, Gene Expression Programming (GEP algorithm has been widely used in classification, pattern recognition, prediction, and other research fields. Based on the self-evolution, GEP is able to mine an optimal function for dealing with further complicated tasks. However, in big data researches, GEP encounters low efficiency issue due to its long time mining processes. To improve the efficiency of GEP in big data researches especially for processing large-scale classification tasks, this paper presents a parallelized GEP algorithm using MapReduce computing model. The experimental results show that the presented algorithm is scalable and efficient for processing large-scale classification tasks.
D'Angelo, Gianni; Rampone, Salvatore
2014-01-01
the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together to a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing systems MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice- junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors. A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is
Porting Gravitational Wave Signal Extraction to Parallel Virtual Machine (PVM)
Thirumalainambi, Rajkumar; Thompson, David E.; Redmon, Jeffery
2009-01-01
Laser Interferometer Space Antenna (LISA) is a planned NASA-ESA mission to be launched around 2012. The Gravitational Wave detection is fundamentally the determination of frequency, source parameters, and waveform amplitude derived in a specific order from the interferometric time-series of the rotating LISA spacecrafts. The LISA Science Team has developed a Mock LISA Data Challenge intended to promote the testing of complicated nested search algorithms to detect the 100-1 millihertz frequency signals at amplitudes of 10E-21. However, it has become clear that, sequential search of the parameters is very time consuming and ultra-sensitive; hence, a new strategy has been developed. Parallelization of existing sequential search algorithms of Gravitational Wave signal identification consists of decomposing sequential search loops, beginning with outermost loops and working inward. In this process, the main challenge is to detect interdependencies among loops and partitioning the loops so as to preserve concurrency. Existing parallel programs are based upon either shared memory or distributed memory paradigms. In PVM, master and node programs are used to execute parallelization and process spawning. The PVM can handle process management and process addressing schemes using a virtual machine configuration. The task scheduling and the messaging and signaling can be implemented efficiently for the LISA Gravitational Wave search process using a master and 6 nodes. This approach is accomplished using a server that is available at NASA Ames Research Center, and has been dedicated to the LISA Data Challenge Competition. Historically, gravitational wave and source identification parameters have taken around 7 days in this dedicated single thread Linux based server. Using PVM approach, the parameter extraction problem can be reduced to within a day. The low frequency computation and a proxy signal-to-noise ratio are calculated in separate nodes that are controlled by the master
Salko, Robert K.; Schmidt, Rodney C.; Avramova, Maria N.
2015-01-01
Highlights: • COBRA-TF was adopted by the Consortium for Advanced Simulation of LWRs. • We have improved code performance to support running large-scale LWR simulations. • Code optimization has led to reductions in execution time and memory usage. • An MPI parallelization has reduced full-core simulation time from days to minutes. - Abstract: This paper describes major improvements to the computational infrastructure of the CTF subchannel code so that full-core, pincell-resolved (i.e., one computational subchannel per real bundle flow channel) simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations. These improvements support the goals of the Department Of Energy Consortium for Advanced Simulation of Light Water Reactors (CASL) Energy Innovation Hub to develop high fidelity multi-physics simulation tools for nuclear energy design and analysis. A set of serial code optimizations—including fixing computational inefficiencies, optimizing the numerical approach, and making smarter data storage choices—are first described and shown to reduce both execution time and memory usage by about a factor of ten. Next, a “single program multiple data” parallelization strategy targeting distributed memory “multiple instruction multiple data” platforms utilizing domain decomposition is presented. In this approach, data communication between processors is accomplished by inserting standard Message-Passing Interface (MPI) calls at strategic points in the code. The domain decomposition approach implemented assigns one MPI process to each fuel assembly, with each domain being represented by its own CTF input file. The creation of CTF input files, both for serial and parallel runs, is also fully automated through use of a pressurized water reactor (PWR) pre-processor utility that uses a greatly simplified set of user input compared with the traditional CTF input. To run CTF in
Moryakov, A. V., E-mail: sailor@orc.ru [National Research Centre Kurchatov Institute (Russian Federation)
2016-12-15
An algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations is presented. The algorithm for systems of first-order differential equations is implemented in the EDELWEISS code with the possibility of parallel computations on supercomputers employing the MPI (Message Passing Interface) standard for the data exchange between parallel processes. The solution is represented by a series of orthogonal polynomials on the interval [0, 1]. The algorithm is characterized by simplicity and the possibility to solve nonlinear problems with a correction of the operator in accordance with the solution obtained in the previous iterative process.
Research in Parallel Algorithms and Software for Computational Aerosciences
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms
Hasanov, Khalid
2014-03-04
© 2014, Springer Science+Business Media New York. Many state-of-the-art parallel algorithms, which are widely used in scientific applications executed on high-end computing systems, were designed in the twentieth century with relatively small-scale parallelism in mind. Indeed, while in 1990s a system with few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to optimization of message-passing parallel algorithms for execution on large-scale distributed-memory systems. The idea is to reduce the communication cost by introducing hierarchy and hence more parallelism in the communication scheme. We apply this approach to SUMMA, the state-of-the-art parallel algorithm for matrix–matrix multiplication, and demonstrate both theoretically and experimentally that the modified Hierarchical SUMMA significantly improves the communication cost and the overall performance on large-scale platforms.
Parallel Monte Carlo simulation of aerosol dynamics
Zhou, K.
2014-01-01
A highly efficient Monte Carlo (MC) algorithm is developed for the numerical simulation of aerosol dynamics, that is, nucleation, surface growth, and coagulation. Nucleation and surface growth are handled with deterministic means, while coagulation is simulated with a stochastic method (Marcus-Lushnikov stochastic process). Operator splitting techniques are used to synthesize the deterministic and stochastic parts in the algorithm. The algorithm is parallelized using the Message Passing Interface (MPI). The parallel computing efficiency is investigated through numerical examples. Near 60% parallel efficiency is achieved for the maximum testing case with 3.7 million MC particles running on 93 parallel computing nodes. The algorithm is verified through simulating various testing cases and comparing the simulation results with available analytical and/or other numerical solutions. Generally, it is found that only small number (hundreds or thousands) of MC particles is necessary to accurately predict the aerosol particle number density, volume fraction, and so forth, that is, low order moments of the Particle Size Distribution (PSD) function. Accurately predicting the high order moments of the PSD needs to dramatically increase the number of MC particles. 2014 Kun Zhou et al.
O'Connor, B P
2000-08-01
Popular statistical software packages do not have the proper procedures for determining the number of components in factor and principal components analyses. Parallel analysis and Velicer's minimum average partial (MAP) test are validated procedures, recommended widely by statisticians. However, many researchers continue to use alternative, simpler, but flawed procedures, such as the eigenvalues-greater-than-one rule. Use of the proper procedures might be increased if these procedures could be conducted within familiar software environments. This paper describes brief and efficient programs for using SPSS and SAS to conduct parallel analyses and the MAP test.
Practical Formal Verification of MPI and Thread Programs
Gopalakrishnan, Ganesh; Kirby, Robert M.
Large-scale simulation codes in science and engineering are written using the Message Passing Interface (MPI). Shared memory threads are widely used directly, or to implement higher level programming abstractions. Traditional debugging methods for MPI or thread programs are incapable of providing useful formal guarantees about coverage. They get bogged down in the sheer number of interleavings (schedules), often missing shallow bugs. In this tutorial we will introduce two practical formal verification tools: ISP (for MPI C programs) and Inspect (for Pthread C programs). Unlike other formal verification tools, ISP and Inspect run directly on user source codes (much like a debugger). They pursue only the relevant set of process interleavings, using our own customized Dynamic Partial Order Reduction algorithms. For a given test harness, DPOR allows these tools to guarantee the absence of deadlocks, instrumented MPI object leaks and communication races (using ISP), and shared memory races (using Inspect). ISP and Inspect have been used to verify large pieces of code: in excess of 10,000 lines of MPI/C for ISP in under 5 seconds, and about 5,000 lines of Pthread/C code in a few hours (and much faster with the use of a cluster or by exploiting special cases such as symmetry) for Inspect. We will also demonstrate the Microsoft Visual Studio and Eclipse Parallel Tools Platform integrations of ISP (these will be available on the LiveCD).
From functional programming to multicore parallelism: A case study based on Presburger Arithmetic
Dung, Phan Anh; Hansen, Michael Reichhardt
2011-01-01
, we are interested in using PA in connection with the Duration Calculus Model Checker (DCMC) [5]. There are effective decision procedures for PA including Cooper’s algorithm and the Omega Test; however, their complexity is extremely high with doubly exponential lower bound and triply exponential upper...... bound [7]. We investigate these decision procedures in the context of multicore parallelism with the hope of exploiting multicore powers. Unfortunately, we are not aware of any prior parallelism research related to decision procedures for PA. The closest work is the preliminary results on parallelism...
Plimpton, Steven J.; Hendrickson, Bruce; Burns, Shawn P.; McLendon, William III; Rauchwerger, Lawrence
2005-01-01
The method of discrete ordinates is commonly used to solve the Boltzmann transport equation. The solution in each ordinate direction is most efficiently computed by sweeping the radiation flux across the computational grid. For unstructured grids this poses many challenges, particularly when implemented on distributed-memory parallel machines where the grid geometry is spread across processors. We present several algorithms relevant to this approach: (a) an asynchronous message-passing algorithm that performs sweeps simultaneously in multiple ordinate directions, (b) a simple geometric heuristic to prioritize the computational tasks that a processor works on, (c) a partitioning algorithm that creates columnar-style decompositions for unstructured grids, and (d) an algorithm for detecting and eliminating cycles that sometimes exist in unstructured grids and can prevent sweeps from successfully completing. Algorithms (a) and (d) are fully parallel; algorithms (b) and (c) can be used in conjunction with (a) to achieve higher parallel efficiencies. We describe our message-passing implementations of these algorithms within a radiation transport package. Performance and scalability results are given for unstructured grids with up to 3 million elements (500 million unknowns) running on thousands of processors of Sandia National Laboratories' Intel Tflops machine and DEC-Alpha CPlant cluster
First massively parallel algorithm to be implemented in Apollo-II code
Stankovski, Z.
1994-01-01
The collision probability (CP) method in neutron transport, as applied to arbitrary 2D XY geometries, like the TDT module in APOLLO-II, is very time consuming. Consequently RZ or 3D extensions became prohibitive. Fortunately, this method is very suitable for parallelization. Massively parallel computer architectures, especially MIMD machines, bring a new breath to this method. In this paper we present a CM5 implementation of the CP method. Parallelization is applied to the energy groups, using the CMMD message passing library. In our case we use 32 processors for the standard 99-group APOLLIB-II library. The real advantage of this algorithm will appear in the calculation of the future fine multigroup library (about 8000 groups) of the SAPHYR project with a massively parallel computer (to the order of hundreds of processors). (author). 3 tabs., 4 figs., 4 refs
Same-source parallel implementation of the PSU/NCAR MM5
Michalakes, J.
1997-12-31
The Pennsylvania State/National Center for Atmospheric Research Mesoscale Model is a limited-area model of atmospheric systems, now in its fifth generation, MM5. Designed and maintained for vector and shared-memory parallel architectures, the official version of MM5 does not run on message-passing distributed memory (DM) parallel computers. The authors describe a same-source parallel implementation of the PSU/NCAR MM5 using FLIC, the Fortran Loop and Index Converter. The resulting source is nearly line-for-line identical with the original source code. The result is an efficient distributed memory parallel option to MM5 that can be seamlessly integrated into the official version.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
Jin, Shuangshuang; Chen, Yousu; Wu, Di; Diao, Ruisheng; Huang, Zhenyu
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbance, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computational intensive and challenging to solve using single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on shared-memory platform, and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The difference of the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.
Harper, Richard
1989-01-01
In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.
Concurrent Collections (CnC): A new approach to parallel programming
CERN. Geneva
2010-01-01
A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer o...
Böddeker, B.; Teichler, H.
The MD simulation program TABB is motivated by the need of long time simulations for the investigation of slow processes near the glass transition of glass forming alloys. TABB is written in C++ with a high degree of flexibility: TABB allows the use of any short ranged pair potentials or EAM potentials, by generating and using a spline representation of all functions and their derivatives. TABB supports several numerical integration algorithms like the Runge-Kotta or the modified Gear-predictor-corrector algorithm of order five. The boundary conditions can be chosen to resemble the geometry of bulk materials or films. The simulation box length or the pressure can be fixed for each dimension separately. TABB may be used in isokinetic, isoenergeric or canonic (with random forces) mode. TABB contains a simple instruction interpreter to easily control the parameters and options during the simulation. The same source code can be compiled either for workstations or for parallel computers. The main optimization goal of TABB is to allow long time simulations of medium or small sized systems. To make this possible, much attention is spent on the optimized communication between the nodes. TABB uses a domain decomposition procedure. To use many nodes with a small system, the domain size has to be small compared to the range of particle interactions. In the limit of many nodes for only few atoms, the bottle neck of communication is the latency time. TABB minimizes the number of pairs of domains containing atoms that interact between these domains. This procedure minimizes the need of communication calls between pairs of nodes. TABB decides automatically, to how many, and to which directions the decomposition shall be applied. E.g., in the case of one dimensional domain decomposition, the simulation box is only split into "slabs" along a selected direction. The three dimensional domain decomposition is best with respect to the number of interacting domains only for simulations
Development of parallel 3D discrete ordinates transport program on JASMIN framework
Cheng, T.; Wei, J.; Shen, H.; Zhong, B.; Deng, L.
2015-01-01
A parallel 3D discrete ordinates radiation transport code JSNT-S is developed, aiming at simulating real-world radiation shielding and reactor physics applications in a reasonable time. Through the patch-based domain partition algorithm, the memory requirement is shared among processors and a space-angle parallel sweeping algorithm is developed based on data-driven algorithm. Acceleration methods such as partial current rebalance are implemented. The correctness is proved through the VENUS-3 and other benchmark models. In the radiation shielding calculation of the Qinshan-II reactor pressure vessel model with 24.3 billion DoF, only 88 seconds is required and the overall parallel efficiency of 44% is achieved on 1536 CPU cores. (author)
Optimizing spread dynamics on graphs by message passing
Altarelli, F; Braunstein, A; Dall’Asta, L; Zecchina, R
2013-01-01
Cascade processes are responsible for many important phenomena in natural and social sciences. Simple models of irreversible dynamics on graphs, in which nodes activate depending on the state of their neighbors, have been successfully applied to describe cascades in a large variety of contexts. Over the past decades, much effort has been devoted to understanding the typical behavior of the cascades arising from initial conditions extracted at random from some given ensemble. However, the problem of optimizing the trajectory of the system, i.e. of identifying appropriate initial conditions to maximize (or minimize) the final number of active nodes, is still considered to be practically intractable, with the only exception being models that satisfy a sort of diminishing returns property called submodularity. Submodular models can be approximately solved by means of greedy strategies, but by definition they lack cooperative characteristics which are fundamental in many real systems. Here we introduce an efficient algorithm based on statistical physics for the optimization of trajectories in cascade processes on graphs. We show that for a wide class of irreversible dynamics, even in the absence of submodularity, the spread optimization problem can be solved efficiently on large networks. Analytic and algorithmic results on random graphs are complemented by the solution of the spread maximization problem on a real-world network (the Epinions consumer reviews network). (paper)
Discrete geometric analysis of message passing algorithm on graphs
Watanabe, Yusuke
2010-04-01
We often encounter probability distributions given as unnormalized products of non-negative functions. The factorization structures are represented by hypergraphs called factor graphs. Such distributions appear in various fields, including statistics, artificial intelligence, statistical physics, error correcting codes, etc. Given such a distribution, computations of marginal distributions and the normalization constant are often required. However, they are computationally intractable because of their computational costs. One successful approximation method is Loopy Belief Propagation (LBP) algorithm. The focus of this thesis is an analysis of the LBP algorithm. If the factor graph is a tree, i.e. having no cycle, the algorithm gives the exact quantities. If the factor graph has cycles, however, the LBP algorithm does not give exact results and possibly exhibits oscillatory and non-convergent behaviors. The thematic question of this thesis is "How the behaviors of the LBP algorithm are affected by the discrete geometry of the factor graph?" The primary contribution of this thesis is the discovery of a formula that establishes the relation between the LBP, the Bethe free energy and the graph zeta function. This formula provides new techniques for analysis of the LBP algorithm, connecting properties of the graph and of the LBP and the Bethe free energy. We demonstrate applications of the techniques to several problems including (non) convexity of the Bethe free energy, the uniqueness and stability of the LBP fixed point. We also discuss the loop series initiated by Chertkov and Chernyak. The loop series is a subgraph expansion of the normalization constant, or partition function, and reflects the graph geometry. We investigate theoretical natures of the series. Moreover, we show a partial connection between the loop series and the graph zeta function.
Optimizing spread dynamics on graphs by message passing
Altarelli, F.; Braunstein, A.; Dall'Asta, L.; Zecchina, R.
2013-09-01
Cascade processes are responsible for many important phenomena in natural and social sciences. Simple models of irreversible dynamics on graphs, in which nodes activate depending on the state of their neighbors, have been successfully applied to describe cascades in a large variety of contexts. Over the past decades, much effort has been devoted to understanding the typical behavior of the cascades arising from initial conditions extracted at random from some given ensemble. However, the problem of optimizing the trajectory of the system, i.e. of identifying appropriate initial conditions to maximize (or minimize) the final number of active nodes, is still considered to be practically intractable, with the only exception being models that satisfy a sort of diminishing returns property called submodularity. Submodular models can be approximately solved by means of greedy strategies, but by definition they lack cooperative characteristics which are fundamental in many real systems. Here we introduce an efficient algorithm based on statistical physics for the optimization of trajectories in cascade processes on graphs. We show that for a wide class of irreversible dynamics, even in the absence of submodularity, the spread optimization problem can be solved efficiently on large networks. Analytic and algorithmic results on random graphs are complemented by the solution of the spread maximization problem on a real-world network (the Epinions consumer reviews network).
Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations
Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)
2002-01-01
The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(O) preconditioned CG (PCG) using different programming paradigms and architectures. Results show that for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems, that cache reuse may be more important than reducing communication, that it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution, and that a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gains. A implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread level parallelism.
Parallel processing of two-dimensional Sn transport calculations
Uematsu, M.
1997-01-01
A parallel processing method for the two-dimensional S n transport code DOT3.5 has been developed to achieve a drastic reduction in computation time. In the proposed method, parallelization is achieved with angular domain decomposition and/or space domain decomposition. The calculational speed of parallel processing by angular domain decomposition is largely influenced by frequent communications between processing elements. To assess parallelization efficiency, sample problems with up to 32 x 32 spatial meshes were solved with a Sun workstation using the PVM message-passing library. As a result, parallel calculation using 16 processing elements, for example, was found to be nine times as fast as that with one processing element. As for parallel processing by geometry segmentation, the influence of processing element communications on computation time is small; however, discontinuity at the segment boundary degrades convergence speed. To accelerate the convergence, an alternate sweep of angular flux in conjunction with space domain decomposition and a two-step rescaling method consisting of segmentwise rescaling and ordinary pointwise rescaling have been developed. By applying the developed method, the number of iterations needed to obtain a converged flux solution was reduced by a factor of 2. As a result, parallel calculation using 16 processing elements was found to be 5.98 times as fast as the original DOT3.5 calculation
An object-oriented bulk synchronous parallel library for multicore programming
Yzelman, A.N.; Bisseling, R.H.
2012-01-01
We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has
Waintraub, Marcel; Pereira, Claudio M.N.A.; Baptista, Rafael P.
2005-01-01
This work presents the development of a distributed parallel genetic algorithm applied to a nuclear reactor core design optimization. In the implementation of the parallelism, a 'Message Passing Interface' (MPI) library, standard for parallel computation in distributed memory platforms, has been used. Another important characteristic of MPI is its portability for various architectures. The main objectives of this paper are: validation of the results obtained by the application of this algorithm in a nuclear reactor core optimization problem, through comparisons with previous results presented by Pereira et al.; and performance test of the Brazilian Nuclear Engineering Institute (IEN) cluster in reactors physics optimization problems. The experiments demonstrated that the developed parallel genetic algorithm using the MPI library presented significant gains in the obtained results and an accentuated reduction of the processing time. Such results ratify the use of the parallel genetic algorithms for the solution of nuclear reactor core optimization problems. (author)
Yeh, M.; Kim, J.; Khan, F.S.
1995-01-01
We present a parallel decomposition of the tight-binding fictitious Lagrangian algorithm for the Intel iPSC/860 and the Intel Paragon parallel computers. We show that it is possible to perform long simulations, of the order of 10 000 time steps, on semiconducting clusters consisting of as many as 512 atoms, on a time scale of the order of 20 h or less. We have made a very careful timing analysis of all parts of our code, and have identified the bottlenecks. We have also derived formulas which can predict the timing of our code, based on the number of processors, message passing bandwidth, floating point performance of each node, and the set up time for message passing, appropriate to the machine being used. The time of the simulation scales as the square of the number of particles, if the number of processors is made to scale linearly with the number of particles. We show that for a system as large as 512 atoms, the main bottleneck of the computation is the orthogonalization of the wave functions, which consumes about 90% of the total time of the simulation
Parallel SN algorithms in shared- and distributed-memory environments
Haghighat, Alireza; Hunter, Melissa A.; Mattis, Ronald E.
1995-01-01
Different 2-D spatial domain partitioning Sn transport theory algorithms have been developed on the basis of the Block-Jacobi iterative scheme. These algorithms have been incorporated into TWOTRAN-II, and tested on a shared-memory CRAY Y-MP C90 and a distributed-memory IBM SP1. For a series of fixed source r-z geometry homogeneous problems, parallel efficiencies in a range of 50-90% are achieved on the C90 with 6 processors, and lower values (20-60%) are obtained on the SP1. It is demonstrated that better performance is attainable if one addresses issues such as convergence rate, load-balancing, and granularity for both architectures, as well as message passing (network bandwidth and latency) for SP1. (author). 17 refs, 4 figs
Ordering schemes for sparse matrices using modern programming paradigms
Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak
2000-01-01
The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. In previous work, we investigated the effects of various ordering and partitioning strategies on the performance of CG using different programming paradigms and architectures. This paper makes several extensions to our prior research. First, we present a hybrid(MPI+OpenMP) implementation of the CG algorithm on the IBM SP and show that the hybrid paradigm increases programming complexity with little performance gains compared to a pure MPI implementation. For ill-conditioned linear systems, it is often necessary to use a preconditioning technique. We present MPI results for ILU(0) preconditioned CG (PCG) using the BlockSolve95 library, and show that the initial ordering of the input matrix dramatically affect PCG's performance. Finally, a multithreaded version of the PCG is developed on the Cray (Tera) MTA. Unlike the message-passing version, this implementation did not require the complexities of special orderings or graph dependency analysis. However, only limited scalability was achieved due to the lack of available thread level parallelism
CUDA/GPU Technology : Parallel Programming For High Performance Scientific Computing
YUHENDRA; KUZE, Hiroaki; JOSAPHAT, Tetuko Sri Sumantyo
2009-01-01
[ABSTRACT]Graphics processing units (GP Us) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. In the high performance computation capabilities, graphic processing units (GPU) lead to much more powerful performance than conventional CPUs by means of parallel processing. In 2007, the birth of Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in the general purpose GPU a...
A parallel neural network training algorithm for control of discrete dynamical systems.
Gordillo, J. L.; Hanebutte, U. R.; Vitela, J. E.
1998-01-20
In this work we present a parallel neural network controller training code, that uses MPI, a portable message passing environment. A comprehensive performance analysis is reported which compares results of a performance model with actual measurements. The analysis is made for three different load assignment schemes: block distribution, strip mining and a sliding average bin packing (best-fit) algorithm. Such analysis is crucial since optimal load balance can not be achieved because the work load information is not available a priori. The speedup results obtained with the above schemes are compared with those corresponding to the bin packing load balance scheme with perfect load prediction based on a priori knowledge of the computing effort. Two multiprocessor platforms: a SGI/Cray Origin 2000 and a IBM SP have been utilized for this study. It is shown that for the best load balance scheme a parallel efficiency of over 50% for the entire computation is achieved by 17 processors of either parallel computers.
A parallel solution for high resolution histological image analysis.
Bueno, G; González, R; Déniz, O; García-Rojo, M; González-García, J; Fernández-Carrobles, M M; Vállez, N; Salido, J
2012-10-01
This paper describes a general methodology for developing parallel image processing algorithms based on message passing for high resolution images (on the order of several Gigabytes). These algorithms have been applied to histological images and must be executed on massively parallel processing architectures. Advances in new technologies for complete slide digitalization in pathology have been combined with developments in biomedical informatics. However, the efficient use of these digital slide systems is still a challenge. The image processing that these slides are subject to is still limited both in terms of data processed and processing methods. The work presented here focuses on the need to design and develop parallel image processing tools capable of obtaining and analyzing the entire gamut of information included in digital slides. Tools have been developed to assist pathologists in image analysis and diagnosis, and they cover low and high-level image processing methods applied to histological images. Code portability, reusability and scalability have been tested by using the following parallel computing architectures: distributed memory with massive parallel processors and two networks, INFINIBAND and Myrinet, composed of 17 and 1024 nodes respectively. The parallel framework proposed is flexible, high performance solution and it shows that the efficient processing of digital microscopic images is possible and may offer important benefits to pathology laboratories. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
New parallel SOR method by domain partitioning
Xie, Dexuan [Courant Inst. of Mathematical Sciences New York Univ., NY (United States)
1996-12-31
In this paper, we propose and analyze a new parallel SOR method, the PSOR method, formulated by using domain partitioning together with an interprocessor data-communication technique. For the 5-point approximation to the Poisson equation on a square, we show that the ordering of the PSOR based on the strip partition leads to a consistently ordered matrix, and hence the PSOR and the SOR using the row-wise ordering have the same convergence rate. However, in general, the ordering used in PSOR may not be {open_quote}consistently ordered{close_quotes}. So, there is a need to analyze the convergence of PSOR directly. In this paper, we present a PSOR theory, and show that the PSOR method can have the same asymptotic rate of convergence as the corresponding sequential SOR method for a wide class of linear systems in which the matrix is {open_quotes}consistently ordered{close_quotes}. Finally, we demonstrate the parallel performance of the PSOR method on four different message passing multiprocessors (a KSR1, the Intel Delta, an Intel Paragon and an IBM SP2), along with a comparison with the point Red-Black and four-color SOR methods.
Yim, Keun Soo
This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of
José Miguel Vargas-Félix
2012-11-01
Full Text Available The Finite Element Method (FEM is used to solve problems like solid deformation and heat diffusion in domains with complex geometries. This kind of geometries requires discretization with millions of elements; this is equivalent to solve systems of equations with sparse matrices and tens or hundreds of millions of variables. The aim is to use computer clusters to solve these systems. The solution method used is Schur substructuration. Using it is possible to divide a large system of equations into many small ones to solve them more efficiently. This method allows parallelization. MPI (Message Passing Interface is used to distribute the systems of equations to solve each one in a computer of a cluster. Each system of equations is solved using a solver implemented to use OpenMP as a local parallelization method.The Finite Element Method (FEM is used to solve problems like solid deformation and heat diffusion in domains with complex geometries. This kind of geometries requires discretization with millions of elements; this is equivalent to solve systems of equations with sparse matrices and tens or hundreds of millions of variables. The aim is to use computer clusters to solve these systems. The solution method used is Schur substructuration. Using it is possible to divide a large system of equations into many small ones to solve them more efficiently. This method allows parallelization. MPI (Message Passing Interface is used to distribute the systems of equations to solve each one in a computer of a cluster. Each system of equations is solved using a solver implemented to use OpenMP as a local parallelization method.
A hybrid parallel framework for the cellular Potts model simulations
Jiang, Yi [Los Alamos National Laboratory; He, Kejing [SOUTH CHINA UNIV; Dong, Shoubin [SOUTH CHINA UNIV
2009-01-01
The Cellular Potts Model (CPM) has been widely used for biological simulations. However, most current implementations are either sequential or approximated, which can't be used for large scale complex 3D simulation. In this paper we present a hybrid parallel framework for CPM simulations. The time-consuming POE solving, cell division, and cell reaction operation are distributed to clusters using the Message Passing Interface (MPI). The Monte Carlo lattice update is parallelized on shared-memory SMP system using OpenMP. Because the Monte Carlo lattice update is much faster than the POE solving and SMP systems are more and more common, this hybrid approach achieves good performance and high accuracy at the same time. Based on the parallel Cellular Potts Model, we studied the avascular tumor growth using a multiscale model. The application and performance analysis show that the hybrid parallel framework is quite efficient. The hybrid parallel CPM can be used for the large scale simulation ({approx}10{sup 8} sites) of complex collective behavior of numerous cells ({approx}10{sup 6}).
Development of parallel Fokker-Planck code ALLAp
Batishcheva, A.A.; Sigmar, D.J.; Koniges, A.E.
1996-01-01
We report on our ongoing development of the 3D Fokker-Planck code ALLA for a highly collisional scrape-off-layer (SOL) plasma. A SOL with strong gradients of density and temperature in the spatial dimension is modeled. Our method is based on a 3-D adaptive grid (in space, magnitude of the velocity, and cosine of the pitch angle) and a second order conservative scheme. Note that the grid size is typically 100 x 257 x 65 nodes. It was shown in our previous work that only these capabilities make it possible to benchmark a 3D code against a spatially-dependent self-similar solution of a kinetic equation with the Landau collision term. In the present work we show results of a more precise benchmarking against the exact solutions of the kinetic equation using a new parallel code ALLAp with an improved method of parallelization and a modified boundary condition at the plasma edge. We also report first results from the code parallelization using Message Passing Interface for a Massively Parallel CRI T3D platform. We evaluate the ALLAp code performance versus the number of T3D processors used and compare its efficiency against a Work/Data Sharing parallelization scheme and a workstation version
Analysis of multigrid methods on massively parallel computers: Architectural implications
Matheson, Lesley R.; Tarjan, Robert E.
1993-01-01
We study the potential performance of multigrid algorithms running on massively parallel computers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10(exp 6) and 10(exp 9), respectively. Our models of parallel computation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages, (up to 1000 words) or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.
Julián A García-Grajales
Full Text Available With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally only characterized by purely mechanistic criteria, functions of quantities such as stress, strain or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has been rarely explored. In particular, a quantitative mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computational expensive calculations that increase in complexity as the network of the simulated cells grows. The solvers implemented in Neurite--explicit and implicit--were therefore parallelized using graphics processing units in order to reduce the burden of the simulation costs of large scale scenarios. Cable Theory and Hodgkin-Huxley models were implemented to account for the electrophysiological passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite mechanical behavior within its surrounding medium was adopted as a link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon
Vectorization and parallelization of Monte-Carlo programs for calculation of radiation transport
Seidel, R.
1995-01-01
The versatile MCNP-3B Monte-Carlo code written in FORTRAN77, for simulation of the radiation transport of neutral particles, has been subjected to vectorization and parallelization of essential parts, without touching its versatility. Vectorization is not dependent on a specific computer. Several sample tasks have been selected in order to test the vectorized MCNP-3B code in comparison to the scalar MNCP-3B code. The samples are a representative example of the 3-D calculations to be performed for simulation of radiation transport in neutron and reactor physics. (1) 4πneutron detector. (2) High-energy calorimeter. (3) PROTEUS benchmark (conversion rates and neutron multiplication factors for the HCLWR (High Conversion Light Water Reactor)). (orig./HP) [de
Contributions for the optimization of the extensibility of parallel programing of turbulent plasmas
Rozar, F.
2015-01-01
The work realized through this thesis focuses on the optimization of the Gysela code which simulates a plasma turbulence. Optimization of a scientific application concerns mainly one of the three following points: 1) the simulation of larger meshes, 2) the reduction of computing time and 3) the enhancement of the computation accuracy. The first part of this manuscript presents the contributions relative to the simulation of larger mesh. Alike many simulation codes, getting more realistic simulations is often analogous to rene the meshes. The finer the mesh the larger the memory consumption. Moreover, during these last few years, the supercomputers had trend to provide less and less memory per computer core. For these reasons, we have developed a library, the libMTM (Modeling and Tracing Memory), dedicated to study precisely the memory consumption of parallel softwares. The libMTM tools allowed us to reduce the memory consumption of Gysela and to study its scalability. As far as we know, there is no other tool which provides equivalent features which allow the memory scalability study. The second part of the manuscript presents the works relative to the optimization of the computation time and the improvement of accuracy of the gyro-average operator. This operator represents a corner stone of the gyrokinetic model which is used by the Gysela application. The improvement of accuracy emanates from a change in the computing method: a scheme based on a 2D Hermite interpolation substitutes the Pade approximation. Although the new version of the gyro-average operator is more accurate, it is also more expensive in computation time than the former one. In order to keep the simulation in reasonable time, different optimizations have been performed on the new computing method to get it competitive. Finally, we have developed a MPI parallelized version of the new gyro-average operator. The good scalability of this new gyro-average computer will allow, eventually, a reduction
V. Yu. Kleshnin
2016-01-01
Full Text Available The article describes the matrix algebra libraries based on the modern technologies of parallel programming for the Spectrum software, which can use a spectral method (in the spectral form of mathematical description to analyse, synthesise and identify deterministic and stochastic dynamical systems. The developed matrix algebra libraries use the following technologies for the GPUs: OmniThreadLibrary, OpenMP, Intel Threading Building Blocks, Intel Cilk Plus for CPUs nVidia CUDA, OpenCL, and Microsoft Accelerated Massive Parallelism.The developed libraries support matrices with real elements (single and double precision. The matrix dimensions are limited by 32-bit or 64-bit memory model and computer configuration. These libraries are general-purpose and can be used not only for the Spectrum software. They can also find application in the other projects where there is a need to perform operations with large matrices.The article provides a comparative analysis of the libraries developed for various matrix operations (addition, subtraction, scalar multiplication, multiplication, powers of matrices, tensor multiplication, transpose, inverse matrix, finding a solution of the system of linear equations through the numerical experiments using different CPU and GPU. The article contains sample programs and performance test results for matrix multiplication, which requires most of all computational resources in regard to the other operations.
Morse, H Stephen
1994-01-01
Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment.Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi
De Novo Ultrascale Atomistic Simulations On High-End Parallel Supercomputers
Nakano, A; Kalia, R K; Nomura, K; Sharma, A; Vashishta, P; Shimojo, F; van Duin, A; Goddard, III, W A; Biswas, R; Srivastava, D; Yang, L H
2006-09-04
We present a de novo hierarchical simulation framework for first-principles based predictive simulations of materials and their validation on high-end parallel supercomputers and geographically distributed clusters. In this framework, high-end chemically reactive and non-reactive molecular dynamics (MD) simulations explore a wide solution space to discover microscopic mechanisms that govern macroscopic material properties, into which highly accurate quantum mechanical (QM) simulations are embedded to validate the discovered mechanisms and quantify the uncertainty of the solution. The framework includes an embedded divide-and-conquer (EDC) algorithmic framework for the design of linear-scaling simulation algorithms with minimal bandwidth complexity and tight error control. The EDC framework also enables adaptive hierarchical simulation with automated model transitioning assisted by graph-based event tracking. A tunable hierarchical cellular decomposition parallelization framework then maps the O(N) EDC algorithms onto Petaflops computers, while achieving performance tunability through a hierarchy of parameterized cell data/computation structures, as well as its implementation using hybrid Grid remote procedure call + message passing + threads programming. High-end computing platforms such as IBM BlueGene/L, SGI Altix 3000 and the NSF TeraGrid provide an excellent test grounds for the framework. On these platforms, we have achieved unprecedented scales of quantum-mechanically accurate and well validated, chemically reactive atomistic simulations--1.06 billion-atom fast reactive force-field MD and 11.8 million-atom (1.04 trillion grid points) quantum-mechanical MD in the framework of the EDC density functional theory on adaptive multigrids--in addition to 134 billion-atom non-reactive space-time multiresolution MD, with the parallel efficiency as high as 0.998 on 65,536 dual-processor BlueGene/L nodes. We have also achieved an automated execution of hierarchical QM
Maeda, Takuto; Takemura, Shunsuke; Furumura, Takashi
2017-07-01
We have developed an open-source software package, Open-source Seismic Wave Propagation Code (OpenSWPC), for parallel numerical simulations of seismic wave propagation in 3D and 2D (P-SV and SH) viscoelastic media based on the finite difference method in local-to-regional scales. This code is equipped with a frequency-independent attenuation model based on the generalized Zener body and an efficient perfectly matched layer for absorbing boundary condition. A hybrid-style programming using OpenMP and the Message Passing Interface (MPI) is adopted for efficient parallel computation. OpenSWPC has wide applicability for seismological studies and great portability to allowing excellent performance from PC clusters to supercomputers. Without modifying the code, users can conduct seismic wave propagation simulations using their own velocity structure models and the necessary source representations by specifying them in an input parameter file. The code has various modes for different types of velocity structure model input and different source representations such as single force, moment tensor and plane-wave incidence, which can easily be selected via the input parameters. Widely used binary data formats, the Network Common Data Form (NetCDF) and the Seismic Analysis Code (SAC) are adopted for the input of the heterogeneous structure model and the outputs of the simulation results, so users can easily handle the input/output datasets. All codes are written in Fortran 2003 and are available with detailed documents in a public repository.[Figure not available: see fulltext.
Parallel computing techniques for rotorcraft aerodynamics
Ekici, Kivanc
The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).
Studies of parallel algorithms for the solution of a Fokker-Planck equation
Deck, D.; Samba, G.
1995-11-01
The study of laser-created plasmas often requires the use of a kinetic model rather than a hydrodynamic one. This model change occurs, for example, in the hot spot formation in an ICF experiment or during the relaxation of colliding plasmas. When the gradients scalelengths or the size of a given system are not small compared to the characteristic mean-free-path, we have to deal with non-equilibrium situations, which can be described by the distribution functions of every species in the system. We present here a numerical method in plane or spherical 1-D geometry, for the solution of a Fokker-Planck equation that describes the evolution of stich functions in the phase space. The size and the time scale of kinetic simulations require the use of Massively Parallel Computers (MPP). We have adopted a message-passing strategy using Parallel Virtual Machine (PVM)
SBML-PET-MPI: a parallel parameter estimation tool for Systems Biology Markup Language based models.
Zi, Zhike
2011-04-01
Parameter estimation is crucial for the modeling and dynamic analysis of biological systems. However, implementing parameter estimation is time consuming and computationally demanding. Here, we introduced a parallel parameter estimation tool for Systems Biology Markup Language (SBML)-based models (SBML-PET-MPI). SBML-PET-MPI allows the user to perform parameter estimation and parameter uncertainty analysis by collectively fitting multiple experimental datasets. The tool is developed and parallelized using the message passing interface (MPI) protocol, which provides good scalability with the number of processors. SBML-PET-MPI is freely available for non-commercial use at http://www.bioss.uni-freiburg.de/cms/sbml-pet-mpi.html or http://sites.google.com/site/sbmlpetmpi/.
Parallel file system with metadata distributed across partitioned key-value store c
Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron
2017-09-19
Improved techniques are provided for storing metadata associated with a plurality of sub-files associated with a single shared file in a parallel file system. The shared file is generated by a plurality of applications executing on a plurality of compute nodes. A compute node implements a Parallel Log Structured File System (PLFS) library to store at least one portion of the shared file generated by an application executing on the compute node and metadata for the at least one portion of the shared file on one or more object storage servers. The compute node is also configured to implement a partitioned data store for storing a partition of the metadata for the shared file, wherein the partitioned data store communicates with partitioned data stores on other compute nodes using a message passing interface. The partitioned data store can be implemented, for example, using Multidimensional Data Hashing Indexing Middleware (MDHIM).
SILVA JUNIOR,J. B.
2016-12-01
Full Text Available The Intrusion Detection System (IDS needs to compare the contents of all packets arriving at the network interface with a set of signatures for indicating possible attacks, a task that consumes much CPU processing time. In order to alleviate this problem, some researchers have tried to parallelize the IDS's comparison engine, transferring execution from the CPU to GPU. This paper identifies and maps the parallelization features of the Aho-Corasick algorithm, which is used in Snort to compare patterns, in order to show this algorithm's implementation and execution issues, as well as optimization techniques for the Aho-Corasick machine. We have found 147 papers from important computer science publications databases, and have mapped them. We selected 22 and analyzed them in order to find our results. Our analysis of the papers showed, among other results, that parallelization of the AC algorithm is a new task and the authors have focused on the State Transition Table as the most common way to implement the algorithm on the GPU. Furthermore, we found that some techniques speed up the algorithm and reduce the required machine storage space are highly used, such as the algorithm running on the fastest memories and mechanisms for reducing the number of nodes and bit maping.
Jejcic, A.; Maillard, J.; Maurel, G.; Silva, J.; Wolff-Bacha, F.
1997-01-01
The work in the field of parallel processing has developed as research activities using several numerical Monte Carlo simulations related to basic or applied current problems of nuclear and particle physics. For the applications utilizing the GEANT code development or improvement works were done on parts simulating low energy physical phenomena like radiation, transport and interaction. The problem of actinide burning by means of accelerators was approached using a simulation with the GEANT code. A program of neutron tracking in the range of low energies up to the thermal region has been developed. It is coupled to the GEANT code and permits in a single pass the simulation of a hybrid reactor core receiving a proton burst. Other works in this field refers to simulations for nuclear medicine applications like, for instance, development of biological probes, evaluation and characterization of the gamma cameras (collimators, crystal thickness) as well as the method for dosimetric calculations. Particularly, these calculations are suited for a geometrical parallelization approach especially adapted to parallel machines of the TN310 type. Other works mentioned in the same field refer to simulation of the electron channelling in crystals and simulation of the beam-beam interaction effect in colliders. The GEANT code was also used to simulate the operation of germanium detectors designed for natural and artificial radioactivity monitoring of environment
Huei Peng
2013-04-01
Full Text Available This paper compares two optimal energy management methods for parallel hybrid electric vehicles using an Automatic Manual Transmission (AMT. A control-oriented model of the powertrain and vehicle dynamics is built first. The energy management is formulated as a typical optimal control problem to trade off the fuel consumption and gear shifting frequency under admissible constraints. The Dynamic Programming (DP and Pontryagin’s Minimum Principle (PMP are applied to obtain the optimal solutions. Tuning with the appropriate co-states, the PMP solution is found to be very close to that from DP. The solution for the gear shifting in PMP has an algebraic expression associated with the vehicular velocity and can be implemented more efficiently in the control algorithm. The computation time of PMP is significantly less than DP.
Xyce parallel electronic simulator : users' guide.
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.; Santarelli, Keith R.; Fixel, Deborah A.; Coffey, Todd Stirling; Russo, Thomas V.; Schiek, Richard Louis; Warrender, Christina E.; Keiter, Eric Richard; Pawlowski, Roger Patrick
2011-05-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is
Automated Instrumentation, Monitoring and Visualization of PVM Programs Using AIMS
Mehra, Pankaj; VanVoorst, Brian; Yan, Jerry; Lum, Henry, Jr. (Technical Monitor)
1994-01-01
We present views and analysis of the execution of several PVM (Parallel Virtual Machine) codes for Computational Fluid Dynamics on a networks of Sparcstations, including: (1) NAS Parallel Benchmarks CG and MG; (2) a multi-partitioning algorithm for NAS Parallel Benchmark SP; and (3) an overset grid flowsolver. These views and analysis were obtained using our Automated Instrumentation and Monitoring System (AIMS) version 3.0, a toolkit for debugging the performance of PVM programs. We will describe the architecture, operation and application of AIMS. The AIMS toolkit contains: (1) Xinstrument, which can automatically instrument various computational and communication constructs in message-passing parallel programs; (2) Monitor, a library of runtime trace-collection routines; (3) VK (Visual Kernel), an execution-animation tool with source-code clickback; and (4) Tally, a tool for statistical analysis of execution profiles. Currently, Xinstrument can handle C and Fortran 77 programs using PVM 3.2.x; Monitor has been implemented and tested on Sun 4 systems running SunOS 4.1.2; and VK uses XIIR5 and Motif 1.2. Data and views obtained using AIMS clearly illustrate several characteristic features of executing parallel programs on networked workstations: (1) the impact of long message latencies; (2) the impact of multiprogramming overheads and associated load imbalance; (3) cache and virtual-memory effects; and (4) significant skews between workstation clocks. Interestingly, AIMS can compensate for constant skew (zero drift) by calibrating the skew between a parent and its spawned children. In addition, AIMS' skew-compensation algorithm can adjust timestamps in a way that eliminates physically impossible communications (e.g., messages going backwards in time). Our current efforts are directed toward creating new views to explain the observed performance of PVM programs. Some of the features planned for the near future include: (1) ConfigView, showing the physical topology
Dube, Evi [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Shereda, Charles [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Nau, Lee [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Harris, Lance [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2010-09-27
As supercomputing moves toward exascale, node architectures will change significantly. CPU core counts on nodes will increase by an order of magnitude or more. Heterogeneous architectures will become more commonplace, with GPUs or FPGAs providing additional computational power. Novel programming models may make better use of on-node parallelism in these new architectures than do current models. In this paper we examine several of these novel models – UPC, CUDA, and OpenCL –to determine their suitability to LLNL scientific application codes. Our study consisted of several phases: We conducted interviews with code teams and selected two codes to port; We learned how to program in the new models and ported the codes; We debugged and tuned the ported applications; We measured results, and documented our findings. We conclude that UPC is a challenge for porting code, Berkeley UPC is not very robust, and UPC is not suitable as a general alternative to OpenMP for a number of reasons. CUDA is well supported and robust but is a proprietary NVIDIA standard, while OpenCL is an open standard. Both are well suited to a specific set of application problems that can be run on GPUs, but some problems are not suited to GPUs. Further study of the landscape of novel models is recommended.
Hung, Linda; Huang, Chen; Shin, Ilgyou; Ho, Gregory S.; Lignères, Vincent L.; Carter, Emily A.
2010-12-01
Orbital-free density functional theory (OFDFT) is a first principles quantum mechanics method to find the ground-state energy of a system by variationally minimizing with respect to the electron density. No orbitals are used in the evaluation of the kinetic energy (unlike Kohn-Sham DFT), and the method scales nearly linearly with the size of the system. The PRinceton Orbital-Free Electronic Structure Software (PROFESS) uses OFDFT to model materials from the atomic scale to the mesoscale. This new version of PROFESS allows the study of larger systems with two significant changes: PROFESS is now parallelized, and the ion-electron and ion-ion terms scale quasilinearly, instead of quadratically as in PROFESS v1 (L. Hung and E.A. Carter, Chem. Phys. Lett. 475 (2009) 163). At the start of a run, PROFESS reads the various input files that describe the geometry of the system (ion positions and cell dimensions), the type of elements (defined by electron-ion pseudopotentials), the actions you want it to perform (minimize with respect to electron density and/or ion positions and/or cell lattice vectors), and the various options for the computation (such as which functionals you want it to use). Based on these inputs, PROFESS sets up a computation and performs the appropriate optimizations. Energies, forces, stresses, material geometries, and electron density configurations are some of the values that can be output throughout the optimization. New version program summaryProgram Title: PROFESS Catalogue identifier: AEBN_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEBN_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 68 721 No. of bytes in distributed program, including test data, etc.: 1 708 547 Distribution format: tar.gz Programming language: Fortran 90 Computer
Barbara Chapman
2012-02-01
OpenMP was not well recognized at the beginning of the project, around year 2003, because of its limited use in DoE production applications and the inmature hardware support for an efficient implementation. Yet in the recent years, it has been graduately adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked deligiently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to make sure OpenMP are evolved in the direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.
Guo, L-X; Li, J; Zeng, H
2009-11-01
We present an investigation of the electromagnetic scattering from a three-dimensional (3-D) object above a two-dimensional (2-D) randomly rough surface. A Message Passing Interface-based parallel finite-difference time-domain (FDTD) approach is used, and the uniaxial perfectly matched layer (UPML) medium is adopted for truncation of the FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different number of processors is illustrated for one rough surface realization and shows that the computation time of our parallel FDTD algorithm is dramatically reduced relative to a single-processor implementation. Finally, the composite scattering coefficients versus scattered and azimuthal angle are presented and analyzed for different conditions, including the surface roughness, the dielectric constants, the polarization, and the size of the 3-D object.
Parallel file system performances in fusion data storage
Iannone, F.; Podda, S.; Bracco, G.; Manduchi, G.; Maslennikov, A.; Migliori, S.; Wolkersdorfer, K.
2012-01-01
High I/O flow rates, up to 10 GB/s, are required in large fusion Tokamak experiments like ITER where hundreds of nodes store simultaneously large amounts of data acquired during the plasma discharges. Typical network topologies such as linear arrays (systolic), rings, meshes (2-D arrays), tori (3-D arrays), trees, butterfly, hypercube in combination with high speed data transports like Infiniband or 10G-Ethernet, are the main areas in which the effort to overcome the so-called parallel I/O bottlenecks is most focused. The high I/O flow rates were modelled in an emulated testbed based on the parallel file systems such as Lustre and GPFS, commonly used in High Performance Computing. The test runs on High Performance Computing–For Fusion (8640 cores) and ENEA CRESCO (3392 cores) supercomputers. Message Passing Interface based applications were developed to emulate parallel I/O on Lustre and GPFS using data archival and access solutions like MDSPLUS and Universal Access Layer. These methods of data storage organization are widely diffused in nuclear fusion experiments and are being developed within the EFDA Integrated Tokamak Modelling – Task Force; the authors tried to evaluate their behaviour in a realistic emulation setup.
Parallel file system performances in fusion data storage
Iannone, F., E-mail: francesco.iannone@enea.it [Associazione EURATOM-ENEA sulla Fusione, C.R.ENEA Frascati, via E.Fermi, 45 - 00044 Frascati, Rome (Italy); Podda, S.; Bracco, G. [ENEA Information Communication Tecnologies, Lungotevere Thaon di Revel, 76 - 00196 Rome (Italy); Manduchi, G. [Associazione EURATOM-ENEA sulla Fusione, Consorzio RFX, Corso Stati Uniti, 4 - 35127 Padua (Italy); Maslennikov, A. [CASPUR Inter-University Consortium for the Application of Super-Computing for Research, via dei Tizii, 6b - 00185 Rome (Italy); Migliori, S. [ENEA Information Communication Tecnologies, Lungotevere Thaon di Revel, 76 - 00196 Rome (Italy); Wolkersdorfer, K. [Juelich Supercomputing Centre-FZJ, D-52425 Juelich (Germany)
2012-12-15
High I/O flow rates, up to 10 GB/s, are required in large fusion Tokamak experiments like ITER where hundreds of nodes store simultaneously large amounts of data acquired during the plasma discharges. Typical network topologies such as linear arrays (systolic), rings, meshes (2-D arrays), tori (3-D arrays), trees, butterfly, hypercube in combination with high speed data transports like Infiniband or 10G-Ethernet, are the main areas in which the effort to overcome the so-called parallel I/O bottlenecks is most focused. The high I/O flow rates were modelled in an emulated testbed based on the parallel file systems such as Lustre and GPFS, commonly used in High Performance Computing. The test runs on High Performance Computing-For Fusion (8640 cores) and ENEA CRESCO (3392 cores) supercomputers. Message Passing Interface based applications were developed to emulate parallel I/O on Lustre and GPFS using data archival and access solutions like MDSPLUS and Universal Access Layer. These methods of data storage organization are widely diffused in nuclear fusion experiments and are being developed within the EFDA Integrated Tokamak Modelling - Task Force; the authors tried to evaluate their behaviour in a realistic emulation setup.
Parallelism in matrix computations
Gallopoulos, Efstratios; Sameh, Ahmed H
2016-01-01
This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded ,Vandermonde ,Toeplitz ,and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...
Optimization approaches to mpi and area merging-based parallel buffer algorithm
Junfu Fan
Full Text Available On buffer zone construction, the rasterization-based dilation method inevitably introduces errors, and the double-sided parallel line method involves a series of complex operations. In this paper, we proposed a parallel buffer algorithm based on area merging and MPI (Message Passing Interface to improve the performances of buffer analyses on processing large datasets. Experimental results reveal that there are three major performance bottlenecks which significantly impact the serial and parallel buffer construction efficiencies, including the area merging strategy, the task load balance method and the MPI inter-process results merging strategy. Corresponding optimization approaches involving tree-like area merging strategy, the vertex number oriented parallel task partition method and the inter-process results merging strategy were suggested to overcome these bottlenecks. Experiments were carried out to examine the performance efficiency of the optimized parallel algorithm. The estimation results suggested that the optimization approaches could provide high performance and processing ability for buffer construction in a cluster parallel environment. Our method could provide insights into the parallelization of spatial analysis algorithm.
Load-balancing techniques for a parallel electromagnetic particle-in-cell code
PLIMPTON,STEVEN J.; SEIDEL,DAVID B.; PASIK,MICHAEL F.; COATS,REBECCA S.
2000-01-01
QUICKSILVER is a 3-d electromagnetic particle-in-cell simulation code developed and used at Sandia to model relativistic charged particle transport. It models the time-response of electromagnetic fields and low-density-plasmas in a self-consistent manner: the fields push the plasma particles and the plasma current modifies the fields. Through an LDRD project a new parallel version of QUICKSILVER was created to enable large-scale plasma simulations to be run on massively-parallel distributed-memory supercomputers with thousands of processors, such as the Intel Tflops and DEC CPlant machines at Sandia. The new parallel code implements nearly all the features of the original serial QUICKSILVER and can be run on any platform which supports the message-passing interface (MPI) standard as well as on single-processor workstations. This report describes basic strategies useful for parallelizing and load-balancing particle-in-cell codes, outlines the parallel algorithms used in this implementation, and provides a summary of the modifications made to QUICKSILVER. It also highlights a series of benchmark simulations which have been run with the new code that illustrate its performance and parallel efficiency. These calculations have up to a billion grid cells and particles and were run on thousands of processors. This report also serves as a user manual for people wishing to run parallel QUICKSILVER.
Xyce parallel electronic simulator users guide, version 6.0.
Keiter, Eric R; Mei, Ting; Russo, Thomas V.; Schiek, Richard Louis; Thornquist, Heidi K.; Verley, Jason C.; Fixel, Deborah A.; Coffey, Todd S; Pawlowski, Roger P; Warrender, Christina E.; Baur, David Gregory.
2013-08-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce parallel electronic simulator users' guide, Version 6.0.1.
Keiter, Eric R; Mei, Ting; Russo, Thomas V.; Schiek, Richard Louis; Thornquist, Heidi K.; Verley, Jason C.; Fixel, Deborah A.; Coffey, Todd S; Pawlowski, Roger P; Warrender, Christina E.; Baur, David Gregory.
2014-01-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce parallel electronic simulator users guide, version 6.1
Keiter, Eric R; Mei, Ting; Russo, Thomas V.; Schiek, Richard Louis; Sholander, Peter E.; Thornquist, Heidi K.; Verley, Jason C.; Baur, David Gregory
2014-03-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas; Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers; A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; Device models that are specifically tailored to meet Sandia's needs, including some radiationaware devices (for Sandia users only); and Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase-a message passing parallel implementation-which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.
Keiter, Eric R. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation; Aadithya, Karthik V. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation; Mei, Ting [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation; Russo, Thomas V. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation; Schiek, Richard L. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation; Sholander, Peter E. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation; Thornquist, Heidi K. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation; Verley, Jason C. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States). Electrical Models and Simulation
2016-06-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.
Load-balancing techniques for a parallel electromagnetic particle-in-cell code
Plimpton, Steven J.; Seidel, David B.; Pasik, Michael F.; Coats, Rebecca S.
2000-01-01
QUICKSILVER is a 3-d electromagnetic particle-in-cell simulation code developed and used at Sandia to model relativistic charged particle transport. It models the time-response of electromagnetic fields and low-density-plasmas in a self-consistent manner: the fields push the plasma particles and the plasma current modifies the fields. Through an LDRD project a new parallel version of QUICKSILVER was created to enable large-scale plasma simulations to be run on massively-parallel distributed-memory supercomputers with thousands of processors, such as the Intel Tflops and DEC CPlant machines at Sandia. The new parallel code implements nearly all the features of the original serial QUICKSILVER and can be run on any platform which supports the message-passing interface (MPI) standard as well as on single-processor workstations. This report describes basic strategies useful for parallelizing and load-balancing particle-in-cell codes, outlines the parallel algorithms used in this implementation, and provides a summary of the modifications made to QUICKSILVER. It also highlights a series of benchmark simulations which have been run with the new code that illustrate its performance and parallel efficiency. These calculations have up to a billion grid cells and particles and were run on thousands of processors. This report also serves as a user manual for people wishing to run parallel QUICKSILVER
Xyce Parallel Electronic Simulator Users' Guide Version 6.8
Keiter, Eric R. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Aadithya, Karthik Venkatraman [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Mei, Ting [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Russo, Thomas V. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Schiek, Richard L. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sholander, Peter E. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Thornquist, Heidi K. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Verley, Jason C. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2017-10-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase$-$ a message passing parallel implementation $-$ which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Parallel multiscale simulations of a brain aneurysm
Grinberg, Leopold [Division of Applied Mathematics, Brown University, Providence, RI 02912 (United States); Fedosov, Dmitry A. [Institute of Complex Systems and Institute for Advanced Simulation, Forschungszentrum Jülich, Jülich 52425 (Germany); Karniadakis, George Em, E-mail: george_karniadakis@brown.edu [Division of Applied Mathematics, Brown University, Providence, RI 02912 (United States)
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in
Parallel multiscale simulations of a brain aneurysm
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2013-01-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in
Distributed Memory Parallel Computing with SEAWAT
Verkaik, J.; Huizer, S.; van Engelen, J.; Oude Essink, G.; Ram, R.; Vuik, K.
2017-12-01
Fresh groundwater reserves in coastal aquifers are threatened by sea-level rise, extreme weather conditions, increasing urbanization and associated groundwater extraction rates. To counteract these threats, accurate high-resolution numerical models are required to optimize the management of these precious reserves. The major model drawbacks are long run times and large memory requirements, limiting the predictive power of these models. Distributed memory parallel computing is an efficient technique for reducing run times and memory requirements, where the problem is divided over multiple processor cores. A new Parallel Krylov Solver (PKS) for SEAWAT is presented. PKS has recently been applied to MODFLOW and includes Conjugate Gradient (CG) and Biconjugate Gradient Stabilized (BiCGSTAB) linear accelerators. Both accelerators are preconditioned by an overlapping additive Schwarz preconditioner in a way that: a) subdomains are partitioned using Recursive Coordinate Bisection (RCB) load balancing, b) each subdomain uses local memory only and communicates with other subdomains by Message Passing Interface (MPI) within the linear accelerator, c) it is fully integrated in SEAWAT. Within SEAWAT, the PKS-CG solver replaces the Preconditioned Conjugate Gradient (PCG) solver for solving the variable-density groundwater flow equation and the PKS-BiCGSTAB solver replaces the Generalized Conjugate Gradient (GCG) solver for solving the advection-diffusion equation. PKS supports the third-order Total Variation Diminishing (TVD) scheme for computing advection. Benchmarks were performed on the Dutch national supercomputer (https://userinfo.surfsara.nl/systems/cartesius) using up to 128 cores, for a synthetic 3D Henry model (100 million cells) and the real-life Sand Engine model ( 10 million cells). The Sand Engine model was used to investigate the potential effect of the long-term morphological evolution of a large sand replenishment and climate change on fresh groundwater resources
An Optimized Parallel FDTD Topology for Challenging Electromagnetic Simulations on Supercomputers
Shugang Jiang
2015-01-01
Full Text Available It may not be a challenge to run a Finite-Difference Time-Domain (FDTD code for electromagnetic simulations on a supercomputer with more than 10 thousands of CPU cores; however, to make FDTD code work with the highest efficiency is a challenge. In this paper, the performance of parallel FDTD is optimized through MPI (message passing interface virtual topology, based on which a communication model is established. The general rules of optimal topology are presented according to the model. The performance of the method is tested and analyzed on three high performance computing platforms with different architectures in China. Simulations including an airplane with a 700-wavelength wingspan, and a complex microstrip antenna array with nearly 2000 elements are performed very efficiently using a maximum of 10240 CPU cores.
SOFTWARE FOR DESIGNING PARALLEL APPLICATIONS
M. K. Bouza
2017-01-01
Full Text Available The object of research is the tools to support the development of parallel programs in C/C ++. The methods and software which automates the process of designing parallel applications are proposed.
Control rod drop transient analysis with the coupled parallel code pCTF-PARCSv2.7
Ramos, Enrique; Roman, Jose E.; Abarca, Agustín; Miró, Rafael; Bermejo, Juan A.
2016-01-01
Highlights: • An MPI parallel version of the thermal–hydraulic subchannel code COBRA-TF has been developed. • The parallel code has been coupled to the 3D neutron diffusion code PARCSv2.7. • The new codes are validated with a control rod drop transient. - Abstract: In order to reduce the response time when simulating large reactors in detail, a parallel version of the thermal–hydraulic subchannel code COBRA-TF (CTF) has been developed using the standard Message Passing Interface (MPI). The parallelization is oriented to reactor cells, so it is best suited for models consisting of many cells. The generation of the Jacobian matrix is parallelized, in such a way that each processor is in charge of generating the data associated with a subset of cells. Also, the solution of the linear system of equations is done in parallel, using the PETSc toolkit. With the goal of creating a powerful tool to simulate the reactor core behavior during asymmetrical transients, the 3D neutron diffusion code PARCSv2.7 (PARCS) has been coupled with the parallel version of CTF (pCTF) using the Parallel Virtual Machine (PVM) technology. In order to validate the correctness of the parallel coupled code, a control rod drop transient has been simulated comparing the results with the real experimental measures acquired during an NPP real test.
Zhan, X.
2005-01-01
A parallel Fortran-MPI (Message Passing Interface) software for numerical inversion of the Laplace transform based on a Fourier series method is developed to meet the need of solving intensive computational problems involving oscillatory water level's response to hydraulic tests in a groundwater environment. The software is a parallel version of ACM (The Association for Computing Machinery) Transactions on Mathematical Software (TOMS) Algorithm 796. Running 38 test examples indicated that implementation of MPI techniques with distributed memory architecture speedups the processing and improves the efficiency. Applications to oscillatory water levels in a well during aquifer tests are presented to illustrate how this package can be applied to solve complicated environmental problems involved in differential and integral equations. The package is free and is easy to use for people with little or no previous experience in using MPI but who wish to get off to a quick start in parallel computing. ?? 2004 Elsevier Ltd. All rights reserved.
CERN. Geneva
2016-01-01
The traditionally used and well established parallel programming models OpenMP and MPI are both targeting lower level parallelism and are meant to be as language agnostic as possible. For a long time, those models were the only widely available portable options for developing parallel C++ applications beyond using plain threads. This has strongly limited the optimization capabilities of compilers, has inhibited extensibility and genericity, and has restricted the use of those models together with other, modern higher level abstractions introduced by the C++11 and C++14 standards. The recent revival of interest in the industry and wider community for the C++ language has also spurred a remarkable amount of standardization proposals and technical specifications being developed. Those efforts however have so far failed to build a vision on how to seamlessly integrate various types of parallelism, such as iterative parallel execution, task-based parallelism, asynchronous many-task execution flows, continuation s...
CHOLLA: A NEW MASSIVELY PARALLEL HYDRODYNAMICS CODE FOR ASTROPHYSICAL SIMULATION
Schneider, Evan E.; Robertson, Brant E. [Steward Observatory, University of Arizona, 933 North Cherry Avenue, Tucson, AZ 85721 (United States)
2015-04-15
We present Computational Hydrodynamics On ParaLLel Architectures (Cholla ), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256{sup 3}) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.
CHOLLA: A NEW MASSIVELY PARALLEL HYDRODYNAMICS CODE FOR ASTROPHYSICAL SIMULATION
Schneider, Evan E.; Robertson, Brant E.
2015-01-01
We present Computational Hydrodynamics On ParaLLel Architectures (Cholla ), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256 3 ) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density
Parallel implementation of many-body mean-field equations
Chinn, C.R.; Umar, A.S.; Vallieres, M.; Strayer, M.R.
1994-01-01
We describe the numerical methods used to solve the system of stiff, nonlinear partial differential equations resulting from the Hartree-Fock description of many-particle quantum systems, as applied to the structure of the nucleus. The solutions are performed on a three-dimensional Cartesian lattice. Discretization is achieved through the lattice basis-spline collocation method, in which quantum-state vectors and coordinate-space operators are expressed in terms of basis-spline functions on a spatial lattice. All numerical procedures reduce to a series of matrix-vector multiplications and other elementary operations, which we perform on a number of different computing architectures, including the Intel Paragon and the Intel iPSC/860 hypercube. Parallelization is achieved through a combination of mechanisms employing the Gram-Schmidt procedure, broadcasts, global operations, and domain decomposition of state vectors. We discuss the approach to the problems of limited node memory and node-to-node communication overhead inherent in using distributed-memory, multiple-instruction, multiple-data stream parallel computers. An algorithm was developed to reduce the communication overhead by pipelining some of the message passing procedures
Parallelization characteristics of the DeCART code
Cho, J. Y.; Joo, H. G.; Kim, H. Y.; Lee, C. C.; Chang, M. H.; Zee, S. Q.
2003-12-01
This report is to describe the parallelization characteristics of the DeCART code and also examine its parallel performance. Parallel computing algorithms are implemented to DeCART to reduce the tremendous computational burden and memory requirement involved in the three-dimensional whole core transport calculation. In the parallelization of the DeCART code, the axial domain decomposition is first realized by using MPI (Message Passing Interface), and then the azimuthal angle domain decomposition by using either MPI or OpenMP. When using the MPI for both the axial and the angle domain decomposition, the concept of MPI grouping is employed for convenient communication in each communication world. For the parallel computation, most of all the computing modules except for the thermal hydraulic module are parallelized. These parallelized computing modules include the MOC ray tracing, CMFD, NEM, region-wise cross section preparation and cell homogenization modules. For the distributed allocation, most of all the MOC and CMFD/NEM variables are allocated only for the assigned planes, which reduces the required memory by a ratio of the number of the assigned planes to the number of all planes. The parallel performance of the DeCART code is evaluated by solving two problems, a rodded variation of the C5G7 MOX three-dimensional benchmark problem and a simplified three-dimensional SMART PWR core problem. In the aspect of parallel performance, the DeCART code shows a good speedup of about 40.1 and 22.4 in the ray tracing module and about 37.3 and 20.2 in the total computing time when using 48 CPUs on the IBM Regatta and 24 CPUs on the LINUX cluster, respectively. In the comparison between the MPI and OpenMP, OpenMP shows a somewhat better performance than MPI. Therefore, it is concluded that the first priority in the parallel computation of the DeCART code is in the axial domain decomposition by using MPI, and then in the angular domain using OpenMP, and finally the angular
Crockett, Thomas W.
1995-01-01
This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
1982-01-01
Parallel Computations focuses on parallel computation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallel computers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed.Comprised of 13 chapters, this volume begins by classifying parallel computers and describing techn
González-Domínguez, Jorge; Remeseiro, Beatriz; Martín, María J
2017-02-01
The analysis of the interference patterns on the tear film lipid layer is a useful clinical test to diagnose dry eye syndrome. This task can be automated with a high degree of accuracy by means of the use of tear film maps. However, the time required by the existing applications to generate them prevents a wider acceptance of this method by medical experts. Multithreading has been previously successfully employed by the authors to accelerate the tear film map definition on multicore single-node machines. In this work, we propose a hybrid message-passing and multithreading parallel approach that further accelerates the generation of tear film maps by exploiting the computational capabilities of distributed-memory systems such as multicore clusters and supercomputers. The algorithm for drawing tear film maps is parallelized using Message Passing Interface (MPI) for inter-node communications and the multithreading support available in the C++11 standard for intra-node parallelization. The original algorithm is modified to reduce the communications and increase the scalability. The hybrid method has been tested on 32 nodes of an Intel cluster (with two 12-core Haswell 2680v3 processors per node) using 50 representative images. Results show that maximum runtime is reduced from almost two minutes using the previous only-multithreaded approach to less than ten seconds using the hybrid method. The hybrid MPI/multithreaded implementation can be used by medical experts to obtain tear film maps in only a few seconds, which will significantly accelerate and facilitate the diagnosis of the dry eye syndrome. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Parallelization of the MAAP-A code neutronics/thermal hydraulics coupling
Froehle, P.H.; Wei, T.Y.C.; Weber, D.P.; Henry, R.E.
1998-01-01
A major new feature, one-dimensional space-time kinetics, has been added to a developmental version of the MAAP code through the introduction of the DIF3D-K module. This code is referred to as MAAP-A. To reduce the overall job time required, a capability has been provided to run the MAAP-A code in parallel. The parallel version of MAAP-A utilizes two machines running in parallel, with the DIF3D-K module executing on one machine and the rest of the MAAP-A code executing on the other machine. Timing results obtained during the development of the capability indicate that reductions in time of 30--40% are possible. The parallel version can be run on two SPARC 20 (SUN OS 5.5) workstations connected through the ethernet. MPI (Message Passing Interface standard) needs to be implemented on the machines. If necessary the parallel version can also be run on only one machine. The results obtained running in this one-machine mode identically match the results obtained from the serial version of the code
Gianluca, Longoni; Alireza, Haghighat
2003-01-01
In recent years, the SP L (simplified spherical harmonics) equations have received renewed interest for the simulation of nuclear systems. We have derived the SP L equations starting from the even-parity form of the S N equations. The SP L equations form a system of (L+1)/2 second order partial differential equations that can be solved with standard iterative techniques such as the Conjugate Gradient (CG). We discretized the SP L equations with the finite-volume approach in a 3-D Cartesian space. We developed a new 3-D general code, Pensp L (Parallel Environment Neutral-particle SP L ). Pensp L solves both fixed source and criticality eigenvalue problems. In order to optimize the memory management, we implemented a Compressed Diagonal Storage (CDS) to store the SP L matrices. Pensp L includes parallel algorithms for space and moment domain decomposition. The computational load is distributed on different processors, using a mapping function, which maps the 3-D Cartesian space and moments onto processors. The code is written in Fortran 90 using the Message Passing Interface (MPI) libraries for the parallel implementation of the algorithm. The code has been tested on the Pcpen cluster and the parallel performance has been assessed in terms of speed-up and parallel efficiency. (author)
1991-10-23
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Denis Becker
2016-05-01
Full Text Available Transaction level models of systems-on-chip in SystemC are commonly used in the industry to provide an early simulation environment. The SystemC standard imposes coroutine semantics for the scheduling of simulated processes, to ensure determinism and reproducibility of simulations. However, because of this, sequential implementations have, for a long time, been the only option available, and still now the reference implementation is sequential. With the increasing size and complexity of models, and the multiplication of computation cores on recent machines, the parallelization of SystemC simulations is a major research concern. There have been several proposals for SystemC parallelization, but most of them are limited to cycle-accurate models. In this paper we focus on loosely timed models, which are commonly used in the industry. We present an industrial context and show that, unfortunately, most of the existing approaches for SystemC parallelization can fundamentally not apply in this context. We support this claim with a set of measurements performed on a platform used in production at STMicroelectronics. This paper surveys existing techniques, presents a visualization and profiling tool and identifies unsolved challenges in the parallelization of SystemC models at transaction level.
Casanova, Henri; Robert, Yves
2008-01-01
""…The authors of the present book, who have extensive credentials in both research and instruction in the area of parallelism, present a sound, principled treatment of parallel algorithms. … This book is very well written and extremely well designed from an instructional point of view. … The authors have created an instructive and fascinating text. The book will serve researchers as well as instructors who need a solid, readable text for a course on parallelism in computing. Indeed, for anyone who wants an understandable text from which to acquire a current, rigorous, and broad vi
Study of MPI based on parallel MOM on PC clusters for EM-beam scattering by 2-D PEC rough surfaces
Jun, Ma; Li-Xin, Guo; An-Qi, Wang
2009-01-01
This paper firstly applies the finite impulse response filter (FIR) theory combined with the fast Fourier transform (FFT) method to generate two-dimensional Gaussian rough surface. Using the electric field integral equation (EFIE), it introduces the method of moment (MOM) with RWG vector basis function and Galerkin's method to investigate the electromagnetic beam scattering by a two-dimensional PEC Gaussian rough surface on personal computer (PC) clusters. The details of the parallel conjugate gradient method (CGM) for solving the matrix equation are also presented and the numerical simulations are obtained through the message passing interface (MPI) platform on the PC clusters. It finds significantly that the parallel MOM supplies a novel technique for solving a two-dimensional rough surface electromagnetic-scattering problem. The influences of the root-mean-square height, the correlation length and the polarization on the beam scattering characteristics by two-dimensional PEC Gaussian rough surfaces are finally discussed. (classical areas of phenomenology)
Cao, Bin; Zhao, Jianwei; Yang, Po
2018-01-01
-objective evolutionary algorithms the Cooperative Coevolutionary Generalized Differential Evolution 3, the Cooperative Multi-objective Differential Evolution and the Nondominated Sorting Genetic Algorithm III, the proposed algorithm addresses the deployment optimization problem efficiently and effectively.......Using immune algorithms is generally a time-intensive process especially for problems with a large number of variables. In this paper, we propose a distributed parallel cooperative coevolutionary multi-objective large-scale immune algorithm that is implemented using the message passing interface...... (MPI). The proposed algorithm is composed of three layers: objective, group and individual layers. First, for each objective in the multi-objective problem to be addressed, a subpopulation is used for optimization, and an archive population is used to optimize all the objectives. Second, the large...
Yang, Sheng-Chun; Lu, Zhong-Yuan; Qian, Hu-Jun; Wang, Yong-Lei; Han, Jie-Ping
2017-11-01
, which approximately take up most of the total simulation time. Although the parallel method CU-ENUF (Yang et al., 2016) based on GPU has achieved a qualitative leap compared with previous methods in electrostatic interactions computation, the computation capability is limited to the throughput capacity of a single GPU for super-scale simulation system. Therefore, we should look for an effective method to handle the calculation of electrostatic interactions efficiently for a simulation system with super-scale size. Solution method: We constructed a hybrid parallel architecture, in which CPU and GPU are combined to accelerate the electrostatic computation effectively. Firstly, the simulation system is divided into many subtasks via domain-decomposition method. Then MPI (Message Passing Interface) is used to implement the CPU-parallel computation with each computer node corresponding to a particular subtask, and furthermore each subtask in one computer node will be executed in GPU in parallel efficiently. In this hybrid parallel method, the most critical technical problem is how to parallelize a CUNFFT (nonequispaced fast Fourier transform based on CUDA) in the parallel strategy, which is conquered effectively by deep-seated research of basic principles and some algorithm skills. Restrictions: The HP-ENUF is mainly oriented to super-scale system simulations, in which the performance superiority is shown adequately. However, for a small simulation system containing less than 106 particles, the mode of multiple computer nodes has no apparent efficiency advantage or even lower efficiency due to the serious network delay among computer nodes, than the mode of single computer node. References: (1) S.-C. Yang, H.-J. Qian, Z.-Y. Lu, Appl. Comput. Harmon. Anal. 2016, http://dx.doi.org/10.1016/j.acha.2016.04.009. (2) S.-C. Yang, Y.-L. Wang, G.-S. Jiao, H.-J. Qian, Z.-Y. Lu, J. Comput. Chem. 37 (2016) 378. (3) S.-C. Yang, Y.-L. Zhu, H.-J. Qian, Z.-Y. Lu, Appl. Chem. Res. Chin. Univ
Zaghi, S.
2014-07-01
OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluids, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within Fortran language (from standard 2003) have been stressed in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc…) by a single object so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows: Programming LanguageOFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising the efficiency; Parallel Frameworks Supported the development of OFF has been also targeted to maximize the computational efficiency: the code is designed to run on shared-memory multi-cores workstations and distributed-memory clusters of shared-memory nodes (supercomputers); the code's parallelization is based on Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms; Usability, Maintenance and Enhancement in order to improve the usability, maintenance and enhancement of the code also the documentation has been carefully taken into account; the documentation is built upon comprehensive comments placed directly into the source files (no external documentation files needed): these comments are parsed by means of doxygen free software producing high quality html and latex documentation pages; the distributed versioning system referred as git
James Wolfer
2015-02-01
Full Text Available Traditionally, topics such as parallel computing, computer graphics, and artificial intelligence have been taught as stand-alone courses in the computing curriculum. Often these are elective courses, limiting the material to the subset of students choosing to take the course. Recently there has been movement to distribute topics across the curriculum in order to ensure that all graduates have been exposed to concepts such as parallel computing. Previous work described an attempt to systematically weave a tapestry of topics into the undergraduate computing curriculum. This paper reviews that work and expands it with representative examples of assignments, demonstrations, and results as well as describing how the tools and examples deployed for these classes have a residual effect on classes such as Comptuer Literacy.
Parallel discrete ordinates algorithms on distributed and common memory systems
Wienke, B.R.; Hiromoto, R.E.; Brickner, R.G.
1987-01-01
The S/sub n/ algorithm employs iterative techniques in solving the linear Boltzmann equation. These methods, both ordered and chaotic, were compared on both the Denelcor HEP and the Intel hypercube. Strategies are linked to the organization and accessibility of memory (common memory versus distributed memory architectures), with common concern for acquisition of global information. Apart from this, the inherent parallelism of the algorithm maps directly onto the two architectures. Results comparing execution times, speedup, and efficiency are based on a representative 16-group (full upscatter and downscatter) sample problem. Calculations were performed on both the Los Alamos National Laboratory (LANL) Denelcor HEP and the LANL Intel hypercube. The Denelcor HEP is a 64-bit multi-instruction, multidate MIMD machine consisting of up to 16 process execution modules (PEMs), each capable of executing 64 processes concurrently. Each PEM can cooperate on a job, or run several unrelated jobs, and share a common global memory through a crossbar switch. The Intel hypercube, on the other hand, is a distributed memory system composed of 128 processing elements, each with its own local memory. Processing elements are connected in a nearest-neighbor hypercube configuration and sharing of data among processors requires execution of explicit message-passing constructs
GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform
Ronglin Jiang
2014-01-01
Full Text Available This paper introduces a (finite difference time domain FDTD code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI and Open Multiprocessing (OpenMP. Since both Central Processing Unit (CPU and Graphics Processing Unit (GPU resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with the ones of MoM. Results show that there is a well agreement between them.
New adaptive differencing strategy in the PENTRAN 3-d parallel Sn code
Sjoden, G.E.; Haghighat, A.
1996-01-01
It is known that three-dimensional (3-D) discrete ordinates (S n ) transport problems require an immense amount of storage and computational effort to solve. For this reason, parallel codes that offer a capability to completely decompose the angular, energy, and spatial domains among a distributed network of processors are required. One such code recently developed is PENTRAN, which iteratively solves 3-D multi-group, anisotropic S n problems on distributed-memory platforms, such as the IBM-SP2. Because large problems typically contain several different material zones with various properties, available differencing schemes should automatically adapt to the transport physics in each material zone. To minimize the memory and message-passing overhead required for massively parallel S n applications, available differencing schemes in an adaptive strategy should also offer reasonable accuracy and positivity, yet require only the zeroth spatial moment of the transport equation; differencing schemes based on higher spatial moments, in spite of their greater accuracy, require at least twice the amount of storage and communication cost for implementation in a massively parallel transport code. This paper discusses a new adaptive differencing strategy that uses increasingly accurate schemes with low parallel memory and communication overhead. This strategy, implemented in PENTRAN, includes a new scheme, exponential directional averaged (EDA) differencing
Leamy, Michael J.; Springer, Adam C.
In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
Experiences in the parallelization of the discrete ordinates method using OpenMP and MPI
Pautz, A. [TUV Hannover/Sachsen-Anhalt e.V. (Germany); Langenbuch, S. [Gesellschaft fur Anlagen- und Reaktorsicherheit (GRS) mbH (Germany)
2003-07-01
The method of Discrete Ordinates is in principle parallelizable to a high degree, since the transport 'mesh sweeps' are mutually independent for all angular directions. However, in the well-known production code Dort such a type of angular domain decomposition has to be done on a spatial line-byline basis, causing the parallelism in the code to be very fine-grained. The construction of scalar fluxes and moments requires a large effort for inter-thread or inter-process communication. We have implemented two different parallelization approaches in Dort: firstly, we have used a shared-memory model suitable for SMP (Symmetric Multiprocessor) machines based on the standard OpenMP. The second approach uses the well-known Message Passing Interface (MPI) to establish communication between parallel processes running in a distributed-memory environment. We investigate the benefits and drawbacks of both models and show first results on performance and scaling behaviour of the parallel Dort code. (authors)
Regional-scale calculation of the LS factor using parallel processing
Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong
2015-05-01
With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategy are designed according to the algorithm characters including the decomposition method for maintaining the integrity of the results, optimized workflow for reducing the time taken for exporting the unnecessary intermediate data and a buffer-communication-computation strategy for improving the communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.
Experiences in the parallelization of the discrete ordinates method using OpenMP and MPI
Pautz, A.; Langenbuch, S.
2003-01-01
The method of Discrete Ordinates is in principle parallelizable to a high degree, since the transport 'mesh sweeps' are mutually independent for all angular directions. However, in the well-known production code Dort such a type of angular domain decomposition has to be done on a spatial line-byline basis, causing the parallelism in the code to be very fine-grained. The construction of scalar fluxes and moments requires a large effort for inter-thread or inter-process communication. We have implemented two different parallelization approaches in Dort: firstly, we have used a shared-memory model suitable for SMP (Symmetric Multiprocessor) machines based on the standard OpenMP. The second approach uses the well-known Message Passing Interface (MPI) to establish communication between parallel processes running in a distributed-memory environment. We investigate the benefits and drawbacks of both models and show first results on performance and scaling behaviour of the parallel Dort code. (authors)
Xyce parallel electronic simulator : users' guide. Version 5.1.
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.; Santarelli, Keith R.; Fixel, Deborah A.; Coffey, Todd Stirling; Russo, Thomas V.; Schiek, Richard Louis; Keiter, Eric Richard; Pawlowski, Roger Patrick
2009-11-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a
Xyce Parallel Electronic Simulator : users' guide, version 4.1.
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.; Santarelli, Keith R.; Fixel, Deborah A.; Coffey, Todd Stirling; Russo, Thomas V.; Schiek, Richard Louis; Keiter, Eric Richard; Pawlowski, Roger Patrick
2009-02-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a
Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F
2002-02-01
This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
A pervasive parallel framework for visualization: final report for FWP 10-014707
Moreland, Kenneth D. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2014-01-01
We are on the threshold of a transformative change in the basic architecture of highperformance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much more fine thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.
Dubois, J.
2011-01-01
In science, simulation is a key process for research or validation. Modern computer technology allows faster numerical experiments, which are cheaper than real models. In the field of neutron simulation, the calculation of eigenvalues is one of the key challenges. The complexity of these problems is such that a lot of computing power may be necessary. The work of this thesis is first the evaluation of new computing hardware such as graphics card or massively multi-core chips, and their application to eigenvalue problems for neutron simulation. Then, in order to address the massive parallelism of supercomputers national, we also study the use of asynchronous hybrid methods for solving eigenvalue problems with this very high level of parallelism. Then we experiment the work of this research on several national supercomputers such as the Titane hybrid machine of the Computing Center, Research and Technology (CCRT), the Curie machine of the Very Large Computing Centre (TGCC), currently being installed, and the Hopper machine at the Lawrence Berkeley National Laboratory (LBNL). We also do our experiments on local workstations to illustrate the interest of this research in an everyday use with local computing resources. (author) [fr
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J
2004-09-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J.
2004-01-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1x10 8 or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8x10 8 histories. For a smaller number of histories (1x10 8 ) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1x10 8 histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy
Efficient parallel implicit methods for rotary-wing aerodynamics calculations
Wissink, Andrew M.
Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good
Computationally efficient implementation of combustion chemistry in parallel PDF calculations
Lu Liuyan; Lantz, Steven R.; Ren Zhuyin; Pope, Stephen B.
2009-01-01
In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f m pi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel
The numerical parallel computing of photon transport
Huang Qingnan; Liang Xiaoguang; Zhang Lifa
1998-12-01
The parallel computing of photon transport is investigated, the parallel algorithm and the parallelization of programs on parallel computers both with shared memory and with distributed memory are discussed. By analyzing the inherent law of the mathematics and physics model of photon transport according to the structure feature of parallel computers, using the strategy of 'to divide and conquer', adjusting the algorithm structure of the program, dissolving the data relationship, finding parallel liable ingredients and creating large grain parallel subtasks, the sequential computing of photon transport into is efficiently transformed into parallel and vector computing. The program was run on various HP parallel computers such as the HY-1 (PVP), the Challenge (SMP) and the YH-3 (MPP) and very good parallel speedup has been gotten
Design and Programming for Cable-Driven Parallel Robots in the German Pavilion at the EXPO 2015
Philipp Tempel
2015-08-01
Full Text Available In the German Pavilion at the EXPO 2015, two large cable-driven parallel robots are flying over the heads of the visitors representing two bees flying over Germany and displaying everyday life in Germany. Each robot consists of a mobile platform and eight cables suspended by winches and follows a desired trajectory, which needs to be computed in advance taking technical limitations, safety considerations and visual aspects into account. In this paper, a path planning software is presented, which includes the design process from developing a robot design and workspace estimation via planning complex trajectories considering technical limitations through to exporting a complete show. For a test trajectory, simulation results are given, which display the relevant trajectories and cable force distributions.
Kirk, B.L.; Azmy, Y.Y.
1992-01-01
In this paper the one-group, steady-state neutron diffusion equation in two-dimensional Cartesian geometry is solved using the nodal integral method. The discrete variable equations comprise loosely coupled sets of equations representing the nodal balance of neutrons, as well as neutron current continuity along rows or columns of computational cells. An iterative algorithm that is more suitable for solving large problems concurrently is derived based on the decomposition of the spatial domain and is accelerated using successive overrelaxation. This algorithm is very well suited for parallel computers, especially since the spatial domain decomposition occurs naturally, so that the number of iterations required for convergence does not depend on the number of processors participating in the calculation. Implementation of the authors' algorithm on the Intel iPSC/2 hypercube and Sequent Balance 8000 parallel computer is presented, and measured speedup and efficiency for test problems are reported. The results suggest that the efficiency of the hypercube quickly deteriorates when many processors are used, while the Sequent Balance retains very high efficiency for a comparable number of participating processors. This leads to the conjecture that message-passing parallel computers are not as well suited for this algorithm as shared-memory machines
De Ladurantaye, Vincent; Lavoie, Jean; Bergeron, Jocelyn; Parenteau, Maxime; Lu Huizhong; Pichevar, Ramin; Rouat, Jean
2012-01-01
A parallel implementation of a large spiking neural network is proposed and evaluated. The neural network implements the binding by synchrony process using the Oscillatory Dynamic Link Matcher (ODLM). Scalability, speed and performance are compared for 2 implementations: Message Passing Interface (MPI) and Compute Unified Device Architecture (CUDA) running on clusters of multicore supercomputers and NVIDIA graphical processing units respectively. A global spiking list that represents at each instant the state of the neural network is described. This list indexes each neuron that fires during the current simulation time so that the influence of their spikes are simultaneously processed on all computing units. Our implementation shows a good scalability for very large networks. A complex and large spiking neural network has been implemented in parallel with success, thus paving the road towards real-life applications based on networks of spiking neurons. MPI offers a better scalability than CUDA, while the CUDA implementation on a GeForce GTX 285 gives the best cost to performance ratio. When running the neural network on the GTX 285, the processing speed is comparable to the MPI implementation on RQCHP's Mammouth parallel with 64 notes (128 cores).
I/O Parallelization for the Goddard Earth Observing System Data Assimilation System (GEOS DAS)
Lucchesi, Rob; Sawyer, W.; Takacs, L. L.; Lyster, P.; Zero, J.
1998-01-01
The National Aeronautics and Space Administration (NASA) Data Assimilation Office (DAO) at the Goddard Space Flight Center (GSFC) has developed the GEOS DAS, a data assimilation system that provides production support for NASA missions and will support NASA's Earth Observing System (EOS) in the coming years. The GEOS DAS will be used to provide background fields of meteorological quantities to EOS satellite instrument teams for use in their data algorithms as well as providing assimilated data sets for climate studies on decadal time scales. The DAO has been involved in prototyping parallel implementations of the GEOS DAS for a number of years and is now embarking on an effort to convert the production version from shared-memory parallelism to distributed-memory parallelism using the portable Message-Passing Interface (MPI). The GEOS DAS consists of two main components, an atmospheric General Circulation Model (GCM) and a Physical-space Statistical Analysis System (PSAS). The GCM operates on data that are stored on a regular grid while PSAS works with observational data that are scattered irregularly throughout the atmosphere. As a result, the two components have different data decompositions. The GCM is decomposed horizontally as a checkerboard with all vertical levels of each box existing on the same processing element(PE). The dynamical core of the GCM can also operate on a rotated grid, which requires communication-intensive grid transformations during GCM integration. PSAS groups observations on PEs in a more irregular and dynamic fashion.
Stankovski, Z.
1995-01-01
The collision probability method in neutron transport, as applied to 2D geometries, consume a great amount of computer time, for a typical 2D assembly calculation about 90% of the computing time is consumed in the collision probability evaluations. Consequently RZ or 3D calculations became prohibitive. In this paper the author presents a simple but efficient parallel algorithm based on the message passing host/node programmation model. Parallelization was applied to the energy group treatment. Such approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, which is a necessary condition for a industrial code. Sequential performances are also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SPI and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SPI. Because of heterogeneity of the workstation network, the author did not ask high performances for this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors
Gianluca, Longoni; Alireza, Haghighat [Florida University, Nuclear and Radiological Engineering Department, Gainesville, FL (United States)
2003-07-01
In recent years, the SP{sub L} (simplified spherical harmonics) equations have received renewed interest for the simulation of nuclear systems. We have derived the SP{sub L} equations starting from the even-parity form of the S{sub N} equations. The SP{sub L} equations form a system of (L+1)/2 second order partial differential equations that can be solved with standard iterative techniques such as the Conjugate Gradient (CG). We discretized the SP{sub L} equations with the finite-volume approach in a 3-D Cartesian space. We developed a new 3-D general code, Pensp{sub L} (Parallel Environment Neutral-particle SP{sub L}). Pensp{sub L} solves both fixed source and criticality eigenvalue problems. In order to optimize the memory management, we implemented a Compressed Diagonal Storage (CDS) to store the SP{sub L} matrices. Pensp{sub L} includes parallel algorithms for space and moment domain decomposition. The computational load is distributed on different processors, using a mapping function, which maps the 3-D Cartesian space and moments onto processors. The code is written in Fortran 90 using the Message Passing Interface (MPI) libraries for the parallel implementation of the algorithm. The code has been tested on the Pcpen cluster and the parallel performance has been assessed in terms of speed-up and parallel efficiency. (author)
Parallel computing: numerics, applications, and trends
Trobec, Roman; Vajteršic, Marián; Zinterhof, Peter
2009-01-01
... and/or distributed systems. The contributions to this book are focused on topics most concerned in the trends of today's parallel computing. These range from parallel algorithmics, programming, tools, network computing to future parallel computing. Particular attention is paid to parallel numerics: linear algebra, differential equations, numerica...
Robert W. Numrich
2008-04-22
The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a programto provide access to the co-array model through access to the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to
McCallum, Ethan
2011-01-01
It's tough to argue with R as a high-quality, cross-platform, open source statistical software product-unless you're in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You'll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don't. With these packages, you can overcome R's single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R's memory barrier.
Overview of the Force Scientific Parallel Language
Gita Alaghband
1994-01-01
Full Text Available The Force parallel programming language designed for large-scale shared-memory multiprocessors is presented. The language provides a number of parallel constructs as extensions to the ordinary Fortran language and is implemented as a two-level macro preprocessor to support portability across shared memory multiprocessors. The global parallelism model on which the Force is based provides a powerful parallel language. The parallel constructs, generic synchronization, and freedom from process management supported by the Force has resulted in structured parallel programs that are ported to the many multiprocessors on which the Force is implemented. Two new parallel constructs for looping and functional decomposition are discussed. Several programming examples to illustrate some parallel programming approaches using the Force are also presented.
Samaké, Abdoulaye; Rampal, Pierre; Bouillon, Sylvain; Ólason, Einar
2017-12-01
We present a parallel implementation framework for a new dynamic/thermodynamic sea-ice model, called neXtSIM, based on the Elasto-Brittle rheology and using an adaptive mesh. The spatial discretisation of the model is done using the finite-element method. The temporal discretisation is semi-implicit and the advection is achieved using either a pure Lagrangian scheme or an Arbitrary Lagrangian Eulerian scheme (ALE). The parallel implementation presented here focuses on the distributed-memory approach using the message-passing library MPI. The efficiency and the scalability of the parallel algorithms are illustrated by the numerical experiments performed using up to 500 processor cores of a cluster computing system. The performance obtained by the proposed parallel implementation of the neXtSIM code is shown being sufficient to perform simulations for state-of-the-art sea ice forecasting and geophysical process studies over geographical domain of several millions squared kilometers like the Arctic region.
Kostin, Mikhail [Michigan State Univ., East Lansing, MI (United States); Mokhov, Nikolai [Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States); Niita, Koji [Research Organization for Information Science and Technology, Ibaraki-ken (Japan)
2013-09-25
A parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran77, Fortran 90 or C. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA and MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
Renxin Xiao
2018-01-01
Full Text Available This paper proposes a comparison study of energy management methods for a parallel plug-in hybrid electric vehicle (PHEV. Based on detailed analysis of the vehicle driveline, quadratic convex functions are presented to describe the nonlinear relationship between engine fuel-rate and battery charging power at different vehicle speed and driveline power demand. The engine-on power threshold is estimated by the simulated annealing (SA algorithm, and the battery power command is achieved by convex optimization with target of improving fuel economy, compared with the dynamic programming (DP based method and the charging depleting–charging sustaining (CD/CS method. In addition, the proposed control methods are discussed at different initial battery state of charge (SOC values to extend the application. Simulation results validate that the proposed strategy based on convex optimization can save the fuel consumption and reduce the computation burden obviously.
Patterns for Parallel Software Design
Ortega-Arjona, Jorge Luis
2010-01-01
Essential reading to understand patterns for parallel programming Software patterns have revolutionized the way we think about how software is designed, built, and documented, and the design of parallel software requires you to consider other particular design aspects and special skills. From clusters to supercomputers, success heavily depends on the design skills of software developers. Patterns for Parallel Software Design presents a pattern-oriented software architecture approach to parallel software design. This approach is not a design method in the classic sense, but a new way of managin
Domain decomposition parallel computing for transient two-phase flow of nuclear reactors
Lee, Jae Ryong; Yoon, Han Young [KAERI, Daejeon (Korea, Republic of); Choi, Hyoung Gwon [Seoul National University, Seoul (Korea, Republic of)
2016-05-15
KAERI (Korea Atomic Energy Research Institute) has been developing a multi-dimensional two-phase flow code named CUPID for multi-physics and multi-scale thermal hydraulics analysis of Light water reactors (LWRs). The CUPID code has been validated against a set of conceptual problems and experimental data. In this work, the CUPID code has been parallelized based on the domain decomposition method with Message passing interface (MPI) library. For domain decomposition, the CUPID code provides both manual and automatic methods with METIS library. For the effective memory management, the Compressed sparse row (CSR) format is adopted, which is one of the methods to represent the sparse asymmetric matrix. CSR format saves only non-zero value and its position (row and column). By performing the verification for the fundamental problem set, the parallelization of the CUPID has been successfully confirmed. Since the scalability of a parallel simulation is generally known to be better for fine mesh system, three different scales of mesh system are considered: 40000 meshes for coarse mesh system, 320000 meshes for mid-size mesh system, and 2560000 meshes for fine mesh system. In the given geometry, both single- and two-phase calculations were conducted. In addition, two types of preconditioners for a matrix solver were compared: Diagonal and incomplete LU preconditioner. In terms of enhancement of the parallel performance, the OpenMP and MPI hybrid parallel computing for a pressure solver was examined. It is revealed that the scalability of hybrid calculation was enhanced for the multi-core parallel computation.
Reconfigurable multi-DSP parallel computing architecture based on DSM%基于DSM的可重构多DSP并行处理架构
程鑫; 吴华春
2012-01-01
提出一种基于DSM的可在线重构多DSP并行处理架构,采用基于自定义内部总线的信息传递服务,在分布式物理内存上实现了统一编址的共享内存模型,减小了DSP之间的数据传递开销；设计基于VME总线的在线重构来实现针对消息传递服务的重定义,增强了并行计算架构的通用性.实验表明,采用此DSM能减小了并行DSP对共享数据同步访问开销,满足多轴精密同步运动控制系统需求.%A design of reconfigurable multi-digital signal processor (DSP) parallel computing architecture based on distributed shared memory (DSM) was proposed. A message-passing communication based on the user-defined internal bus (IB) was designed to implement a shared memory model on physically distributed memory, which decreased the data transmission overhead. Online reconfiguration mechanism was designed to implement message-passing communication reconfiguration, which in-creasd the universality of parallel architecture. The experiment shows that adopting the DSM introduced can reduce simultaneous access overhead to shared data, which satisfies the requirements of ultra-precise multi-axis motion control system.
Spindler, Esther; Bitar, Nisreen; Solo, Julie; Menstell, Elizabeth; Shattuck, Dominick
2017-12-28
Health practitioners, researchers, and donors are stumped about Jordan's stalled fertility rate, which has stagnated between 3.7 and 3.5 children per woman from 2002 to 2012, above the national replacement level of 2.1. This stall paralleled United States Agency for International Development (USAID) funding investments in family planning in Jordan, triggering an assessment of USAID family planning programming in Jordan. This article describes the methods, results, and implications of the programmatic assessment. Methods included an extensive desk review of USAID programs in Jordan and 69 interviews with reproductive health stakeholders. We explored reasons for fertility stagnation in Jordan's total fertility rate (TFR) and assessed the effects of USAID programming on family planning outcomes over the same time period. The assessment results suggest that the increased use of less effective methods, in particular withdrawal and condoms, are contributing to Jordan's TFR stall. Jordan's limited method mix, combined with strong sociocultural determinants around reproduction and fertility desires, have contributed to low contraceptive effectiveness in Jordan. Over the same time period, USAID contributions toward increasing family planning access and use, largely focused on service delivery programs, were extensive. Examples of effective initiatives, among others, include task shifting of IUD insertion services to midwives due to a shortage of female physicians. However, key challenges to improved use of family planning services include limited government investments in family planning programs, influential service provider behaviors and biases that limit informed counseling and choice, pervasive strong social norms of family size and fertility, and limited availability of different contraceptive methods. In contexts where sociocultural norms and a limited method mix are the dominant barriers toward improved family planning use, increased national government investments
High-performance computing — an overview
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
The parallel adult education system
Wahlgren, Bjarne
2015-01-01
for competence development. The Danish university educational system includes two parallel programs: a traditional academic track (candidatus) and an alternative practice-based track (master). The practice-based program was established in 2001 and organized as part time. The total program takes half the time...
Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing
Amooie, M. A.; Moortgat, J.
2017-12-01
We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with fast Quad Core 1.2GHz ARMv8 64bit processor, 1GB of RAM, and 32GB microSD card for local storage. Therefore, the cluster has a total RAM of 128GB that is distributed on the individual nodes and a flash capacity of 4TB with 512 processors, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between each node. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance-computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively-parallelized scalable code. We present benchmarking results for the computational performance across various number of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and a feasible learning platform for challenging engineering and scientific problems.
Three-Dimensional Induced Polarization Parallel Inversion Using Nonlinear Conjugate Gradients Method
Huan Ma
2015-01-01
Full Text Available Four kinds of array of induced polarization (IP methods (surface, borehole-surface, surface-borehole, and borehole-borehole are widely used in resource exploration. However, due to the presence of large amounts of the sources, it will take much time to complete the inversion. In the paper, a new parallel algorithm is described which uses message passing interface (MPI and graphics processing unit (GPU to accelerate 3D inversion of these four methods. The forward finite differential equation is solved by ILU0 preconditioner and the conjugate gradient (CG solver. The inverse problem is solved by nonlinear conjugate gradients (NLCG iteration which is used to calculate one forward and two “pseudo-forward” modelings and update the direction, space, and model in turn. Because each source is independent in forward and “pseudo-forward” modelings, multiprocess modes are opened by calling MPI library. The iterative matrix solver within CULA is called in each process. Some tables and synthetic data examples illustrate that this parallel inversion algorithm is effective. Furthermore, we demonstrate that the joint inversion of surface and borehole data produces resistivity and chargeability results are superior to those obtained from inversions of individual surface data.
An Efficient Parallel Multi-Scale Segmentation Method for Remote Sensing Imagery
Haiyan Gu
2018-04-01
Full Text Available Remote sensing (RS image segmentation is an essential step in geographic object-based image analysis (GEOBIA to ultimately derive “meaningful objects”. While many segmentation methods exist, most of them are not efficient for large data sets. Thus, the goal of this research is to develop an efficient parallel multi-scale segmentation method for RS imagery by combining graph theory and the fractal net evolution approach (FNEA. Specifically, a minimum spanning tree (MST algorithm in graph theory is proposed to be combined with a minimum heterogeneity rule (MHR algorithm that is used in FNEA. The MST algorithm is used for the initial segmentation while the MHR algorithm is used for object merging. An efficient implementation of the segmentation strategy is presented using data partition and the “reverse searching-forward processing” chain based on message passing interface (MPI parallel technology. Segmentation results of the proposed method using images from multiple sensors (airborne, SPECIM AISA EAGLE II, WorldView-2, RADARSAT-2 and different selected landscapes (residential/industrial, residential/agriculture covering four test sites indicated its efficiency in accuracy and speed. We conclude that the proposed method is applicable and efficient for the segmentation of a variety of RS imagery (airborne optical, satellite optical, SAR, high-spectral, while the accuracy is comparable with that of the FNEA method.
James G. Worner
2017-05-01
Full Text Available James Worner is an Australian-based writer and scholar currently pursuing a PhD at the University of Technology Sydney. His research seeks to expose masculinities lost in the shadow of Australia’s Anzac hegemony while exploring new opportunities for contemporary historiography. He is the recipient of the Doctoral Scholarship in Historical Consciousness at the university’s Australian Centre of Public History and will be hosted by the University of Bologna during 2017 on a doctoral research writing scholarship. ‘Parallel Lines’ is one of a collection of stories, The Shapes of Us, exploring liminal spaces of modern life: class, gender, sexuality, race, religion and education. It looks at lives, like lines, that do not meet but which travel in proximity, simultaneously attracted and repelled. James’ short stories have been published in various journals and anthologies.
Parallelization of the Physical-Space Statistical Analysis System (PSAS)
Larson, J. W.; Guo, J.; Lyster, P. M.
1999-01-01
Atmospheric data assimilation is a method of combining observations with model forecasts to produce a more accurate description of the atmosphere than the observations or forecast alone can provide. Data assimilation plays an increasingly important role in the study of climate and atmospheric chemistry. The NASA Data Assimilation Office (DAO) has developed the Goddard Earth Observing System Data Assimilation System (GEOS DAS) to create assimilated datasets. The core computational components of the GEOS DAS include the GEOS General Circulation Model (GCM) and the Physical-space Statistical Analysis System (PSAS). The need for timely validation of scientific enhancements to the data assimilation system poses computational demands that are best met by distributed parallel software. PSAS is implemented in Fortran 90 using object-based design principles. The analysis portions of the code solve two equations. The first of these is the "innovation" equation, which is solved on the unstructured observation grid using a preconditioned conjugate gradient (CG) method. The "analysis" equation is a transformation from the observation grid back to a structured grid, and is solved by a direct matrix-vector multiplication. Use of a factored-operator formulation reduces the computational complexity of both the CG solver and the matrix-vector multiplication, rendering the matrix-vector multiplications as a successive product of operators on a vector. Sparsity is introduced to these operators by partitioning the observations using an icosahedral decomposition scheme. PSAS builds a large (approx. 128MB) run-time database of parameters used in the calculation of these operators. Implementing a message passing parallel computing paradigm into an existing yet developing computational system as complex as PSAS is nontrivial. One of the technical challenges is balancing the requirements for computational reproducibility with the need for high performance. The problem of computational
Rizzo, Jon; Bell, Alexandra
2018-05-09
A mental model is the collection of an individual's perceptions, values, and expectations about a particular aspect of their life, which strongly influences behaviors. This study explored orthopedic outpatients mental models of adherence to prescribed home exercise programs and how they related to mental models of adherence to other types of personal regimens. The study followed an interpretive description qualitative design. Data were collected via two semi-structured interviews. Interview One focused on participants prior experiences adhering to personal regimens. Interview Two focused on experiences adhering to their current prescribed home exercise program. Data analysis followed a constant comparative method. Findings revealed similarity in perceptions, values, and expectations that informed individuals mental models of adherence to personal regimens and prescribed home exercise programs. Perceived realized results, expected results, perceived social supports, and value of convenience characterized mental models of adherence. Parallels between mental models of adherence for prescribed home exercise and other personal regimens suggest that patients adherence behavior to prescribed routines may be influenced by adherence experiences in other aspects of their lives. By gaining insight into patients adherence experiences, values, and expectations across life domains, clinicians may tailor supports that enhance home exercise adherence. Implications for Rehabilitation A mental model is the collection of an individual's perceptions, values, and expectations about a particular aspect of their life, which is based on prior experiences and strongly influences behaviors. This study demonstrated similarity in orthopedic outpatients mental models of adherence to prescribed home exercise programs and adherence to personal regimens in other aspects of their lives. Physical therapists should inquire about patients non-medical adherence experiences, as strategies patients
Gladwin, D.; Stewart, P.; Stewart, J.
2011-02-01
This article addresses the problem of maintaining a stable rectified DC output from the three-phase AC generator in a series-hybrid vehicle powertrain. The series-hybrid prime power source generally comprises an internal combustion (IC) engine driving a three-phase permanent magnet generator whose output is rectified to DC. A recent development has been to control the engine/generator combination by an electronically actuated throttle. This system can be represented as a nonlinear system with significant time delay. Previously, voltage control of the generator output has been achieved by model predictive methods such as the Smith Predictor. These methods rely on the incorporation of an accurate system model and time delay into the control algorithm, with a consequent increase in computational complexity in the real-time controller, and as a necessity relies to some extent on the accuracy of the models. Two complementary performance objectives exist for the control system. Firstly, to maintain the IC engine at its optimal operating point, and secondly, to supply a stable DC supply to the traction drive inverters. Achievement of these goals minimises the transient energy storage requirements at the DC link, with a consequent reduction in both weight and cost. These objectives imply constant velocity operation of the IC engine under external load disturbances and changes in both operating conditions and vehicle speed set-points. In order to achieve these objectives, and reduce the complexity of implementation, in this article a controller is designed by the use of Genetic Programming methods in the Simulink modelling environment, with the aim of obtaining a relatively simple controller for the time-delay system which does not rely on the implementation of real time system models or time delay approximations in the controller. A methodology is presented to utilise the miriad of existing control blocks in the Simulink libraries to automatically evolve optimal control
MPIRUN: A Portable Loader for Multidisciplinary and Multi-Zonal Applications
Fineberg, Samuel A.; Woodrow, Thomas S. (Technical Monitor)
1994-01-01
Multidisciplinary and multi-zonal applications are an important class of applications in the area of Computational Aerosciences. In these codes, two or more distinct parallel programs or copies of a single program are utilized to model a single problem. To support such applications, it is common to use a programming model where a program is divided into several single program multiple data stream (SPMD) applications, each of which solves the equations for a single physical discipline or grid zone. These SPMD applications are then bound together to form a single multidisciplinary or multi-zonal program in which the constituent parts communicate via point-to-point message passing routines. One method for implementing the message passing portion of these codes is with the new Message Passing Interface (MPI) standard. Unfortunately, this standard only specifies the message passing portion of an application, but does not specify any portable mechanisms for loading an application. MPIRUN was developed to provide a portable means for loading MPI programs, and was specifically targeted at multidisciplinary and multi-zonal applications. Programs using MPIRUN for loading and MPI for message passing are then portable between all machines supported by MPIRUN. MPIRUN is currently implemented for the Intel iPSC/860, TMC CM5, IBM SP-1 and SP-2, Intel Paragon, and workstation clusters. Further, MPIRUN is designed to be simple enough to port easily to any system supporting MPI.
Compiler Technology for Parallel Scientific Computation
Can Özturan
1994-01-01
Full Text Available There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL. Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.
Daniel P Denning
Full Text Available Caspases are cysteine proteases that can drive apoptosis in metazoans and have critical functions in the elimination of cells during development, the maintenance of tissue homeostasis, and responses to cellular damage. Although a growing body of research suggests that programmed cell death can occur in the absence of caspases, mammalian studies of caspase-independent apoptosis are confounded by the existence of at least seven caspase homologs that can function redundantly to promote cell death. Caspase-independent programmed cell death is also thought to occur in the invertebrate nematode Caenorhabditis elegans. The C. elegans genome contains four caspase genes (ced-3, csp-1, csp-2, and csp-3, of which only ced-3 has been demonstrated to promote apoptosis. Here, we show that CSP-1 is a pro-apoptotic caspase that promotes programmed cell death in a subset of cells fated to die during C. elegans embryogenesis. csp-1 is expressed robustly in late pachytene nuclei of the germline and is required maternally for its role in embryonic programmed cell deaths. Unlike CED-3, CSP-1 is not regulated by the APAF-1 homolog CED-4 or the BCL-2 homolog CED-9, revealing that csp-1 functions independently of the canonical genetic pathway for apoptosis. Previously we demonstrated that embryos lacking all four caspases can eliminate cells through an extrusion mechanism and that these cells are apoptotic. Extruded cells differ from cells that normally undergo programmed cell death not only by being extruded but also by not being engulfed by neighboring cells. In this study, we identify in csp-3; csp-1; csp-2 ced-3 quadruple mutants apoptotic cell corpses that fully resemble wild-type cell corpses: these caspase-deficient cell corpses are morphologically apoptotic, are not extruded, and are internalized by engulfing cells. We conclude that both caspase-dependent and caspase-independent pathways promote apoptotic programmed cell death and the phagocytosis of cell
Expressing Parallelism with ROOT
Piparo, D. [CERN; Tejedor, E. [CERN; Guiraud, E. [CERN; Ganis, G. [CERN; Mato, P. [CERN; Moneta, L. [CERN; Valls Pla, X. [CERN; Canal, P. [Fermilab
2017-11-22
The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.
Expressing Parallelism with ROOT
Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.
2017-10-01
The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.
Chen, C.M.; Lee, S.Y.
1995-01-01
The EM algorithm promises an estimated image with the maximal likelihood for 3D PET image reconstruction. However, due to its long computation time, the EM algorithm has not been widely used in practice. While several parallel implementations of the EM algorithm have been developed to make the EM algorithm feasible, they do not guarantee an optimal parallelization efficiency. In this paper, the authors propose a new parallel EM algorithm which maximizes the performance by optimizing data replication on a mesh-connected message-passing multiprocessor. To optimize data replication, the authors have formally derived the optimal allocation of shared data, group sizes, integration and broadcasting of replicated data as well as the scheduling of shared data accesses. The proposed parallel EM algorithm has been implemented on an iPSC/860 with 16 PEs. The experimental and theoretical results, which are consistent with each other, have shown that the proposed parallel EM algorithm could improve performance substantially over those using unoptimized data replication
An Introduction to Parallel Computation R
How are they programmed? This article provides an introduction. A parallel computer is a network of processors built for ... and have been used to solve problems much faster than a single ... in parallel computer design is to select an organization which ..... The most ambitious approach to parallel computing is to develop.
Exploiting Symmetry on Parallel Architectures.
Stiller, Lewis Benjamin
1995-01-01
This thesis describes techniques for the design of parallel programs that solve well-structured problems with inherent symmetry. Part I demonstrates the reduction of such problems to generalized matrix multiplication by a group-equivariant matrix. Fast techniques for this multiplication are described, including factorization, orbit decomposition, and Fourier transforms over finite groups. Our algorithms entail interaction between two symmetry groups: one arising at the software level from the problem's symmetry and the other arising at the hardware level from the processors' communication network. Part II illustrates the applicability of our symmetry -exploitation techniques by presenting a series of case studies of the design and implementation of parallel programs. First, a parallel program that solves chess endgames by factorization of an associated dihedral group-equivariant matrix is described. This code runs faster than previous serial programs, and discovered it a number of results. Second, parallel algorithms for Fourier transforms for finite groups are developed, and preliminary parallel implementations for group transforms of dihedral and of symmetric groups are described. Applications in learning, vision, pattern recognition, and statistics are proposed. Third, parallel implementations solving several computational science problems are described, including the direct n-body problem, convolutions arising from molecular biology, and some communication primitives such as broadcast and reduce. Some of our implementations ran orders of magnitude faster than previous techniques, and were used in the investigation of various physical phenomena.
S. CHU; S. ELLIOTT
2000-08-01
Ecodynamics and the sea-air transfer of climate relevant trace gases are intimately coupled in the oceanic mixed layer. Ventilation of species such as dimethyl sulfide and methyl bromide constitutes a key linkage within the earth system. We are creating a research tool for the study of marine trace gas distributions by implementing coupled ecology-gas chemistry in the high resolution Parallel Ocean Program (POP). The fundamental circulation model is eddy resolving, with cell sizes averaging 0.15 degree (lat/long). Here we describe ecochemistry integration. Density dependent mortality and iron geochemistry have enhanced agreement with chlorophyll measurements. Indications are that dimethyl sulfide production rates must be adjusted for latitude dependence to match recent compilations. This may reflect the need for phytoplankton to conserve nitrogen by favoring sulfurous osmolytes. Global simulations are also available for carbonyl sulfide, the methyl halides and for nonmethane hydrocarbons. We discuss future applications including interaction with atmospheric chemistry models, high resolution biogeochemical snapshots and the study of open ocean fertilization.
A PARALLEL MONTE CARLO CODE FOR SIMULATING COLLISIONAL N-BODY SYSTEMS
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A., E-mail: bharath@u.northwestern.edu [Center for Interdisciplinary Exploration and Research in Astrophysics, Northwestern University, Evanston, IL (United States)
2013-02-15
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N {approx} 10{sup 7} particles. Our code is based on the Henon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 10{sup 5} to 10{sup 7}. We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within {approx}< 0.04% throughout all simulations. We analyze the performance of the code, and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N = 10{sup 5}, 128 for N = 10{sup 6} and 256 for N = 10{sup 7}. The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60 Multiplication-Sign , 100 Multiplication-Sign , and 220 Multiplication-Sign , respectively.
A PARALLEL MONTE CARLO CODE FOR SIMULATING COLLISIONAL N-BODY SYSTEMS
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A.
2013-01-01
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N ∼ 10 7 particles. Our code is based on the Hénon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 10 5 to 10 7 . We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within ∼ 5 , 128 for N = 10 6 and 256 for N = 10 7 . The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60×, 100×, and 220×, respectively.
Scalable High Performance Message Passing over InfiniBand for Open MPI
Friedley, A; Hoefler, T; Leininger, M L; Lumsdaine, A
2007-10-24
InfiniBand (IB) is a popular network technology for modern high-performance computing systems. MPI implementations traditionally support IB using a reliable, connection-oriented (RC) transport. However, per-process resource usage that grows linearly with the number of processes, makes this approach prohibitive for large-scale systems. IB provides an alternative in the form of a connectionless unreliable datagram transport (UD), which allows for near-constant resource usage and initialization overhead as the process count increases. This paper describes a UD-based implementation for IB in Open MPI as a scalable alternative to existing RC-based schemes. We use the software reliability capabilities of Open MPI to provide the guaranteed delivery semantics required by MPI. Results show that UD not only requires fewer resources at scale, but also allows for shorter MPI startup times. A connectionless model also improves performance for applications that tend to send small messages to many different processes.
Message-Passing Receiver for OFDM Systems over Highly Delay-Dispersive Channels
Barbu, Oana-Elena; Manchón, Carles Navarro; Rom, Christian
2017-01-01
Propagation channels with maximum excess delay exceeding the duration of the cyclic prefix (CP) in OFDM systems cause intercarrier and intersymbol interference which, unless accounted for, degrade the receiver performance. Using tools from Bayesian inference and sparse signal reconstruction, we...... derive an iterative algorithm that estimates an approximate representation of the channel impulse response and the noise variance, estimates and cancels the intrinsic interference and decodes the data over a block of symbols. Simulation results show that the receiver employing our algorithm outperforms...
A model based message passing approach for flexible and scalable home automation controllers
Bienhaus, D. [INNIAS GmbH und Co. KG, Frankenberg (Germany); David, K.; Klein, N.; Kroll, D. [ComTec Kassel Univ., SE Kassel Univ. (Germany); Heerdegen, F.; Jubeh, R.; Zuendorf, A. [Kassel Univ. (Germany). FG Software Engineering; Hofmann, J. [BSC Computer GmbH, Allendorf (Germany)
2012-07-01
There is a large variety of home automation systems that are largely proprietary systems from different vendors. In addition, the configuration and administration of home automation systems is frequently a very complex task especially, if more complex functionality shall be achieved. Therefore, an open model for home automation was developed that is especially designed for easy integration of various home automation systems. This solution also provides a simple modeling approach that is inspired by typical home automation components like switches, timers, etc. In addition, a model based technology to achieve rich functionality and usability was implemented. (orig.)
Xyce Parallel Electronic Simulator - User's Guide, Version 1.0
HUTCHINSON, SCOTT A; KEITER, ERIC R.; HOEKSTRA, ROBERT J.; WATERS, LON J.; RUSSO, THOMAS V.; RANKIN, ERIC LAMONT; WIX, STEVEN D.
2002-11-01
This manual describes the use of the Xyce Parallel Electronic Simulator code for simulating electrical circuits at a variety of abstraction levels. The Xyce Parallel Electronic Simulator has been written to support,in a rigorous manner, the simulation needs of the Sandia National Laboratories electrical designers. As such, the development has focused on improving the capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) A client-server or multi-tiered operating model wherein the numerical kernel can operate independently of the graphical user interface (GUI). (4) Object-oriented code design and implementation using modern coding-practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. The code is a parallel code in the most general sense of the phrase--a message passing parallel implementation--which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Furthermore, careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved even as the number of processors grows. Another feature required by designers is the ability to add device models, many specific to the needs of Sandia, to the code. To this end, the device package in the Xyce Parallel Electronic Simulator is designed to support a variety of device model inputs. These input formats include standard analytical models, behavioral models
MINUIT package parallelization and applications using the RooFit package
Lazzaro, Alfio; Moneta, Lorenzo
2010-01-01
The fitting procedures are based on numerical minimization of functions. The MINUIT package is the most common package used for such procedures in High Energy Physics community. The main algorithm in this package, MIGRAD, searches the minimum of a function using the gradient information. For each minimization iteration, MIGRAD requires the calculation of the derivative for each free parameter of the function to be minimized. Minimization is required for data analysis problems based on the maximum likelihood technique. The calculation of complex likelihood functions, with several free parameters, many independent variables and large data samples, can be very CPU-time consuming. Then, the minimization process requires the calculation of the likelihood functions several times for each minimization iteration. In this paper we will show how MIGRAD algorithm and the likelihood function calculation can be easily parallelized using Message Passing Interface techniques. We will present the speed-up improvements obtained in typical physics applications such as complex maximum likelihood fits using the RooFit package.
Lartillot, Nicolas; Rodrigue, Nicolas; Stubbs, Daniel; Richer, Jacques
2013-07-01
Modeling across site variation of the substitution process is increasingly recognized as important for obtaining more accurate phylogenetic reconstructions. Both finite and infinite mixture models have been proposed and have been shown to significantly improve on classical single-matrix models. Compared with their finite counterparts, infinite mixtures have a greater expressivity. However, they are computationally more challenging. This has resulted in practical compromises in the design of infinite mixture models. In particular, a fast but simplified version of a Dirichlet process model over equilibrium frequency profiles implemented in PhyloBayes has often been used in recent phylogenomics studies, while more refined model structures, more realistic and empirically more fit, have been practically out of reach. We introduce a message passing interface version of PhyloBayes, implementing the Dirichlet process mixture models as well as more classical empirical matrices and finite mixtures. The parallelization is made efficient thanks to the combination of two algorithmic strategies: a partial Gibbs sampling update of the tree topology and the use of a truncated stick-breaking representation for the Dirichlet process prior. The implementation shows close to linear gains in computational speed for up to 64 cores, thus allowing faster phylogenetic reconstruction under complex mixture models. PhyloBayes MPI is freely available from our website www.phylobayes.org.
Xyce Parallel Electronic Simulator : users' guide, version 2.0.
Hoekstra, Robert John; Waters, Lon J.; Rankin, Eric Lamont; Fixel, Deborah A.; Russo, Thomas V.; Keiter, Eric Richard; Hutchinson, Scott Alan; Pawlowski, Roger Patrick; Wix, Steven D.
2004-06-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator capable of simulating electrical circuits at a variety of abstraction levels. Primarily, Xyce has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability the current state-of-the-art in the following areas: {sm_bullet} Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. {sm_bullet} Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. {sm_bullet} Device models which are specifically tailored to meet Sandia's needs, including many radiation-aware devices. {sm_bullet} A client-server or multi-tiered operating model wherein the numerical kernel can operate independently of the graphical user interface (GUI). {sm_bullet} Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing of computing platforms. These include serial, shared-memory and distributed-memory parallel implementation - which allows it to run efficiently on the widest possible number parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. One feature required by designers is the ability to add device models, many specific to the needs of Sandia, to the code. To this end, the device package in the Xyce
Comparison of 250 MHz R10K Origin 2000 and 400 MHz Origin 2000 Using NAS Parallel Benchmarks
Turney, Raymond D.; Thigpen, William W. (Technical Monitor)
2001-01-01
This report describes results of benchmark tests on Steger, a 250 MHz Origin 2000 system with R10K processors, currently installed at the NASA Ames National Advanced Supercomputing (NAS) facility. For comparison purposes, the tests were also run on Lomax, a 400 MHz Origin 2000 with R12K processors. The BT, LU, and SP application benchmarks in the NAS Parallel Benchmark Suite and the kernel benchmark FT were chosen to measure system performance. Having been written to measure performance on Computational Fluid Dynamics applications, these benchmarks are assumed appropriate to represent the NAS workload. Since the NAS runs both message passing (MPI) and shared-memory, compiler directive type codes, both MPI and OpenMP versions of the benchmarks were used. The MPI versions used were the latest official release of the NAS Parallel Benchmarks, version 2.3. The OpenMP versions used were PBN3b2, a beta version that is in the process of being released. NPB 2.3 and PBN3b2 are technically different benchmarks, and NPB results are not directly comparable to PBN results.
Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai
1998-01-01
This paper considers an algebraic preconditioning algorithm for hyperbolic-elliptic fluid flow problems. The algorithm is based on a parallel non-overlapping Schur complement domain-decomposition technique for triangulated domains. In the Schur complement technique, the triangulation is first partitioned into a number of non-overlapping subdomains and interfaces. This suggests a reordering of triangulation vertices which separates subdomain and interface solution unknowns. The reordering induces a natural 2 x 2 block partitioning of the discretization matrix. Exact LU factorization of this block system yields a Schur complement matrix which couples subdomains and the interface together. The remaining sections of this paper present a family of approximate techniques for both constructing and applying the Schur complement as a domain-decomposition preconditioner. The approximate Schur complement serves as an algebraic coarse space operator, thus avoiding the known difficulties associated with the direct formation of a coarse space discretization. In developing Schur complement approximations, particular attention has been given to improving sequential and parallel efficiency of implementations without significantly degrading the quality of the preconditioner. A computer code based on these developments has been tested on the IBM SP2 using MPI message passing protocol. A number of 2-D calculations are presented for both scalar advection-diffusion equations as well as the Euler equations governing compressible fluid flow to demonstrate performance of the preconditioning algorithm.
Parallel algorithms for nuclear reactor analysis via domain decomposition method
Kim, Yong Hee
1995-02-01
In this thesis, the neutron diffusion equation in reactor physics is discretized by the finite difference method and is solved on a parallel computer network which is composed of T-800 transputers. T-800 transputer is a message-passing type MIMD (multiple instruction streams and multiple data streams) architecture. A parallel variant of Schwarz alternating procedure for overlapping subdomains is developed with domain decomposition. The thesis provides convergence analysis and improvement of the convergence of the algorithm. The convergence of the parallel Schwarz algorithms with DN(or ND), DD, NN, and mixed pseudo-boundary conditions(a weighted combination of Dirichlet and Neumann conditions) is analyzed for both continuous and discrete models in two-subdomain case and various underlying features are explored. The analysis shows that the convergence rate of the algorithm highly depends on the pseudo-boundary conditions and the theoretically best one is the mixed boundary conditions(MM conditions). Also it is shown that there may exist a significant discrepancy between continuous model analysis and discrete model analysis. In order to accelerate the convergence of the parallel Schwarz algorithm, relaxation in pseudo-boundary conditions is introduced and the convergence analysis of the algorithm for two-subdomain case is carried out. The analysis shows that under-relaxation of the pseudo-boundary conditions accelerates the convergence of the parallel Schwarz algorithm if the convergence rate without relaxation is negative, and any relaxation(under or over) decelerates convergence if the convergence rate without relaxation is positive. Numerical implementation of the parallel Schwarz algorithm on an MIMD system requires multi-level iterations: two levels for fixed source problems, three levels for eigenvalue problems. Performance of the algorithm turns out to be very sensitive to the iteration strategy. In general, multi-level iterations provide good performance when
Is Monte Carlo embarrassingly parallel?
Hoogenboom, J. E. [Delft Univ. of Technology, Mekelweg 15, 2629 JB Delft (Netherlands); Delft Nuclear Consultancy, IJsselzoom 2, 2902 LB Capelle aan den IJssel (Netherlands)
2012-07-01
Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turn out to be the rendez-vous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)
Is Monte Carlo embarrassingly parallel?
Hoogenboom, J. E.
2012-01-01
Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turn out to be the rendez-vous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)
Towards high-performance symbolic computing using MuPAD as a problem solving environment
Sorgatz, A
1999-01-01
This article discusses the approach of developing MuPAD into an open and parallel problem solving environment for mathematical applications. It introduces the key technologies domains and dynamic modules and describes the current $9 state of macro parallelism which covers three fields of parallel programming: message passing, network variables and work groups. First parallel algorithms and examples of using the prototype of the MuPAD problem solving environment $9 are demonstrated. (12 refs).
On a model of three-dimensional bursting and its parallel implementation
Tabik, S.; Romero, L. F.; Garzón, E. M.; Ramos, J. I.
2008-04-01
A mathematical model for the simulation of three-dimensional bursting phenomena and its parallel implementation are presented. The model consists of four nonlinearly coupled partial differential equations that include fast and slow variables, and exhibits bursting in the absence of diffusion. The differential equations have been discretized by means of a second-order accurate in both space and time, linearly-implicit finite difference method in equally-spaced grids. The resulting system of linear algebraic equations at each time level has been solved by means of the Preconditioned Conjugate Gradient (PCG) method. Three different parallel implementations of the proposed mathematical model have been developed; two of these implementations, i.e., the MPI and the PETSc codes, are based on a message passing paradigm, while the third one, i.e., the OpenMP code, is based on a shared space address paradigm. These three implementations are evaluated on two current high performance parallel architectures, i.e., a dual-processor cluster and a Shared Distributed Memory (SDM) system. A novel representation of the results that emphasizes the most relevant factors that affect the performance of the paralled implementations, is proposed. The comparative analysis of the computational results shows that the MPI and the OpenMP implementations are about twice more efficient than the PETSc code on the SDM system. It is also shown that, for the conditions reported here, the nonlinear dynamics of the three-dimensional bursting phenomena exhibits three stages characterized by asynchronous, synchronous and then asynchronous oscillations, before a quiescent state is reached. It is also shown that the fast system reaches steady state in much less time than the slow variables.
Implementation and performance of parallelized elegant
Wang, Y.; Borland, M.
2008-01-01
The program elegant is widely used for design and modeling of linacs for free-electron lasers and energy recovery linacs, as well as storage rings and other applications. As part of a multi-year effort, we have parallelized many aspects of the code, including single-particle dynamics, wakefields, and coherent synchrotron radiation. We report on the approach used for gradual parallelization, which proved very beneficial in getting parallel features into the hands of users quickly. We also report details of parallelization of collective effects. Finally, we discuss performance of the parallelized code in various applications.
Implementations of BLAST for parallel computers.
Jülich, A
1995-02-01
The BLAST sequence comparison programs have been ported to a variety of parallel computers-the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.
Spot: A Programming Language for Verified Flight Software
Bocchino, Robert L., Jr.; Gamble, Edward; Gostelow, Kim P.; Some, Raphael R.
2014-01-01
The C programming language is widely used for programming space flight software and other safety-critical real time systems. C, however, is far from ideal for this purpose: as is well known, it is both low-level and unsafe. This paper describes Spot, a language derived from C for programming space flight systems. Spot aims to maintain compatibility with existing C code while improving the language and supporting verification with the SPIN model checker. The major features of Spot include actor-based concurrency, distributed state with message passing and transactional updates, and annotations for testing and verification. Spot also supports domain-specific annotations for managing spacecraft state, e.g., communicating telemetry information to the ground. We describe the motivation and design rationale for Spot, give an overview of the design, provide examples of Spot's capabilities, and discuss the current status of the implementation.
Zapata, M. A. Uh; Van Bang, D. Pham; Nguyen, K. D.
2016-05-01
This paper presents a parallel algorithm for the finite-volume discretisation of the Poisson equation on three-dimensional arbitrary geometries. The proposed method is formulated by using a 2D horizontal block domain decomposition and interprocessor data communication techniques with message passing interface. The horizontal unstructured-grid cells are reordered according to the neighbouring relations and decomposed into blocks using a load-balanced distribution to give all processors an equal amount of elements. In this algorithm, two parallel successive over-relaxation methods are presented: a multi-colour ordering technique for unstructured grids based on distributed memory and a block method using reordering index following similar ideas of the partitioning for structured grids. In all cases, the parallel algorithms are implemented with a combination of an acceleration iterative solver. This solver is based on a parabolic-diffusion equation introduced to obtain faster solutions of the linear systems arising from the discretisation. Numerical results are given to evaluate the performances of the methods showing speedups better than linear.
Shared Variable Oriented Parallel Precompiler for SPMD Model
无
1995-01-01
For the moment,commercial parallel computer systems with distributed memory architecture are usually provided with parallel FORTRAN or parallel C compliers,which are just traditional sequential FORTRAN or C compilers expanded with communication statements.Programmers suffer from writing parallel programs with communication statements. The Shared Variable Oriented Parallel Precompiler (SVOPP) proposed in this paper can automatically generate appropriate communication statements based on shared variables for SPMD(Single Program Multiple Data) computation model and greatly ease the parallel programming with high communication efficiency.The core function of parallel C precompiler has been successfully verified on a transputer-based parallel computer.Its prominent performance shows that SVOPP is probably a break-through in parallel programming technique.
Parallel algorithms for continuum dynamics
Hicks, D.L.; Liebrock, L.M.
1987-01-01
Simply porting existing parallel programs to a new parallel processor may not achieve the full speedup possible; to achieve the maximum efficiency may require redesigning the parallel algorithms for the specific architecture. The authors discuss here parallel algorithms that were developed first for the HEP processor and then ported to the CRAY X-MP/4, the ELXSI/10, and the Intel iPSC/32. Focus is mainly on the most recent parallel processing results produced, i.e., those on the Intel Hypercube. The applications are simulations of continuum dynamics in which the momentum and stress gradients are important. Examples of these are inertial confinement fusion experiments, severe breaks in the coolant system of a reactor, weapons physics, shock-wave physics. Speedup efficiencies on the Intel iPSC Hypercube are very sensitive to the ratio of communication to computation. Great care must be taken in designing algorithms for this machine to avoid global communication. This is much more critical on the iPSC than it was on the three previous parallel processors
Xyce Parallel Electronic Simulator Users Guide Version 6.2.
Keiter, Eric R.; Mei, Ting; Russo, Thomas V.; Schiek, Richard Louis; Sholander, Peter E.; Thornquist, Heidi K.; Verley, Jason C.; Baur, David Gregory
2014-09-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. Trademarks The information herein is subject to change without notice. Copyright c 2002-2014 Sandia Corporation. All rights reserved. Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Portions of the Xyce TM code are: Copyright c 2002, The Regents of the University of California. Produced at the Lawrence Livermore National Laboratory. Written by Alan Hindmarsh, Allan Taylor, Radu Serban. UCRL-CODE-2002-59 All rights reserved. Orcad, Orcad Capture, PSpice and Probe are
Xyce Parallel Electronic Simulator Users Guide Version 6.4
Keiter, Eric R. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Mei, Ting [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Russo, Thomas V. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Schiek, Richard [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sholander, Peter E. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Thornquist, Heidi K. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Verley, Jason [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Baur, David Gregory [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2015-12-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. Trademarks The information herein is subject to change without notice. Copyright c 2002-2015 Sandia Corporation. All rights reserved. Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Portions of the Xyce TM code are: Copyright c 2002, The Regents of the University of California. Produced at the Lawrence Livermore National Laboratory. Written by Alan Hindmarsh, Allan Taylor, Radu Serban. UCRL-CODE-2002-59 All rights reserved. Orcad, Orcad Capture, PSpice and Probe are
Xyce Parallel Electronic Simulator Users' Guide Version 6.7.
Keiter, Eric R. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Aadithya, Karthik Venkatraman [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Mei, Ting [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Russo, Thomas V. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Schiek, Richard [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sholander, Peter E. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Thornquist, Heidi K. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Verley, Jason [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2017-05-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright c 2002-2017 Sandia Corporation. All rights reserved. Trademarks Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Orcad, Orcad Capture, PSpice and Probe are registered trademarks of Cadence Design Systems, Inc. Microsoft, Windows and Windows 7 are registered trademarks of Microsoft Corporation. Medici, DaVinci and Taurus are registered trademarks of Synopsys Corporation. Amtec and TecPlot are trademarks of
Soltz, R; Vranas, P; Blumrich, M; Chen, D; Gara, A; Giampap, M; Heidelberger, P; Salapura, V; Sexton, J; Bhanot, G
2007-01-01
The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results
Streaming for Functional Data-Parallel Languages
Madsen, Frederik Meisner
In this thesis, we investigate streaming as a general solution to the space inefficiency commonly found in functional data-parallel programming languages. The data-parallel paradigm maps well to parallel SIMD-style hardware. However, the traditional fully materializing execution strategy...... by extending two existing data-parallel languages: NESL and Accelerate. In the extensions we map bulk operations to data-parallel streams that can evaluate fully sequential, fully parallel or anything in between. By a dataflow, piecewise parallel execution strategy, the runtime system can adjust to any target...... flattening necessitates all sub-computations to materialize at the same time. For example, naive n by n matrix multiplication requires n^3 space in NESL because the algorithm contains n^3 independent scalar multiplications. For large values of n, this is completely unacceptable. We address the problem...
Boekkooi-Timminga, Ellen
Nine methods for automated test construction are described. All are based on the concepts of information from item response theory. Two general kinds of methods for the construction of parallel tests are presented: (1) sequential test design; and (2) simultaneous test design. Sequential design implies that the tests are constructed one after the…
吕忠亭; 张玉强; 崔巍
2013-01-01
探讨了基于OpenMP的电磁场FDTD多核并行程序设计的方法，以期实现该方法在更复杂的算法中应用具有更理想的性能提升。针对一个一维电磁场FDTD算法问题，对其计算方法与过程做了简单描述。在Fortran语言环境中，采用OpenMP+细粒度并行的方式实现了并行化，即只对循环部分进行并行计算，并将该并行方法在一个三维瞬态场电偶极子辐射FDTD程序中进行了验证。该并行算法取得了较其他并行FDTD算法更快的加速比和更高的效率。结果表明基于OpenMP的电磁场FDTD并行算法具有非常好的加速比和效率。%The method of the electromagnetic field FDTD multi-core parallel programm design based on OpenMP is dis-cussed,in order to implement ideal performance improvement of this method in the application of more sophisticated algorithms. Aiming at a problem existing in one-dimensional electromagnetic FDTD algorithm , its calculation method and process are described briefly. In Fortran language environment,the parallelism is achieved with OpenMP technology and fine-grained parallel way,that is,the parallel computation is performed only for the cycle part. The parallel method was verified in a three-dimensional transient electromagnetic field FDTD program for dipole radiation. The parallel algorithm has achieved faster speedup and higher efficiency than other parallel FDTD algoritms. The results indicate that the electromagnetic field FDTD parallel algorithm based on OpenMP has a good speedup and efficiency.
Akl, Selim G
1985-01-01
Parallel Sorting Algorithms explains how to use parallel algorithms to sort a sequence of items on a variety of parallel computers. The book reviews the sorting problem, the parallel models of computation, parallel algorithms, and the lower bounds on the parallel sorting problems. The text also presents twenty different algorithms, such as linear arrays, mesh-connected computers, cube-connected computers. Another example where algorithm can be applied is on the shared-memory SIMD (single instruction stream multiple data stream) computers in which the whole sequence to be sorted can fit in the
3D streamers simulation in a pin to plane configuration using massively parallel computing
Plewa, J.-M.; Eichwald, O.; Ducasse, O.; Dessante, P.; Jacobs, C.; Renon, N.; Yousfi, M.
2018-03-01
This paper concerns the 3D simulation of corona discharge using high performance computing (HPC) managed with the message passing interface (MPI) library. In the field of finite volume methods applied on non-adaptive mesh grids and in the case of a specific 3D dynamic benchmark test devoted to streamer studies, the great efficiency of the iterative R&B SOR and BiCGSTAB methods versus the direct MUMPS method was clearly demonstrated in solving the Poisson equation using HPC resources. The optimization of the parallelization and the resulting scalability was undertaken as a function of the HPC architecture for a number of mesh cells ranging from 8 to 512 million and a number of cores ranging from 20 to 1600. The R&B SOR method remains at least about four times faster than the BiCGSTAB method and requires significantly less memory for all tested situations. The R&B SOR method was then implemented in a 3D MPI parallelized code that solves the classical first order model of an atmospheric pressure corona discharge in air. The 3D code capabilities were tested by following the development of one, two and four coplanar streamers generated by initial plasma spots for 6 ns. The preliminary results obtained allowed us to follow in detail the formation of the tree structure of a corona discharge and the effects of the mutual interactions between the streamers in terms of streamer velocity, trajectory and diameter. The computing time for 64 million of mesh cells distributed over 1000 cores using the MPI procedures is about 30 min ns-1, regardless of the number of streamers.
Non-Blocking Concurrent Imperative Programming with Session Types
Miguel Silva
2017-01-01
Full Text Available Concurrent C0 is an imperative programming language in the C family with session-typed message-passing concurrency. The previously proposed semantics implements asynchronous (non-blocking output; we extend it here with non-blocking input. A key idea is to postpone message reception as much as possible by interpreting receive commands as a request for a message. We implemented our ideas as a translation from a blocking intermediate language to a non-blocking language. Finally, we evaluated our techniques with several benchmark programs and show the results obtained. While the abstract measure of span always decreases (or remains unchanged, only a few of the examples reap a practical benefit.
The language parallel Pascal and other aspects of the massively parallel processor
Reeves, A. P.; Bruner, J. D.
1982-01-01
A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.
Fox, Geoffrey C; Messina, Guiseppe C
2014-01-01
A clear illustration of how parallel computers can be successfully appliedto large-scale scientific computations. This book demonstrates how avariety of applications in physics, biology, mathematics and other scienceswere implemented on real parallel computers to produce new scientificresults. It investigates issues of fine-grained parallelism relevant forfuture supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configuredifferent massively parallel machines, design and implement basic systemsoftware, and develop
Masahiro, Tatsumi; Akio, Yamamoto
2003-01-01
A production code SCOPE2 was developed based on the fine-grained parallel algorithm by the red/black iterative method targeting parallel computing environments such as a PC-cluster. It can perform a depletion calculation in a few hours using a PC-cluster with the model based on a 9-group nodal-SP3 transport method in 3-dimensional pin-by-pin geometry for in-core fuel management of commercial PWRs. The present algorithm guarantees the identical convergence process as that in serial execution, which is very important from the viewpoint of quality management. The fine-mesh geometry is constructed by hierarchical decomposition with introduction of intermediate management layer as a block that is a quarter piece of a fuel assembly in radial direction. A combination of a mesh division scheme forcing even meshes on each edge and a latency-hidden communication algorithm provided simplicity and efficiency to message passing to enhance parallel performance. Inter-processor communication and parallel I/O access were realized using the MPI functions. Parallel performance was measured for depletion calculations by the 9-group nodal-SP3 transport method in 3-dimensional pin-by-pin geometry with 340 x 340 x 26 meshes for full core geometry and 170 x 170 x 26 for quarter core geometry. A PC cluster that consists of 24 Pentium-4 processors connected by the Fast Ethernet was used for the performance measurement. Calculations in full core geometry gave better speedups compared to those in quarter core geometry because of larger granularity. Fine-mesh sweep and feedback calculation parts gave almost perfect scalability since granularity is large enough, while 1-group coarse-mesh diffusion acceleration gave only around 80%. The speedup and parallel efficiency for total computation time were 22.6 and 94%, respectively, for the calculation in full core geometry with 24 processors. (authors)
Experience with a clustered parallel reduction machine
Beemster, M.; Hartel, Pieter H.; Hertzberger, L.O.; Hofman, R.F.H.; Langendoen, K.G.; Li, L.L.; Milikowski, R.; Vree, W.G.; Barendregt, H.P.; Mulder, J.C.
A clustered architecture has been designed to exploit divide and conquer parallelism in functional programs. The programming methodology developed for the machine is based on explicit annotations and program transformations. It has been successfully applied to a number of algorithms resulting in a
DeHart, Mark D.; Williams, Mark L.; Bowman, Stephen M.
2010-01-01
The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement
Freeman, Bryan
2013-01-01
This book contains practical recipes on everything you will need to create task-based parallel programs using C#, .NET 4.5, and Visual Studio. The book is packed with illustrated code examples to create scalable programs.This book is intended to help experienced C# developers write applications that leverage the power of modern multicore processors. It provides the necessary knowledge for an experienced C# developer to work with .NET parallelism APIs. Previous experience of writing multithreaded applications is not necessary.
Parallel Atomistic Simulations
HEFFELFINGER,GRANT S.
2000-01-18
Algorithms developed to enable the use of atomistic molecular simulation methods with parallel computers are reviewed. Methods appropriate for bonded as well as non-bonded (and charged) interactions are included. While strategies for obtaining parallel molecular simulations have been developed for the full variety of atomistic simulation methods, molecular dynamics and Monte Carlo have received the most attention. Three main types of parallel molecular dynamics simulations have been developed, the replicated data decomposition, the spatial decomposition, and the force decomposition. For Monte Carlo simulations, parallel algorithms have been developed which can be divided into two categories, those which require a modified Markov chain and those which do not. Parallel algorithms developed for other simulation methods such as Gibbs ensemble Monte Carlo, grand canonical molecular dynamics, and Monte Carlo methods for protein structure determination are also reviewed and issues such as how to measure parallel efficiency, especially in the case of parallel Monte Carlo algorithms with modified Markov chains are discussed.
Yan, Beichuan; Regueiro, Richard A.
2018-02-01
A three-dimensional (3D) DEM code for simulating complex-shaped granular particles is parallelized using message-passing interface (MPI). The concepts of link-block, ghost/border layer, and migration layer are put forward for design of the parallel algorithm, and theoretical scalability function of 3-D DEM scalability and memory usage is derived. Many performance-critical implementation details are managed optimally to achieve high performance and scalability, such as: minimizing communication overhead, maintaining dynamic load balance, handling particle migrations across block borders, transmitting C++ dynamic objects of particles between MPI processes efficiently, eliminating redundant contact information between adjacent MPI processes. The code executes on multiple US Department of Defense (DoD) supercomputers and tests up to 2048 compute nodes for simulating 10 million three-axis ellipsoidal particles. Performance analyses of the code including speedup, efficiency, scalability, and granularity across five orders of magnitude of simulation scale (number of particles) are provided, and they demonstrate high speedup and excellent scalability. It is also discovered that communication time is a decreasing function of the number of compute nodes in strong scaling measurements. The code's capability of simulating a large number of complex-shaped particles on modern supercomputers will be of value in both laboratory studies on micromechanical properties of granular materials and many realistic engineering applications involving granular materials.
Computer-Aided Parallelizer and Optimizer
Jin, Haoqiang
2011-01-01
The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
Kopasakis, George; Connolly, Joseph W.; Cheng, Larry
2015-01-01
This paper covers the development of stage-by-stage and parallel flow path compressor modeling approaches for a Variable Cycle Engine. The stage-by-stage compressor modeling approach is an extension of a technique for lumped volume dynamics and performance characteristic modeling. It was developed to improve the accuracy of axial compressor dynamics over lumped volume dynamics modeling. The stage-by-stage compressor model presented here is formulated into a parallel flow path model that includes both axial and rotational dynamics. This is done to enable the study of compressor and propulsion system dynamic performance under flow distortion conditions. The approaches utilized here are generic and should be applicable for the modeling of any axial flow compressor design accurate time domain simulations. The objective of this work is as follows. Given the parameters describing the conditions of atmospheric disturbances, and utilizing the derived formulations, directly compute the transfer function poles and zeros describing these disturbances for acoustic velocity, temperature, pressure, and density. Time domain simulations of representative atmospheric turbulence can then be developed by utilizing these computed transfer functions together with the disturbance frequencies of interest.
Sitchinava, Nodar; Zeh, Norbert
2012-01-01
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of...... in the optimal OhOf(psortN + K/PB) parallel I/O complexity, where K is the size of the output reported in the process and psortN is the parallel I/O complexity of sorting N elements using P processors....
Deshmane, Anagha; Gulani, Vikas; Griswold, Mark A; Seiberlich, Nicole
2012-07-01
Parallel imaging is a robust method for accelerating the acquisition of magnetic resonance imaging (MRI) data, and has made possible many new applications of MR imaging. Parallel imaging works by acquiring a reduced amount of k-space data with an array of receiver coils. These undersampled data can be acquired more quickly, but the undersampling leads to aliased images. One of several parallel imaging algorithms can then be used to reconstruct artifact-free images from either the aliased images (SENSE-type reconstruction) or from the undersampled data (GRAPPA-type reconstruction). The advantages of parallel imaging in a clinical setting include faster image acquisition, which can be used, for instance, to shorten breath-hold times resulting in fewer motion-corrupted examinations. In this article the basic concepts behind parallel imaging are introduced. The relationship between undersampling and aliasing is discussed and two commonly used parallel imaging methods, SENSE and GRAPPA, are explained in detail. Examples of artifacts arising from parallel imaging are shown and ways to detect and mitigate these artifacts are described. Finally, several current applications of parallel imaging are presented and recent advancements and promising research in parallel imaging are briefly reviewed. Copyright © 2012 Wiley Periodicals, Inc.
Parallel Algorithms and Patterns
Robey, Robert W. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2016-06-16
This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.
Parallel Computing Strategies for Irregular Algorithms
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Grebner, Christoph; Becker, Johannes; Weber, Daniel; Bellinger, Daniel; Tafipolski, Maxim; Brückner, Charlotte; Engels, Bernd
2014-09-15
The presented program package, Conformational Analysis and Search Tool (CAST) allows the accurate treatment of large and flexible (macro) molecular systems. For the determination of thermally accessible minima CAST offers the newly developed TabuSearch algorithm, but algorithms such as Monte Carlo (MC), MC with minimization, and molecular dynamics are implemented as well. For the determination of reaction paths, CAST provides the PathOpt, the Nudge Elastic band, and the umbrella sampling approach. Access to free energies is possible through the free energy perturbation approach. Along with a number of standard force fields, a newly developed symmetry-adapted perturbation theory-based force field is included. Semiempirical computations are possible through DFTB+ and MOPAC interfaces. For calculations based on density functional theory, a Message Passing Interface (MPI) interface to the Graphics Processing Unit (GPU)-accelerated TeraChem program is available. The program is available on request. Copyright © 2014 Wiley Periodicals, Inc.
Parkhurst, David L.; Kipp, Kenneth L.; Engesgaard, Peter; Charlton, Scott R.
2004-01-01
format suitable for exporting to spreadsheets and post-processing programs; or in Hierarchical Data Format (HDF), which is a compressed binary format. Data in the HDF file can be visualized on Windows computers with the program Model Viewer and extracted with the utility program PHASTHDF; both programs are distributed with PHAST. Operator splitting of the flow, transport, and geochemical equations is used to separate the three processes into three sequential calculations. No iterations between transport and reaction calculations are implemented. A three-dimensional Cartesian coordinate system and finite-difference techniques are used for the spatial and temporal discretization of the flow and transport equations. The non-linear chemical equilibrium equations are solved by a Newton-Raphson method, and the kinetic reaction equations are solved by a Runge-Kutta or an implicit method for integrating ordinary differential equations. The PHAST simulator may require large amounts of memory and long Central Processing Unit (CPU) times. To reduce the long CPU times, a parallel version of PHAST has been developed that runs on a multiprocessor computer or on a collection of computers that are networked. The parallel version requires Message Passing Interface, which is currently (2004) freely available. The parallel version is effective in reducing simulation times. This report documents the use of the PHAST simulator, including running the simulator, preparing the input files, selecting the output files, and visualizing the results. It also presents four examples that verify the numerical method and demonstrate the capabilities of the simulator. PHAST requires three input files. Only the flow and transport file is described in detail in this report. The other two files, the chemistry data file and the database file, are identical to PHREEQC files and the detailed description of these files is found in the PHREEQC documentation.
Parallel discrete event simulation
Overeinder, B.J.; Hertzberger, L.O.; Sloot, P.M.A.; Withagen, W.J.
1991-01-01
In simulating applications for execution on specific computing systems, the simulation performance figures must be known in a short period of time. One basic approach to the problem of reducing the required simulation time is the exploitation of parallelism. However, in parallelizing the simulation
Parallel reservoir simulator computations
Hemanth-Kumar, K.; Young, L.C.
1995-01-01
The adaptation of a reservoir simulator for parallel computations is described. The simulator was originally designed for vector processors. It performs approximately 99% of its calculations in vector/parallel mode and relative to scalar calculations it achieves speedups of 65 and 81 for black oil and EOS simulations, respectively on the CRAY C-90
Automatic Management of Parallel and Distributed System Resources
Yan, Jerry; Ngai, Tin Fook; Lundstrom, Stephen F.
1990-01-01
Viewgraphs on automatic management of parallel and distributed system resources are presented. Topics covered include: parallel applications; intelligent management of multiprocessing systems; performance evaluation of parallel architecture; dynamic concurrent programs; compiler-directed system approach; lattice gaseous cellular automata; and sparse matrix Cholesky factorization.
Totally parallel multilevel algorithms
Frederickson, Paul O.
1988-01-01
Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which are referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.
Massively parallel mathematical sieves
Montry, G.R.
1989-01-01
The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
Advanced parallel processing with supercomputer architectures
Hwang, K.
1987-01-01
This paper investigates advanced parallel processing techniques and innovative hardware/software architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. The authors cover advanced processing techniques suitable for supercomputers, high-end mainframes, minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, load balancing methods, mapping parallel algorithms, operating system functions, application library, and multidiscipline interactions are investigated to ensure high performance. At the end, they assess the potentials of optical and neural technologies for developing future supercomputers
Applications of the parallel computing system using network
Ido, Shunji; Hasebe, Hiroki
1994-01-01
Parallel programming is applied to multiple processors connected in Ethernet. Data exchanges between tasks located in each processing element are realized by two ways. One is socket which is standard library on recent UNIX operating systems. Another is a network connecting software, named as Parallel Virtual Machine (PVM) which is a free software developed by ORNL, to use many workstations connected to network as a parallel computer. This paper discusses the availability of parallel computing using network and UNIX workstations and comparison between specialized parallel systems (Transputer and iPSC/860) in a Monte Carlo simulation which generally shows high parallelization ratio. (author)
Parallelism and Scalability in an Image Processing Application
Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven
2008-01-01
parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately......The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chip. This means that parallel processing is required in application areas that traditionally have not used...
Parallelism and Scalability in an Image Processing Application
Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven
2009-01-01
parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately......The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chips. This means that parallel processing is required in application areas that traditionally have not used...
Lee, A. Y.
1967-01-01
Computer program calculates the steady state fluid distribution, temperature rise, and pressure drop of a coolant, the material temperature distribution of a heat generating solid, and the heat flux distributions at the fluid-solid interfaces. It performs the necessary iterations automatically within the computer, in one machine run.
Grant, Mary C.; Zhang, Lilly; Damiano, Michele
2009-01-01
This study investigated kernel equating methods by comparing these methods to operational equatings for two tests in the SAT Subject Tests[TM] program. GENASYS (ETS, 2007) was used for all equating methods and scaled score kernel equating results were compared to Tucker, Levine observed score, chained linear, and chained equipercentile equating…
Adapting algorithms to massively parallel hardware
Sioulas, Panagiotis
2016-01-01
In the recent years, the trend in computing has shifted from delivering processors with faster clock speeds to increasing the number of cores per processor. This marks a paradigm shift towards parallel programming in which applications are programmed to exploit the power provided by multi-cores. Usually there is gain in terms of the time-to-solution and the memory footprint. Specifically, this trend has sparked an interest towards massively parallel systems that can provide a large number of processors, and possibly computing nodes, as in the GPUs and MPPAs (Massively Parallel Processor Arrays). In this project, the focus was on two distinct computing problems: k-d tree searches and track seeding cellular automata. The goal was to adapt the algorithms to parallel systems and evaluate their performance in different cases.
Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
Robert Gerstenberger
2014-01-01
Full Text Available Modern interconnects offer remote direct memory access (RDMA features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly, however, it's scalability and practicability has to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than UPC and Fortran Coarrays in terms of latency, bandwidth and message rate. We also demonstrate application performance improvements with comparable programming complexity.
Bayer image parallel decoding based on GPU
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.
Heggarty, J.W.
1999-06-01
For almost thirty years, sequential R-matrix computation has been used by atomic physics research groups, from around the world, to model collision phenomena involving the scattering of electrons or positrons with atomic or molecular targets. As considerable progress has been made in the understanding of fundamental scattering processes, new data, obtained from more complex calculations, is of current interest to experimentalists. Performing such calculations, however, places considerable demands on the computational resources to be provided by the target machine, in terms of both processor speed and memory requirement. Indeed, in some instances the computational requirements are so great that the proposed R-matrix calculations are intractable, even when utilising contemporary classic supercomputers. Historically, increases in the computational requirements of R-matrix computation were accommodated by porting the problem codes to a more powerful classic supercomputer. Although this approach has been successful in the past, it is no longer considered to be a satisfactory solution due to the limitations of current (and future) Von Neumann machines. As a consequence, there has been considerable interest in the high performance multicomputers, that have emerged over the last decade which appear to offer the computational resources required by contemporary R-matrix research. Unfortunately, developing codes for these machines is not as simple a task as it was to develop codes for successive classic supercomputers. The difficulty arises from the considerable differences in the computing models that exist between the two types of machine and results in the programming of multicomputers to be widely acknowledged as a difficult, time consuming and error-prone task. Nevertheless, unless parallel R-matrix computation is realised, important theoretical and experimental atomic physics research will continue to be hindered. This thesis describes work that was undertaken in
Algorithms for parallel computers
Churchhouse, R.F.
1985-01-01
Until relatively recently almost all the algorithms for use on computers had been designed on the (usually unstated) assumption that they were to be run on single processor, serial machines. With the introduction of vector processors, array processors and interconnected systems of mainframes, minis and micros, however, various forms of parallelism have become available. The advantage of parallelism is that it offers increased overall processing speed but it also raises some fundamental questions, including: (i) which, if any, of the existing 'serial' algorithms can be adapted for use in the parallel mode. (ii) How close to optimal can such adapted algorithms be and, where relevant, what are the convergence criteria. (iii) How can we design new algorithms specifically for parallel systems. (iv) For multi-processor systems how can we handle the software aspects of the interprocessor communications. Aspects of these questions illustrated by examples are considered in these lectures. (orig.)
Parallelism and array processing
Zacharov, V.
1983-01-01
Modern computing, as well as the historical development of computing, has been dominated by sequential monoprocessing. Yet there is the alternative of parallelism, where several processes may be in concurrent execution. This alternative is discussed in a series of lectures, in which the main developments involving parallelism are considered, both from the standpoint of computing systems and that of applications that can exploit such systems. The lectures seek to discuss parallelism in a historical context, and to identify all the main aspects of concurrency in computation right up to the present time. Included will be consideration of the important question as to what use parallelism might be in the field of data processing. (orig.)
Parallel magnetic resonance imaging
Larkman, David J; Nunes, Rita G
2007-01-01
Parallel imaging has been the single biggest innovation in magnetic resonance imaging in the last decade. The use of multiple receiver coils to augment the time consuming Fourier encoding has reduced acquisition times significantly. This increase in speed comes at a time when other approaches to acquisition time reduction were reaching engineering and human limits. A brief summary of spatial encoding in MRI is followed by an introduction to the problem parallel imaging is designed to solve. There are a large number of parallel reconstruction algorithms; this article reviews a cross-section, SENSE, SMASH, g-SMASH and GRAPPA, selected to demonstrate the different approaches. Theoretical (the g-factor) and practical (coil design) limits to acquisition speed are reviewed. The practical implementation of parallel imaging is also discussed, in particular coil calibration. How to recognize potential failure modes and their associated artefacts are shown. Well-established applications including angiography, cardiac imaging and applications using echo planar imaging are reviewed and we discuss what makes a good application for parallel imaging. Finally, active research areas where parallel imaging is being used to improve data quality by repairing artefacted images are also reviewed. (invited topical review)
The STAPL Parallel Graph Library
Harshvardhan,; Fidel, Adam; Amato, Nancy M.; Rauchwerger, Lawrence
2013-01-01
This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable
Roofline model toolkit: A practical tool for architectural and program analysis
Lo, Yu Jung [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Williams, Samuel [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Van Straalen, Brian [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Ligocki, Terry J. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Cordery, Matthew J. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Wright, Nicholas J. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Hall, Mary W. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Oliker, Leonid [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
2015-04-18
We present preliminary results of the Roofline Toolkit for multicore, many core, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory managed mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.
Hongwen He
2013-01-01
Full Text Available Energy management strategy influences the power performance and fuel economy of plug-in hybrid electric vehicles greatly. To explore the fuel-saving potential of a plug-in hybrid electric bus (PHEB, this paper searched the global optimal energy management strategy using dynamic programming (DP algorithm. Firstly, the simplified backward model of the PHEB was built which is necessary for DP algorithm. Then the torque and speed of engine and the torque of motor were selected as the control variables, and the battery state of charge (SOC was selected as the state variables. The DP solution procedure was listed, and the way was presented to find all possible control variables at every state of each stage in detail. Finally, the appropriate SOC increment is determined after quantizing the state variables, and then the optimal control of long driving distance of a specific driving cycle is replaced with the optimal control of one driving cycle, which reduces the computational time significantly and keeps the precision at the same time. The simulation results show that the fuel economy of the PEHB with the optimal energy management strategy is improved by 53.7% compared with that of the conventional bus, which can be a benchmark for the assessment of other control strategies.
Domain decomposition methods and parallel computing
Meurant, G.
1991-01-01
In this paper, we show how to efficiently solve large linear systems on parallel computers. These linear systems arise from discretization of scientific computing problems described by systems of partial differential equations. We show how to get a discrete finite dimensional system from the continuous problem and the chosen conjugate gradient iterative algorithm is briefly described. Then, the different kinds of parallel architectures are reviewed and their advantages and deficiencies are emphasized. We sketch the problems found in programming the conjugate gradient method on parallel computers. For this algorithm to be efficient on parallel machines, domain decomposition techniques are introduced. We give results of numerical experiments showing that these techniques allow a good rate of convergence for the conjugate gradient algorithm as well as computational speeds in excess of a billion of floating point operations per second. (author). 5 refs., 11 figs., 2 tabs., 1 inset
6th International Parallel Tools Workshop
Brinkmann, Steffen; Gracia, José; Resch, Michael; Nagel, Wolfgang
2013-01-01
The latest advances in the High Performance Computing hardware have significantly raised the level of available compute performance. At the same time, the growing hardware capabilities of modern supercomputing architectures have caused an increasing complexity of the parallel application development. Despite numerous efforts to improve and simplify parallel programming, there is still a lot of manual debugging and tuning work required. This process is supported by special software tools, facilitating debugging, performance analysis, and optimization and thus making a major contribution to the development of robust and efficient parallel software. This book introduces a selection of the tools, which were presented and discussed at the 6th International Parallel Tools Workshop, held in Stuttgart, Germany, 25-26 September 2012.
High performance parallel computers for science
Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.
1989-01-01
This paper reports that Fermilab's Advanced Computer Program (ACP) has been developing cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 Mflops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction
Sung Yong Cho
2014-12-01
Full Text Available Purpose The aims of this study were to investigate the efficacy of combining the systematized behavioral modification program (SBMP with desmopressin therapy and to compare this with desmopressin monotherapy in the treatment of nocturnal polyuria (NPU. Methods Patients were randomized at 8 centers to receive desmopressin monotherapy (group A or combination therapy, comprising desmopressin and the SBMP (group B. Nocturia was defined as an average of 2 or more nightly voids. The primary endpoint was a change in the mean number of nocturnal voids from baseline during the 3-month treatment period. The secondary endpoints were changes in the bladder diary parameters and questionnaires scores, and improvements in self-perception for nocturia. Results A total of 200 patients were screened and 76 were excluded from the study, because they failed the screening process. A total of 124 patients were randomized to receive treatment, with group A comprising 68 patients and group B comprising 56 patients. The patients' characteristics were similar between the groups. Nocturnal voids showed a greater decline in group B (-1.5 compared with group A (-1.2, a difference that was not statistically significant. Significant differences were observed between groups A and B with respect to the NPU index (0.37 vs. 0.29, P=0.028, the change in the maximal bladder capacity (-41.3 mL vs. 13.3 mL, P<0.001, and the rate of patients lost to follow up (10.3% [7/68] vs. 0% [0/56], P=0.016. Self-perception for nocturia significantly improved in both groups. Conclusions Combination treatment did not have any additional benefits in relation to reducing nocturnal voids in patients with NPU; however, combination therapy is helpful because it increases the maximal bladder capacity and decreases the NPI. Furthermore, combination therapy increased the persistence of desmopressin in patients with NPU.
Massively parallel multicanonical simulations
Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard
2018-03-01
Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 104 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.
SPINning parallel systems software
Matlin, O.S.; Lusk, E.; McCune, W.
2002-01-01
We describe our experiences in using Spin to verify parts of the Multi Purpose Daemon (MPD) parallel process management system. MPD is a distributed collection of processes connected by Unix network sockets. MPD is dynamic processes and connections among them are created and destroyed as MPD is initialized, runs user processes, recovers from faults, and terminates. This dynamic nature is easily expressible in the Spin/Promela framework but poses performance and scalability challenges. We present here the results of expressing some of the parallel algorithms of MPD and executing both simulation and verification runs with Spin
Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.
2017-07-01
Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).