MEDUSA - An overset grid flow solver for network-based parallel computer systems
Smith, Merritt H.; Pallis, Jani M.
1993-01-01
Continuing improvement in processing speed has made it feasible to solve the Reynolds-Averaged Navier-Stokes equations for simple three-dimensional flows on advanced workstations. Combining multiple workstations into a network-based heterogeneous parallel computer allows the application of programming principles learned on MIMD (Multiple Instruction Multiple Data) distributed memory parallel computers to the solution of larger problems. An overset-grid flow solution code has been developed which uses a cluster of workstations as a network-based parallel computer. Inter-process communication is provided by the Parallel Virtual Machine (PVM) software. Solution speed equivalent to one-third of a Cray-YMP processor has been achieved from a cluster of nine commonly used engineering workstation processors. Load imbalance and communication overhead are the principal impediments to parallel efficiency in this application.
Optimization on a Network-based Parallel Computer System for Supersonic Laminar Wing Design
Garcia, Joseph A.; Cheung, Samson; Holst, Terry L. (Technical Monitor)
1995-01-01
A set of Computational Fluid Dynamics (CFD) routines and flow transition prediction tools are integrated into a network based parallel numerical optimization routine. Through this optimization routine, the design of a 2-D airfoil and an infinitely swept wing will be studied in order to advance the design cycle capability of supersonic laminar flow wings. The goal of advancing supersonic laminar flow wing design is achieved by wisely choosing the design variables used in the optimization routine. The design variables are represented by the theory of Fourier series and potential theory. These theories, combined with the parallel CFD flow routines and flow transition prediction tools, provide a design space for a global optimal point to be searched. Finally, the parallel optimization routine enables gradient evaluations to be performed in a fast and parallel fashion.
National Aeronautics and Space Administration — Remote Sensing Solutions proposes to develop the Network-based Parallel Retrieval Onboard Computing Environment for Sensor Systems (nPROCESS) for deployment on...
1982-01-01
Parallel Computations focuses on parallel computation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallel computers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed. Comprised of 13 chapters, this volume begins by classifying parallel computers and describing techn
Fox, Geoffrey C; Messina, Guiseppe C
2014-01-01
A clear illustration of how parallel computers can be successfully applied to large-scale scientific computations. This book demonstrates how a variety of applications in physics, biology, mathematics and other sciences were implemented on real parallel computers to produce new scientific results. It investigates issues of fine-grained parallelism relevant for future supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configure different massively parallel machines, design and implement basic system software, and develop
Energy Technology Data Exchange (ETDEWEB)
1991-10-23
An account of the Caltech Concurrent Computation Program (C³P), a five-year project that focused on answering the question: "Can parallel computers be used to do large-scale scientific computations?" As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C³P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C³P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Parallelism in matrix computations
Gallopoulos, Efstratios; Sameh, Ahmed H
2016-01-01
This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded, Vandermonde, Toeplitz, and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...
Morse, H Stephen
1994-01-01
Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment. Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi
Algorithmically specialized parallel computers
Snyder, Lawrence; Gannon, Dennis B
1985-01-01
Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer. This book discusses algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster
Parallel reservoir simulator computations
International Nuclear Information System (INIS)
Hemanth-Kumar, K.; Young, L.C.
1995-01-01
The adaptation of a reservoir simulator for parallel computations is described. The simulator was originally designed for vector processors. It performs approximately 99% of its calculations in vector/parallel mode and, relative to scalar calculations, it achieves speedups of 65 and 81 for black oil and EOS simulations, respectively, on the CRAY C-90.
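As a rough consistency check on the figures quoted above, Amdahl's law bounds the overall speedup when only a fraction of the work is accelerated. The sketch below assumes that the "approximately 99%" figure can be read as the fraction of scalar run time that is accelerated, which the abstract does not state; the function names are illustrative and do not come from the paper.

```python
def amdahl_speedup(fraction: float, mode_speedup: float) -> float:
    """Overall speedup when `fraction` of the scalar run time is
    accelerated by a factor `mode_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / mode_speedup)


def required_mode_speedup(fraction: float, overall: float) -> float:
    """Invert Amdahl's law: the accelerated-mode speedup needed to reach
    the overall speedup `overall`."""
    remaining = 1.0 / overall - (1.0 - fraction)
    if remaining <= 0:
        raise ValueError("overall speedup exceeds the bound 1 / (1 - fraction)")
    return fraction / remaining


# Assumption: 99% of the scalar run time is vectorizable/parallelizable.
ASSUMED_FRACTION = 0.99
# Upper bound on the overall speedup under that assumption (about 100).
upper_bound = 1.0 / (1.0 - ASSUMED_FRACTION)

for reported in (65, 81):
    needed = required_mode_speedup(ASSUMED_FRACTION, reported)
    print(f"overall speedup {reported}: needs mode speedup of about {needed:.0f}")
```

Under this reading, both reported speedups sit below the bound of roughly 100, so they are consistent with the stated fraction.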
Algorithms for parallel computers
International Nuclear Information System (INIS)
Churchhouse, R.F.
1985-01-01
Until relatively recently almost all the algorithms for use on computers had been designed on the (usually unstated) assumption that they were to be run on single processor, serial machines. With the introduction of vector processors, array processors and interconnected systems of mainframes, minis and micros, however, various forms of parallelism have become available. The advantage of parallelism is that it offers increased overall processing speed but it also raises some fundamental questions, including: (i) which, if any, of the existing 'serial' algorithms can be adapted for use in the parallel mode. (ii) How close to optimal can such adapted algorithms be and, where relevant, what are the convergence criteria. (iii) How can we design new algorithms specifically for parallel systems. (iv) For multi-processor systems how can we handle the software aspects of the interprocessor communications. Aspects of these questions illustrated by examples are considered in these lectures. (orig.)
Applied Parallel Computing Industrial Computation and Optimization
DEFF Research Database (Denmark)
Madsen, Kaj; Olesen, Dorte
Proceedings and the Third International Workshop on Applied Parallel Computing in Industrial Problems and Optimization (PARA96)...
Arrasmith, William W.; Sullivan, Sean F.
2008-04-01
Phase diversity imaging methods work well in removing atmospheric turbulence and some system effects from predominantly near-field imaging systems. However, phase diversity approaches can be computationally intensive and slow. We present a recently adapted, high-speed phase diversity method using a conventional, software-based neural network paradigm. This phase-diversity method has the advantage of eliminating many time-consuming, computationally heavy calculations and directly estimates the optical transfer function from the entrance pupil phases or phase differences. Additionally, this method is more accurate than conventional Zernike-based, phase diversity approaches and lends itself to implementation on parallel software or hardware architectures. We use computer simulation to demonstrate how this high-speed, phase diverse imaging method can be implemented on a parallel, high-speed, neural network-based architecture, specifically the Cellular Neural Network (CNN). The CNN architecture was chosen as a representative, neural network-based processing environment because 1) the CNN can be implemented in 2-D or 3-D processing schemes, 2) it can be implemented in hardware or software, 3) recent 2-D implementations of CNN technology have shown a 3 orders of magnitude superiority in speed, area, or power over equivalent digital representations, and 4) a complete development environment exists. We also provide a short discussion on processing speed.
Massively parallel quantum computer simulator
De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.
2007-01-01
We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray
Introduction to Parallel Computing
1992-05-01
routing, often termed wormhole routing, can eliminate some of these difficulties [Dally:87]. Among the commercially available processors in this category...computational processor. [Figure 2-7. Intel Message Routing] The iPSC/860 uses a derivative of wormhole ...FORTRAN attempt to identify actual data dependencies vis-a-vis the artificially imposed dependencies which are caused by the nature of these
Parallel algorithms and cluster computing
Hoffmann, Karl Heinz
2007-01-01
This book presents major advances in high performance computing as well as major advances due to high performance computing. It contains a collection of papers in which results achieved in the collaboration of scientists from computer science, mathematics, physics, and mechanical engineering are presented. From the science problems to the mathematical algorithms and on to the effective implementation of these algorithms on massively parallel and cluster computers we present state-of-the-art methods and technology as well as exemplary results in these fields. This book shows that problems which seem superficially distinct become intimately connected on a computational level.
Wakefield calculations on parallel computers
International Nuclear Information System (INIS)
Schoessow, P.
1990-01-01
The use of parallelism in the solution of wakefield problems is illustrated for two different computer architectures (SIMD and MIMD). Results are given for finite difference codes which have been implemented on a Connection Machine and an Alliant FX/8 and which are used to compute wakefields in dielectric loaded structures. Benchmarks on code performance are presented for both cases. 4 refs., 3 figs., 2 tabs
The science of computing - Parallel computation
Denning, P. J.
1985-01-01
Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concomitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.
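The 'hypercube' arrangement mentioned above can be sketched with node labels treated as bit strings: a d-dimensional hypercube has 2^d nodes, and each node links to the d nodes whose labels differ from its own in exactly one bit. Under that standard definition, 128 processors correspond to d = 7. The function name below is illustrative and is not taken from the article.

```python
def hypercube_neighbors(node: int, dimension: int) -> list[int]:
    """Labels of the nodes adjacent to `node` in a `dimension`-cube:
    flip each of the `dimension` label bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimension)]


DIMENSION = 7                      # 2**7 = 128 processors
node_count = 2 ** DIMENSION

# Every node has exactly DIMENSION neighbours, and links are symmetric.
links = {node: hypercube_neighbors(node, DIMENSION) for node in range(node_count)}
print(node_count, "nodes,", len(links[0]), "links per node")
print("neighbours of node 0:", links[0])
```

Because any two labels differ in at most d bits, a message needs at most d hops between any pair of nodes, which is one reason the topology was attractive for early multiprocessors.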
Directory of Open Access Journals (Sweden)
Jilin Zhang
2017-01-01
With the development of mobile systems, we gain many benefits and conveniences by leveraging mobile devices; at the same time, the information gathered by smartphones, such as location and environment, is also valuable for businesses seeking to provide more intelligent services for customers. More and more machine learning methods have been used in the field of mobile information systems to study user behavior and classify usage patterns, especially convolutional neural networks. With the growth of model training parameters and data scale, the traditional single-machine training method cannot meet the time requirements of practical application scenarios. Current training frameworks often use simple data-parallel or model-parallel methods to speed up the training process, which is why heterogeneous computing resources have not been fully utilized. To solve these problems, our paper proposes a delay synchronization convolutional neural network parallel strategy, which leverages the heterogeneous system. The strategy is based on both synchronous parallel and asynchronous parallel approaches; the model training process can reduce its dependence on the heterogeneous architecture while still ensuring model convergence, so the convolutional neural network framework is more adaptive to different heterogeneous system environments. The experimental results show that the proposed delay synchronization strategy can achieve at least three times the speedup compared to traditional data parallelism.
Parallel computing: numerics, applications, and trends
National Research Council Canada - National Science Library
Trobec, Roman; Vajteršic, Marián; Zinterhof, Peter
2009-01-01
... and/or distributed systems. The contributions to this book are focused on topics most concerned in the trends of today's parallel computing. These range from parallel algorithmics, programming, tools, network computing to future parallel computing. Particular attention is paid to parallel numerics: linear algebra, differential equations, numerica...
Parallel Pascal - An extended Pascal for parallel computers
Reeves, A. P.
1984-01-01
Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.
Parallel computing in enterprise modeling.
Energy Technology Data Exchange (ETDEWEB)
Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.
2008-08-01
This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principle makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language, which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.
Parallel computation of rotating flows
DEFF Research Database (Denmark)
Lundin, Lars Kristian; Barker, Vincent A.; Sørensen, Jens Nørkær
1999-01-01
This paper deals with the simulation of 3‐D rotating flows based on the velocity‐vorticity formulation of the Navier‐Stokes equations in cylindrical coordinates. The governing equations are discretized by a finite difference method. The solution is advanced to a new time level by a two‐step process...... is that of solving a singular, large, sparse, over‐determined linear system of equations, and the iterative method CGLS is applied for this purpose. We discuss some of the mathematical and numerical aspects of this procedure and report on the performance of our software on a wide range of parallel computers.
Advances in randomized parallel computing
Rajasekaran, Sanguthevar
1999-01-01
The technique of randomization has been employed to solve numerous problems of computing both sequentially and in parallel. Examples of randomized algorithms that are asymptotically better than their deterministic counterparts in solving various fundamental problems abound. Randomized algorithms have the advantages of simplicity and better performance both in theory and often in practice. This book is a collection of articles written by renowned experts in the area of randomized parallel computing. A brief introduction to randomized algorithms: In the analysis of algorithms, at least three different measures of performance can be used: the best case, the worst case, and the average case. Often, the average case run time of an algorithm is much smaller than the worst case. For instance, the worst case run time of Hoare's quicksort is O(n^2), whereas its average case run time is only O(n log n). The average case analysis is conducted with an assumption on the input space. The assumption made to arrive at t...
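The quicksort run times quoted above motivate a small illustration of randomization. The sketch below is a textbook randomized quicksort, not code from the book: choosing the pivot uniformly at random makes the expected run time O(n log n) for every input, so the bound no longer rests on an assumption about the input distribution. The function name is illustrative.

```python
import random


def randomized_quicksort(items):
    """Return a sorted copy of `items`; the pivot is drawn uniformly at
    random, so no fixed input can force the quadratic worst case."""
    if len(items) <= 1:
        return list(items)
    pivot = random.choice(items)
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)


# An already sorted input is a classic bad case for a first-element pivot;
# with a random pivot it is handled like any other input.
data = list(range(1000))
assert randomized_quicksort(data) == data
```

The three-way split into `less`, `equal` and `greater` keeps the recursion well behaved when the input contains many duplicate keys.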
Parallel Computing for Brain Simulation.
Pastur-Romay, L A; Porto-Pazos, A B; Cedron, F; Pazos, A
2017-01-01
The human brain is the most complex system in the known universe and is therefore one of the greatest mysteries. It provides human beings with extraordinary abilities. However, it is still not understood how and why most of these abilities are produced. For decades, researchers have been trying to make computers reproduce these abilities, focusing both on understanding the nervous system and on processing data in a more efficient way than before. Their aim is to make computers process information similarly to the brain. Important technological developments and vast multidisciplinary projects have made it possible to create the first simulation with a number of neurons similar to that of a human brain. This paper presents an up-to-date review of the main research projects that are trying to simulate and/or emulate the human brain. They employ different types of computational models using parallel computing: digital models, analog models and hybrid models. This review includes the current applications of these works, as well as future trends. It is focused on various works that look for advanced progress in Neuroscience and still others which seek new discoveries in Computer Science (neuromorphic hardware, machine learning techniques). Their most outstanding characteristics are summarized and the latest advances and future plans are presented. In addition, this review points out the importance of considering not only neurons: computational models of the brain should also include glial cells, given the proven importance of astrocytes in information processing.
Zheng, Ming; Zhang, Shugong; Zhou, You; Liu, Guixia
2018-03-01
Inferring gene regulatory networks (GRNs) is a challenging computational task in system biology. Many inference algorithms have been proposed along with related modifications to various problems. Every algorithm has its own advantages and drawbacks. In particular, the efficiency of each algorithm is not as good as people expect. A novel inference algorithm is proposed in this paper that can be divided into two parts. In the first part, the pre-computational part, two tasks must be accomplished: singular value decomposition for solution space determination and the threshold restriction method for redundant edge deletion. The second part of the algorithm is a hybrid parallel genetic algorithm. In this part, a parallel genetic algorithm is used for a first quick search, after which hill climbing is used for an exact search. The proposed algorithm is validated on both melanoma and type II diabetes GRNs and is compared with other algorithms. The efficiency of our algorithm was tested with different numbers of echoes and nodes. The cross-validation results confirmed the effectiveness of our algorithm, which significantly outperforms other algorithms.
Aspects of computation on asynchronous parallel processors
International Nuclear Information System (INIS)
Wright, M.
1989-01-01
The increasing availability of asynchronous parallel processors has provided opportunities for original and useful work in scientific computing. However, the field of parallel computing is still in a highly volatile state, and researchers display a wide range of opinion about many fundamental questions such as models of parallelism, approaches for detecting and analyzing parallelism of algorithms, and tools that allow software developers and users to make effective use of diverse forms of complex hardware. This volume collects the work of researchers specializing in different aspects of parallel computing, who met to discuss the framework and the mechanics of numerical computing. The far-reaching impact of high-performance asynchronous systems is reflected in the wide variety of topics, which include scientific applications (e.g. linear algebra, lattice gauge simulation, ordinary and partial differential equations), models of parallelism, parallel language features, task scheduling, automatic parallelization techniques, tools for algorithm development in parallel environments, and system design issues
Broadcasting a message in a parallel computer
Berg, Jeremy E [Rochester, MN; Faraj, Ahmad A [Rochester, MN
2011-08-02
Methods, systems, and products are disclosed for broadcasting a message in a parallel computer. The parallel computer includes a plurality of compute nodes connected together using a data communications network. The data communications network is optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer. One compute node of the operational group is assigned to be a logical root. Broadcasting a message in a parallel computer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.
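A Hamiltonian path through one plane of a mesh can be illustrated with a row-by-row "snake" ordering. The sketch below is only an illustration of the idea, not the method claimed in the patent: it assumes a simple rows x cols mesh without wraparound links, places the logical root at the corner node (0, 0), and uses hypothetical function names.

```python
def snake_path(rows: int, cols: int) -> list[tuple[int, int]]:
    """Hamiltonian path over a rows x cols mesh, starting at node (0, 0):
    even rows run left to right, odd rows right to left."""
    path = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in columns)
    return path


def broadcast_along_path(path, message):
    """Forward `message` hop by hop along `path`; returns a dict mapping
    each node to the message it received."""
    received = {path[0]: message}  # the logical root already holds the message
    for sender, receiver in zip(path, path[1:]):
        # Each hop is a point-to-point send between adjacent mesh nodes.
        assert abs(sender[0] - receiver[0]) + abs(sender[1] - receiver[1]) == 1
        received[receiver] = received[sender]
    return received


path = snake_path(3, 4)
delivered = broadcast_along_path(path, "hello")
print(path[:5])          # start of the path: first row, then the turn into row 1
print(len(delivered))    # every node in the plane received the message
```

Because every hop links two neighbouring nodes, the broadcast uses only the point-to-point links the network is optimized for, at the cost of one hop per node along the path.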
Template based parallel checkpointing in a massively parallel computer system
Energy Technology Data Exchange (ETDEWEB)
Archer, Charles Jens [Rochester, MN; Inglett, Todd Alan [Rochester, MN
2009-01-13
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
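The idea of comparing each node's checkpoint data with a template and storing only the blocks that differ can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented method: it compares fixed-offset blocks with SHA-256 digests rather than using an rsync-style rolling checksum, it ignores the network broadcast step, and the block size and all function names are hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def split_blocks(data: bytes) -> list[bytes]:
    """Cut `data` into consecutive blocks of BLOCK_SIZE bytes."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def delta_checkpoint(node_data: bytes, template: bytes) -> dict:
    """Map block index -> (checksum, compressed block) for every block of
    `node_data` that differs from the same block of `template`."""
    template_sums = [checksum(b) for b in split_blocks(template)]
    delta = {}
    for i, block in enumerate(split_blocks(node_data)):
        digest = checksum(block)
        if i >= len(template_sums) or template_sums[i] != digest:
            delta[i] = (digest, zlib.compress(block))  # lossless compression
    return delta


def restore(template: bytes, delta: dict, length: int) -> bytes:
    """Rebuild a node's data from the template plus its stored delta."""
    n_blocks = -(-length // BLOCK_SIZE)  # ceiling division
    blocks = split_blocks(template)[:n_blocks]
    blocks += [b""] * (n_blocks - len(blocks))
    for i, (digest, packed) in delta.items():
        block = zlib.decompress(packed)
        assert checksum(block) == digest  # verify the stored checksum
        blocks[i] = block
    return b"".join(blocks)[:length]


template = b"AAAABBBBCCCC"
node = b"AAAAXXXXCCCC"
delta = delta_checkpoint(node, template)
print(sorted(delta))  # indices of the blocks that had to be stored
assert restore(template, delta, len(node)) == node
```

Only the changed block is stored for this node; when most nodes hold nearly identical state, that is where the reduction in transmitted and stored data comes from.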
An Introduction to Parallel Computation R
Indian Academy of Sciences (India)
found in the Suggested Reading given at the end. Basic Programming Model. A parallel computer can be programmed by providing a program for each processor in it. In most common parallel computer organizations, a processor can only access its local memory. The program provided to each processor may perform ...
Parallel algorithms for mapping pipelined and parallel computations
Nicol, David M.
1988-01-01
Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm^3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm^2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
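To make the mapping problem concrete, the sketch below solves a simplified version: a chain of m modules with integer execution costs is cut into at most n contiguous groups, one per processor of a linear array, so that the most heavily loaded processor (the bottleneck) is as light as possible. It uses a standard binary search over the bottleneck value with a greedy feasibility probe. This is not the algorithm from the paper, it ignores communication costs, and the function names are illustrative.

```python
def fits(weights, n_procs, cap):
    """Greedy probe: can the chain be cut into at most `n_procs`
    contiguous groups whose loads are all <= `cap`?"""
    groups, load = 1, 0
    for w in weights:
        if w > cap:
            return False
        if load + w > cap:
            groups += 1      # start a new group on the next processor
            load = w
        else:
            load += w
    return groups <= n_procs


def min_bottleneck(weights, n_procs):
    """Smallest achievable maximum processor load (integer weights)."""
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(weights, n_procs, mid):
            hi = mid         # feasible: try a lighter bottleneck
        else:
            lo = mid + 1     # infeasible: the bottleneck must be heavier
    return lo


module_costs = [4, 2, 7, 1, 3, 5]
print(min_bottleneck(module_costs, 3))  # groups [4, 2], [7, 1], [3, 5] give 8
```

Each probe is linear in m and the search takes a logarithmic number of probes in the total cost, which gives a feel for why contiguous mappings of this kind can be found in polynomial time.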
CX: A Scalable, Robust Network for Parallel Computing
Directory of Open Access Journals (Sweden)
Peter Cappello
2002-01-01
CX, a network-based computational exchange, is presented. The system's design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, eager scheduling, and space-based coordination. The object-oriented API is simple, compact, and cleanly separates application logic from the logic that supports interprocess communication and fault tolerance. Computations, of course, run to completion in the presence of computational hosts that join and leave the ongoing computation. Such hosts, or producers, use task caching and prefetching to overlap computation with interprocessor communication. To break a potential task server bottleneck, a network of task servers is presented. Even though task servers are envisioned as reliable, the self-organizing, scalable network of n servers, described as a sibling-connected height-balanced fat tree, tolerates a sequence of n-1 server failures. Tasks are distributed throughout the server network via a simple "diffusion" process. CX is intended as a test bed for research on automated silent auctions, reputation services, authentication services, and bonding services. CX also provides a test bed for algorithm research into network-based parallel computation.
Massively parallel evolutionary computation on GPGPUs
Tsutsui, Shigeyoshi
2013-01-01
Evolutionary algorithms (EAs) are metaheuristics that learn from natural collective behavior and are applied to solve optimization problems in domains such as scheduling, engineering, bioinformatics, and finance. Such applications demand acceptable solutions with high-speed execution using finite computational resources. Therefore, there have been many attempts to develop platforms for running parallel EAs using multicore machines, massively parallel cluster machines, or grid computing environments. Recent advances in general-purpose computing on graphics processing units (GPGPU) have opened u
Collectively loading an application in a parallel computer
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Miller, Samuel J.; Mundy, Michael B.
2016-01-05
Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
Parallel Computing Strategies for Irregular Algorithms
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Parallel quantum computing in a single ensemble quantum computer
International Nuclear Information System (INIS)
Long Guilu; Xiao, L.
2004-01-01
We propose a parallel quantum computing mode for an ensemble quantum computer. In this mode, some qubits are in pure states while other qubits are in mixed states. It enables a single ensemble quantum computer to perform 'single-instruction-multidata' type parallel computation. Parallel quantum computing can provide additional speedup in Grover's algorithm and Shor's algorithm. In addition, it makes fuller use of qubit resources in an ensemble quantum computer. As a result, some qubits discarded in the preparation of an effective pure state in the Schulman-Vazirani and the Cleve-DiVincenzo algorithms can be reutilized.
Massively Parallel Computing: A Sandia Perspective
Energy Technology Data Exchange (ETDEWEB)
Dosanjh, Sudip S.; Greenberg, David S.; Hendrickson, Bruce; Heroux, Michael A.; Plimpton, Steve J.; Tomkins, James L.; Womble, David E.
1999-05-06
The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is beginning to realize its potential for enabling significant break-throughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.
Structured Parallel Programming Patterns for Efficient Computation
McCool, Michael; Robison, Arch
2012-01-01
Programming is now parallel programming. Much as structured programming revolutionized traditional serial programming decades ago, a new kind of structured programming, based on patterns, is relevant to parallel programming today. Parallel computing experts and industry insiders Michael McCool, Arch Robison, and James Reinders describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give detailed concrete examples using multiple programming models. Examples are primarily given using two of th
Structured grid generator on parallel computers
International Nuclear Information System (INIS)
Muramatsu, Kazuhiro; Murakami, Hiroyuki; Higashida, Akihiro; Yanagisawa, Ichiro.
1997-03-01
A general purpose structured grid generator for parallel computers, which generates large-scale structured grids efficiently, has been developed. The generator is applicable to Cartesian, cylindrical and BFC (Boundary-Fitted Curvilinear) coordinates. For BFC grids, three topologies are available: L-type, O-type and multi-block type, the last of which enables any combination of L- and O-grids. Internal BFC grid points can be automatically generated and smoothed by either an algebraic supplemental method or a partial differential equation method. The partial differential equation solver is implemented on parallel computers because it consumes a large portion of the overall execution time. High-speed processing of large-scale grid generation can therefore be realized by use of a parallel computer. Generated grid data can be adjusted to the domain decomposition used for parallel analysis. (author)
Computer-Aided Parallelizer and Optimizer
Jin, Haoqiang
2011-01-01
The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
Graph Partitioning Models for Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Hendrickson, B.; Kolda, T.G.
1999-03-02
Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility. We survey several recently proposed alternatives and discuss their relative merits.
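To illustrate why the choice of metric matters, the following minimal Python sketch compares two measures of a partition: the edge cut, and the communication volume counted as distinct (vertex, remote part) pairs. The graph, the partition, and the function names are hypothetical and are not taken from the report.

```python
def edge_cut(edges, part):
    """Count edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def communication_volume(edges, part):
    """Count distinct (vertex, remote part) pairs: a vertex value is sent
    once to every other part that holds at least one of its neighbors."""
    sends = set()
    for u, v in edges:
        if part[u] != part[v]:
            sends.add((u, part[v]))
            sends.add((v, part[u]))
    return len(sends)


# Hypothetical example: vertex 0 (part 0) is adjacent to vertices 1, 2, 3 (part 1).
edges = [(0, 1), (0, 2), (0, 3)]
part = {0: 0, 1: 1, 2: 1, 3: 1}

cut = edge_cut(edges, part)                # 3 cut edges
volume = communication_volume(edges, part)  # 4 values actually moved
```

If every cut edge were assumed to cost one transfer in each direction, the three cut edges would suggest six transfers, yet only four values need to move, because vertex 0 is sent to part 1 once rather than three times.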
Impact analysis on a massively parallel computer
International Nuclear Information System (INIS)
Zacharia, T.; Aramayo, G.A.
1994-01-01
Advanced mathematical techniques and computer simulation play a major role in evaluating and enhancing the design of beverage cans, industrial, and transportation containers for improved performance. Numerical models are used to evaluate the impact requirements of containers used by the Department of Energy (DOE) for transporting radioactive materials. Many of these models are highly compute-intensive. An analysis may require several hours of computational time on current supercomputers despite the simplicity of the models being studied. As computer simulations and materials databases grow in complexity, massively parallel computers have become important tools. Massively parallel computational research at the Oak Ridge National Laboratory (ORNL) and its application to the impact analysis of shipping containers is briefly described in this paper
Locating hardware faults in a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-04-13
Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.
Internode data communications in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Miller, Douglas R.; Parker, Jeffrey J.; Ratterman, Joseph D.; Smith, Brian E.
2013-09-03
Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory.
Link failure detection in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Megerian, Mark G.; Smith, Brian E.
2010-11-09
Methods, apparatus, and products are disclosed for link failure detection in a parallel computer including compute nodes connected in a rectangular mesh network, each pair of adjacent compute nodes in the rectangular mesh network connected together using a pair of links, that includes: assigning each compute node to either a first group or a second group such that adjacent compute nodes in the rectangular mesh network are assigned to different groups; sending, by each of the compute nodes assigned to the first group, a first test message to each adjacent compute node assigned to the second group; determining, by each of the compute nodes assigned to the second group, whether the first test message was received from each adjacent compute node assigned to the first group; and notifying a user, by each of the compute nodes assigned to the second group, whether the first test message was received.
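The grouping and first test round described above can be sketched in Python as a small simulation. This is only an illustration under stated assumptions: the checkerboard rule `(x + y) % 2`, the `failed_links` set of directed links, and all function names are hypothetical, and only the first-group-to-second-group direction is modelled.

```python
def group_of(x, y):
    """Checkerboard assignment: adjacent mesh nodes always differ in parity."""
    return (x + y) % 2


def neighbors(x, y, width, height):
    """Yield the in-bounds 4-neighbors of (x, y) in a width x height mesh."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield nx, ny


def run_first_test(width, height, failed_links):
    """Simulate the first test round.

    failed_links is a set of directed (sender, receiver) pairs whose link
    drops messages. Returns {receiver: [senders whose message never arrived]}.
    """
    received = set()
    # Every group-0 node sends a test message to each adjacent (group-1) node.
    for x in range(width):
        for y in range(height):
            if group_of(x, y) == 0:
                for nb in neighbors(x, y, width, height):
                    if ((x, y), nb) not in failed_links:
                        received.add(((x, y), nb))
    # Every group-1 node checks that it heard from each adjacent group-0 node.
    report = {}
    for x in range(width):
        for y in range(height):
            if group_of(x, y) == 1:
                missing = [nb for nb in neighbors(x, y, width, height)
                           if (nb, (x, y)) not in received]
                if missing:
                    report[(x, y)] = missing
    return report


# Hypothetical 3 x 3 mesh with one broken link from (0, 0) to (1, 0).
report = run_first_test(3, 3, failed_links={((0, 0), (1, 0))})
```

Here `report` names node (1, 0) as the receiver that missed a message from (0, 0); a second round with the roles of the two groups swapped would cover the opposite link direction.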
A parallel computational model for GATE simulations.
Rannou, F R; Vega-Acevedo, N; El Bitar, Z
2013-12-01
GATE/Geant4 Monte Carlo simulations are computationally demanding applications, requiring thousands of processor hours to produce realistic results. The classical strategy of distributing the simulation of individual events does not apply efficiently for Positron Emission Tomography (PET) experiments, because it requires a centralized coincidence processing and large communication overheads. We propose a parallel computational model for GATE that handles event generation and coincidence processing in a simple and efficient way by decentralizing event generation and processing but maintaining a centralized event and time coordinator. The model is implemented with the inclusion of a new set of factory classes that can run the same executable in sequential or parallel mode. A Mann-Whitney test shows that the output produced by this parallel model in terms of number of tallies is equivalent (but not equal) to its sequential counterpart. Computational performance evaluation shows that the software is scalable and well balanced.
Parallel visualization on leadership computing resources
International Nuclear Information System (INIS)
Peterka, T; Ross, R B; Shen, H-W; Ma, K-L; Kendall, W; Yu, H
2009-01-01
Changes are needed in the way that visualization is performed, if we expect the analysis of scientific data to be effective at the petascale and beyond. By using similar techniques as those used to parallelize simulations, such as parallel I/O, load balancing, and effective use of interprocess communication, the supercomputers that compute these datasets can also serve as analysis and visualization engines for them. Our team is assessing the feasibility of performing parallel scientific visualization on some of the most powerful computational resources of the U.S. Department of Energy's National Laboratories in order to pave the way for analyzing the next generation of computational results. This paper highlights some of the conclusions of that research.
Traffic Flow Prediction Model for Large-Scale Road Network Based on Cloud Computing
Directory of Open Access Journals (Sweden)
Zhaosheng Yang
2014-01-01
To increase the efficiency and precision of large-scale road network traffic flow prediction, a genetic algorithm-support vector machine (GA-SVM) model based on cloud computing is proposed in this paper, building on an analysis of the characteristics and defects of the genetic algorithm and the support vector machine. In the cloud computing environment, SVM parameters are first optimized by the parallel genetic algorithm, and this optimized parallel SVM model is then used to predict traffic flow. On the basis of the traffic flow data of Haizhu District in Guangzhou City, the proposed model was verified and compared with the serial GA-SVM model and the parallel GA-SVM model based on MPI (message passing interface). The results demonstrate that the parallel GA-SVM model based on cloud computing has higher prediction accuracy, shorter running time, and higher speedup.
The new landscape of parallel computer architecture
Shalf, John
2007-07-01
The past few years have seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
Arkin, Ethem; Tekinerdogan, Bedir; Imre, Kayhan M.
2017-01-01
The need for high-performance computing together with the increasing trend from single processor to parallel computer architectures has leveraged the adoption of parallel computing. To benefit from parallel computing power, usually parallel algorithms are defined that can be mapped and executed
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2014-08-12
Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Frontiers of massively parallel scientific computation
International Nuclear Information System (INIS)
Fischer, J.R.
1987-07-01
Practical applications using massively parallel computer hardware first appeared during the 1980s. Their development was motivated by the need for computing power orders of magnitude beyond that available today for tasks such as numerical simulation of complex physical and biological processes, generation of interactive visual displays, satellite image analysis, and knowledge based systems. Representative of the first generation of this new class of computers is the Massively Parallel Processor (MPP). A team of scientists was provided the opportunity to test and implement their algorithms on the MPP. The first results are presented. The research spans a broad variety of applications including Earth sciences, physics, signal and image processing, computer science, and graphics. The performance of the MPP was very good. Results obtained using the Connection Machine and the Distributed Array Processor (DAP) are presented
Measuring performance of parallel computers. Final report
Energy Technology Data Exchange (ETDEWEB)
Sullivan, F.
1994-07-01
Performance Measurement - the authors have developed a taxonomy of parallel algorithms based on data motion and example applications have been coded for each class of the taxonomy. Computational benchmark kernels have been extracted for several applications, and detailed measurements have been performed. Algorithms for Massively Parallel SIMD machines - measurement results and computational experiences indicate that top performance will be achieved by `iteration` type algorithms running on massively parallel SIMD machines. Reformulation as iteration may entail unorthodox approaches based on probabilistic methods. The authors have developed such methods for some applications. Here they discuss their approach to performance measurement, describe the taxonomy and measurements which have been made, and report on some general conclusions which can be drawn from the results of the measurements.
Vector and parallel processors in computational science
International Nuclear Information System (INIS)
Duff, I.S.; Reid, J.K.
1985-01-01
These proceedings contain the articles presented at the named conference. These concern hardware and software for vector and parallel processors, numerical methods and algorithms for the computation on such processors, as well as applications of such methods to different fields of physics and related sciences. See hints under the relevant topics. (HSI)
Contributions to computational stereology and parallel programming
DEFF Research Database (Denmark)
Rasmusson, Allan
rotator, even without the need for isotropic sections. To meet the need for computational power to perform image restoration of virtual tissue sections, parallel programming on GPUs has also been part of the project. This has led to a significant change in paradigm for a previously developed surgical...
An Introduction to Parallel Computation R
Indian Academy of Sciences (India)
Parallel computers have also been very useful for solving problems in engineering and biological sciences as well as in business data processing. ..... depends upon whether high speed is the only objective, or whether program portability, ease of program development are also important. For achieving very high speed, ...
Rectilinear partitioning of irregular data parallel computations
Nicol, David M.
1991-01-01
New mapping algorithms are described for domain oriented data-parallel computations in which the workload is distributed irregularly throughout the domain but exhibits localized communication patterns. Researchers consider the problem of partitioning the domain for parallel processing in such a way that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network. Discussed here are an improved algorithm for finding the optimal partitioning in one dimension, new algorithms for partitioning in two dimensions, and optimal partitioning in three dimensions. The application of these algorithms to real problems is discussed.
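The one-dimensional problem, splitting a sequence of workloads into contiguous intervals so that the heaviest interval is as light as possible, can be sketched as follows. This is a textbook binary search with a greedy feasibility probe, assuming non-negative integer weights; it is not the improved algorithm of the report, and the function names and sample weights are hypothetical.

```python
def parts_needed(weights, cap):
    """Greedy probe: fewest contiguous intervals with each sum <= cap.
    Assumes cap >= max(weights), so every single weight fits."""
    count, current = 1, 0
    for w in weights:
        if current + w > cap:
            count += 1      # close the current interval, start a new one
            current = w
        else:
            current += w
    return count


def min_bottleneck(weights, num_parts):
    """Smallest achievable maximum interval load over num_parts contiguous
    intervals, found by binary search on the load cap."""
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if parts_needed(weights, mid) <= num_parts:
            hi = mid        # cap is feasible, try a smaller one
        else:
            lo = mid + 1    # cap is too small
    return lo


# Hypothetical workloads along one dimension, split among 3 processors.
bottleneck = min_bottleneck([4, 1, 3, 2, 6, 2], 3)
```

For these weights the best split is [4, 1, 3], [2, 6], [2], so the most heavily loaded processor carries a load of 8.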
Java parallel secure stream for grid computing
International Nuclear Information System (INIS)
Chen, J.; Akers, W.; Chen, Y.; Watson, W.
2001-01-01
The emergence of high speed wide area networks makes grid computing a reality. However, grid applications that need reliable data transfer still have difficulty achieving optimal TCP performance, because the TCP window size must be tuned to improve bandwidth and reduce latency on a high speed wide area network. The authors present a pure Java package called JPARSS (Java Parallel Secure Stream) that divides data into partitions that are sent over several parallel Java streams simultaneously and allows Java or Web applications to achieve optimal TCP performance in a grid environment without the necessity of tuning the TCP window size. Several experimental results are provided to show that using parallel streams is more effective than tuning the TCP window size. In addition, an X.509 certificate based single sign-on mechanism and SSL based connection establishment are integrated into this package. Finally, a few applications using this package are discussed.
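The partition-and-reassemble idea can be sketched in Python as below. This is not the JPARSS API: the function names are hypothetical, and `send_partition` is an in-memory stand-in for writing one partition to its own TCP stream, so the sketch shows only how data might be split, sent concurrently, and put back in order.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_streams):
    """Cut data into num_streams contiguous partitions of near-equal size."""
    size, extra = divmod(len(data), num_streams)
    parts, start = [], 0
    for i in range(num_streams):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def send_partition(index, chunk):
    """Stand-in for writing one partition to its own stream (assumption).
    It returns the chunk tagged with its index so the receiver can reorder."""
    return index, bytes(chunk)


def transfer(data, num_streams):
    """Send all partitions concurrently, then reassemble them by index."""
    parts = split(data, num_streams)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        results = list(pool.map(send_partition, range(num_streams), parts))
    return b"".join(chunk for _, chunk in sorted(results))


# Hypothetical payload sent over four simulated streams.
payload = bytes(range(256)) * 10
received = transfer(payload, 4)
```

Tagging each partition with its index is what lets the receiving side rebuild the original byte order regardless of which stream finishes first.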
Intranode data communications in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Ratterman, Joseph D; Smith, Brian E
2014-01-07
Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a computer node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process.
Synchronizing compute node time bases in a parallel computer
Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip
2015-01-27
Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.
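The effect of setting each time base to the root-to-node latency can be illustrated with a small Python model. The tree, the per-link latencies, and the function name are hypothetical, and the barrier, pulse waiter, and wakeup unit are not modelled; the sketch only shows why the resulting clocks agree.

```python
def latency_from_root(parent, link_latency, node):
    """Sum per-link latencies along the path from node up to the root."""
    total = 0
    while parent[node] is not None:
        total += link_latency[(parent[node], node)]
        node = parent[node]
    return total


# Hypothetical 4-node tree: 0 is the root; latencies are in arbitrary ticks.
parent = {0: None, 1: 0, 2: 0, 3: 1}
link_latency = {(0, 1): 5, (0, 2): 7, (1, 3): 4}

# When the pulse arrives, each node sets its time base to its own latency,
# so every clock reads "ticks since the root sent the pulse".
time_base = {n: latency_from_root(parent, link_latency, n) for n in parent}

# Compare all clocks at one global instant, 20 ticks after the pulse was sent:
# node n received the pulse at tick time_base[n] and has counted up since then.
global_tick = 20
readings = {n: time_base[n] + (global_tick - time_base[n]) for n in parent}
```

Every entry of `readings` is the same, which is the sense in which the time bases are synchronized in this simplified model.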
On synchronous parallel computations with independent probabilistic choice
International Nuclear Information System (INIS)
Reif, J.H.
1984-01-01
This paper introduces probabilistic choice to synchronous parallel machine models, in particular parallel RAMs. The power of probabilistic choice in parallel computations is illustrated by parallelizing some known probabilistic sequential algorithms. The authors characterize the computational complexity of time, space, and processor bounded probabilistic parallel RAMs in terms of the computational complexity of probabilistic sequential RAMs. They show that parallelism uniformly speeds up time bounded probabilistic sequential RAM computations by nearly a quadratic factor. They also show that probabilistic choice can be eliminated from parallel computations by introducing nonuniformity.
Efficient Parallel Engineering Computing on Linux Workstations
Lou, John Z.
2010-01-01
A C software module has been developed that creates lightweight processes (LWPs) dynamically to achieve parallel computing performance in a variety of engineering simulation and analysis applications to support NASA and DoD project tasks. The required interface between the module and the application it supports is simple, minimal and almost completely transparent to the user applications, and it can achieve nearly ideal computing speed-up on multi-CPU engineering workstations of all operating system platforms. The module can be integrated into an existing application (C, C++, Fortran and others) either as part of a compiled module or as a dynamically linked library (DLL).
A computational fluid dynamics algorithm on a massively parallel computer
International Nuclear Information System (INIS)
Jespersen, D.C.; Levit, C.
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. The authors investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicitly time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. The authors find that the Connection Machine can achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as conventional supercomputers
Advanced neural network-based computational schemes for robust fault diagnosis
Mrugalski, Marcin
2014-01-01
The present book is devoted to problems of adaptation of artificial neural networks to robust fault diagnosis schemes. It presents neural networks-based modelling and estimation techniques used for designing robust fault diagnosis schemes for non-linear dynamic systems. A part of the book focuses on fundamental issues such as architectures of dynamic neural networks, methods for designing of neural networks and fault diagnosis schemes as well as the importance of robustness. The book is of a tutorial value and can be perceived as a good starting point for the new-comers to this field. The book is also devoted to advanced schemes of description of neural model uncertainty. In particular, the methods of computation of neural networks uncertainty with robust parameter estimation are presented. Moreover, a novel approach for system identification with the state-space GMDH neural network is delivered. All the concepts described in this book are illustrated by both simple academic illustrative examples and practica...
Parallelized reliability estimation of reconfigurable computer networks
Nicol, David M.; Das, Subhendu; Palumbo, Dan
1990-01-01
A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.
Computational chaos in massively parallel neural networks
Barhen, Jacob; Gulati, Sandeep
1989-01-01
A fundamental issue which directly impacts the scalability of current theoretical neural network models to massively parallel embodiments, in both software and hardware, is the inherent and unavoidable concurrent asynchronicity of emerging fine-grained computational ensembles and the possible emergence of chaotic manifestations. Previous analyses attributed dynamical instability to the topology of the interconnection matrix, to parasitic components or to propagation delays. However, researchers have observed the existence of emergent computational chaos in a concurrently asynchronous framework, independent of the network topology. Researchers present a methodology enabling the effective asynchronous operation of large-scale neural networks. Necessary and sufficient conditions guaranteeing concurrent asynchronous convergence are established in terms of contracting operators. Lyapunov exponents are computed formally to characterize the underlying nonlinear dynamics. Simulation results are presented to illustrate network convergence to the correct results, even in the presence of large delays.
(Nearly) portable PIC code for parallel computers
International Nuclear Information System (INIS)
Decyk, V.K.
1993-01-01
As part of the Numerical Tokamak Project, the author has developed a (nearly) portable, one dimensional version of the GCPIC algorithm for particle-in-cell codes on parallel computers. This algorithm uses a spatial domain decomposition for the fields, and passes particles from one domain to another as the particles move spatially. With only minor changes, the code has been run in parallel on the Intel Delta, the Cray C-90, the IBM ES/9000 and a cluster of workstations. After a line by line translation into cmfortran, the code was also run on the CM-200. Impressive speeds have been achieved, both on the Intel Delta and the Cray C-90, around 30 nanoseconds per particle per time step. In addition, the author was able to isolate the data management modules, so that the physics modules were not changed much from their sequential version, and the data management modules can be used as "black boxes."
Parallel Scientific Computing in C++ and MPI
Karniadakis, George Em; Kirby, Robert M., II
2003-06-01
This book provides a seamless approach to numerical algorithms, modern programming techniques and parallel computing. These concepts and tools are usually taught serially across different courses and different textbooks, thus obscuring the connection between them. The necessity of integrating these subjects usually comes after such courses are concluded (e.g., during a first job or a thesis project), thus forcing the student to synthesize what is perceived to be three independent subfields into one in order to produce a solution. The book includes both basic and advanced topics and places equal emphasis on the discretization of partial differential equations and on solvers. Advanced topics include wavelets, high-order methods, non-symmetric systems and parallelization of sparse systems. A CD-ROM accompanies the text.
Computation and parallel implementation for early vision
Gualtieri, J. Anthony
1990-01-01
The problem of early vision is to transform one or more retinal illuminance images (pixel arrays) into image representations built out of primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation on SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.
Optical Waveguides in General Purpose Parallel Computers.
Davis, Martin H., Jr.
1992-01-01
This thesis examines how optics can be used in general purpose parallel computing systems. Two basic assumptions are made. First, optical waveguide communications technology will continue to mature and become more and more prevalent in smaller and smaller scale environments. Second, electronic computational capabilities will continue to increase for at least the next decade. Thus, this research explores ways in which optical waveguide communications can be combined with traditional electronic computing elements to support general purpose parallel computing. The specific question asked is, "How can the properties of optical waveguides give rise to architectural features useful for general purpose parallel computing?" The answers to this question are developed in the context of a distributed shared memory computing design called OBee. This work defines the OBee design, a specific implementation, based on optical waveguides, of a previously developed, more abstract architecture named Beehive. The basic building block of OBee's physical optical architecture is an Optical Broadcast Ring (OBR). The thesis defines how one or more waveguides (or wavelengths) are arranged in varying topologies; it also defines several different access protocols. Together, a particular combination of topology and access protocol defines a given OBR's properties. The OBee design employs a particular OBR to define a specific implementation of Beehive's reader initiated cache coherency protocol. The OBee design uses two different OBRs to define two distinct implementations of Beehive's sole synchronization primitive, locks. As improvements to Beehive, OBee adds two more synchronization primitives, barriers and Fetch-and-OP. The OBee design uses two different OBRs to define two distinct implementations of barriers; similarly, it uses two different OBRs to define two distinct implementations of Fetch-and-OP. Analytical evaluations of the performance of the raw architectural primitives are
Computational fluid dynamics on a massively parallel computer
Jespersen, Dennis C.; Levit, Creon
1989-01-01
A finite difference code was implemented for the compressible Navier-Stokes equations on the Connection Machine, a massively parallel computer. The code is based on the ARC2D/ARC3D program and uses the implicit factored algorithm of Beam and Warming. The code uses odd-even elimination to solve linear systems. Timings and computation rates are given for the code, and a comparison is made with a Cray XMP.
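The abstract does not give the details of its linear solver, so the following is only a textbook sketch of odd-even elimination (also known as parallel cyclic reduction) for a single tridiagonal system, not the Connection Machine implementation. The function name and the list-based layout are assumptions. At each step every equation eliminates its two current neighbours, so all rows can be updated independently, which is what suits the method to a massively parallel machine.

```python
def odd_even_elimination(a, b, c, d):
    """Solve a tridiagonal system by odd-even elimination.

    a: sub-diagonal (a[0] must be 0), b: diagonal,
    c: super-diagonal (c[-1] must be 0), d: right-hand side.
    Assumes the pivots stay nonzero, e.g. a diagonally dominant matrix.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, b[:], [0.0] * n, d[:]
        for i in range(n):  # rows are independent: this loop is the parallel part
            lo, hi = i - stride, i + stride
            if lo >= 0:  # eliminate the coupling to x[i - stride]
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:  # eliminate the coupling to x[i + stride]
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2  # remaining couplings are now twice as far apart
    # Every equation is now decoupled: b[i] * x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]


if __name__ == "__main__":
    # Matrix with stencil (-1, 2, -1); right-hand side built from x = [1, 2, 3, 4, 5].
    sub = [0.0, -1.0, -1.0, -1.0, -1.0]
    diag = [2.0] * 5
    sup = [-1.0, -1.0, -1.0, -1.0, 0.0]
    rhs = [0.0, 0.0, 0.0, 0.0, 6.0]
    print(odd_even_elimination(sub, diag, sup, rhs))
```

The loop over strides runs about log2(n) times, compared with the n sequential steps of the usual forward elimination and back substitution.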
Utilizing parallel optimization in computational fluid dynamics
Kokkolaras, Michael
1998-12-01
General problems of interest in computational fluid dynamics are investigated by means of optimization. Specifically, in the first part of the dissertation, a method of optimal incremental function approximation is developed for the adaptive solution of differential equations. Various concepts and ideas utilized by numerical techniques employed in computational mechanics and artificial neural networks (e.g. function approximation and error minimization, variational principles and weighted residuals, and adaptive grid optimization) are combined to formulate the proposed method. The basis functions and associated coefficients of a series expansion, representing the solution, are optimally selected by a parallel direct search technique at each step of the algorithm according to appropriate criteria; the solution is built sequentially. In this manner, the proposed method is adaptive in nature, although a grid is neither built nor adapted in the traditional sense using a-posteriori error estimates. Variational principles are utilized for the definition of the objective function to be extremized in the associated optimization problems, ensuring that the problem is well-posed. Complicated data structures and expensive remeshing algorithms and systems solvers are avoided. Computational efficiency is increased by using low-order basis functions and concurrent computing. Numerical results and convergence rates are reported for a range of steady-state problems, including linear and nonlinear differential equations associated with general boundary conditions, and illustrate the potential of the proposed method. Fluid dynamics applications are emphasized. Conclusions are drawn by discussing the method's limitations, advantages, and possible extensions. The second part of the dissertation is concerned with the optimization of the viscous-inviscid-interaction (VII) mechanism in an airfoil flow analysis code. The VII mechanism is based on the concept of a transpiration velocity
Parallel Proximity Detection for Computer Simulation
Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)
1997-01-01
The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are included by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.
Parallel Proximity Detection for Computer Simulations
Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)
1998-01-01
The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are included by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.
Optimal dynamic remapping of data parallel computations
Nicol, David M.; Reynolds, Paul F., Jr.
1990-01-01
A large class of data parallel computations is characterized by a sequence of phases, with phase changes occurring unpredictably. Dynamic remapping of the workload to processors may be required to maintain good performance. The problem considered, for which the utility of remapping and the future behavior of the workload are uncertain, arises when phases exhibit stable execution requirements during a given phase, but requirements change radically between phases. For these situations, a workload assignment generated for one phase may hinder performance during the next phase. This problem is treated formally for a probabilistic model of computation with at most two phases. The authors address the fundamental problem of balancing the expected remapping performance gain against the delay cost, and they derive the optimal remapping decision policy. The promise of the approach is shown by application to multiprocessor implementations of an adaptive gridding fluid dynamics program and to a battlefield simulation program.
Optimized data communications in a parallel computer
Faraj, Daniel A.
2014-08-19
A parallel computer includes nodes that include a network adapter that couples the node in a point-to-point network and supports communications in opposite directions of each dimension. Optimized communications include: receiving, by a network adapter of a receiving compute node, a packet--from a source direction--that specifies a destination node and deposit hints. Each hint is associated with a direction within which the packet is to be deposited. If a hint indicates the packet to be deposited in the opposite direction: the adapter delivers the packet to an application on the receiving node; forwards the packet to a next node in the opposite direction if the receiving node is not the destination; and forwards the packet to a node in a direction of a subsequent dimension if the hints indicate that the packet is to be deposited in the direction of the subsequent dimension.
An efficient parallel computing scheme for Monte Carlo criticality calculations
International Nuclear Information System (INIS)
Dufek, Jan; Gudowski, Waclaw
2009-01-01
The existing parallel computing schemes for Monte Carlo criticality calculations suffer from a low efficiency when applied on many processors. We suggest a new fission matrix based scheme for efficient parallel computing. The results are derived from the fission matrix that is combined from all parallel simulations. The scheme allows for a practically ideal parallel scaling as no communication among the parallel simulations is required, and inactive cycles are not needed.
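A rough sketch of the idea follows, under stated assumptions: each independent simulation is taken to report a per-cell production tally and a per-cell count of started neutrons (a hypothetical layout; the abstract does not describe the tallies), the tallies are summed into one fission matrix, and the multiplication factor and fission source are read off as its dominant eigenpair by plain power iteration.

```python
def combine_fission_matrices(tallies):
    """Merge tallies from independent simulations into one fission matrix.

    tallies: list of (production, starts) pairs, one per simulation, where
    production[i][j] counts fission neutrons produced in cell i by neutrons
    started in cell j, and starts[j] counts neutrons started in cell j.
    """
    n = len(tallies[0][1])
    starts = [sum(t[1][j] for t in tallies) for j in range(n)]
    return [
        [
            sum(t[0][i][j] for t in tallies) / starts[j] if starts[j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration; returns (eigenvalue, eigenvector normalized to sum 1)."""
    n = len(matrix)
    source = [1.0 / n] * n
    k = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * source[j] for j in range(n)) for i in range(n)]
        k = sum(nxt)  # growth factor, because `source` sums to 1
        source = [v / k for v in nxt]
    return k, source


if __name__ == "__main__":
    sim_a = ([[6, 1], [2, 9]], [10, 10])
    sim_b = ([[10, 3], [2, 7]], [10, 10])
    k_eff, src = dominant_eigenpair(combine_fission_matrices([sim_a, sim_b]))
    print(k_eff, src)
```

Because the simulations only contribute sums, they never need to talk to each other while running, which is the property the abstract credits for the near-ideal scaling.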
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine can achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
Accurate modeling of parallel scientific computations
Nicol, David M.; Townsend, James C.
1988-01-01
Scientific codes are usually parallelized by partitioning a grid among processors. To achieve top performance it is necessary to partition the grid so as to balance workload and minimize communication/synchronization costs. This problem is particularly acute when the grid is irregular, changes over the course of the computation, and is not known until load time. Critical mapping and remapping decisions rest on the ability to accurately predict performance, given a description of a grid and its partition. This paper discusses one approach to this problem, and illustrates its use on a one-dimensional fluids code. The models constructed are shown to be accurate, and are used to find optimal remapping schedules.
Broadcasting collective operation contributions throughout a parallel computer
Faraj, Ahmad [Rochester, MN]
2012-02-21
Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
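The two transmission phases named in the abstract can be simulated serially as below. This is only an illustration: the nested-list layout, the function name, and the interpretation of the "serial processor transmission sequence" as one processor slot per node at a time are assumptions, not details taken from the patent.

```python
def broadcast_contributions(nodes):
    """Simulate the two-phase broadcast.

    nodes[k][p] is the contribution of processor p on compute node k.
    Returns received[k][p], the set of contributions known to that processor.
    """
    received = [[{c} for c in node] for node in nodes]

    # Phase 1: intra-node communications; every processor shares with its node.
    for k, node in enumerate(nodes):
        for p in range(len(node)):
            received[k][p].update(node)

    # Phase 2: inter-node communications; processors take turns on the
    # designated link, one slot at a time (assumed ordering).
    slots = max(len(node) for node in nodes)
    for slot in range(slots):
        for k, node in enumerate(nodes):
            if slot >= len(node):
                continue
            for other, other_node in enumerate(nodes):
                if other == k:
                    continue
                for q in range(len(other_node)):
                    received[other][q].add(node[slot])
    return received


if __name__ == "__main__":
    result = broadcast_contributions([["a0", "a1"], ["b0", "b1"]])
    print(result[0][0])
```

After both phases every processor holds every contribution, which is the end state a collective broadcast of this kind has to reach.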
Automatic Parallelization Tool: Classification of Program Code for Parallel Computing
Directory of Open Access Journals (Sweden)
Mustafa Basthikodi
2016-04-01
Performance growth of single-core processors came to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks, along with Graphical Processing Units, have broadly enhanced parallelism. A number of compilers have been updated to address the emerging challenges of synchronization and threading. Appropriate program and algorithm classifications will greatly help software engineers find opportunities for effective parallelization. In the present work we investigate current species for the classification of algorithms; related work on classification is discussed, along with a comparison of the issues that challenge classification. A set of algorithms is chosen whose structure matches different issues and which perform a given task. We have tested these algorithms using existing automatic species extraction tools along with the Bones compiler. We have added functionality to the existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by the original species of algorithms. We implemented the new theory in the tool, enabling automatic characterization of program code.
Applications of the parallel computing system using network
International Nuclear Information System (INIS)
Ido, Shunji; Hasebe, Hiroki
1994-01-01
Parallel programming is applied to multiple processors connected by Ethernet. Data exchange between tasks located in each processing element is realized in two ways. One is the socket, which is a standard library on recent UNIX operating systems. The other is network connection software named Parallel Virtual Machine (PVM), free software developed by ORNL, which uses many workstations connected to a network as a parallel computer. This paper discusses the availability of parallel computing using a network and UNIX workstations, and a comparison with specialized parallel systems (Transputer and iPSC/860) in a Monte Carlo simulation, which generally shows a high parallelization ratio. (author)
Parallel computing and networking; Heiretsu keisanki to network
Energy Technology Data Exchange (ETDEWEB)
Asakawa, E.; Tsuru, T. [Japan National Oil Corp., Tokyo (Japan); Matsuoka, T. [Japan Petroleum Exploration Co. Ltd., Tokyo (Japan)
1996-05-01
This paper describes the trend of parallel computers used in geophysical exploration. Parallel computers first began to be used for geophysical exploration around 1993. At that time these computers were classified mainly as MIMD (multiple instruction stream, multiple data stream), SIMD (single instruction stream, multiple data stream) and the like. Parallel computers were publicized at the 1994 meeting of the Geophysical Exploration Society as a "high precision imaging technology". Concerning the library of parallel computers, there was a shift to PVM (parallel virtual machine) in 1993 and to MPI (message passing interface) in 1995. In addition, the FORTRAN90 compiler was released with support implemented for data parallel and vector computers. In 1993, the networks used were Ethernet, FDDI, CDDI and HIPPI. In 1995, OC-3 products under ATM began to spread. However, ATM remains an interoffice high-speed network because ATM service has not yet spread to the public network. 1 ref.
Arkin, Ethem; Tekinerdogan, Bedir
2016-01-01
Mapping parallel algorithms to parallel computing platforms requires several activities such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform, the mapping of the algorithm to the logical configuration platform and the implementation of the
Adaptive and Parallel Computational Techniques in Materials Science
National Research Council Canada - National Science Library
Flaherty, Joseph
1998-01-01
This Augmentation Award for Science and Engineering Research Training (AASERT) supported research students to work on adaptive and parallel computational techniques associated with crystal growth processing...
Parallel Computing for Terrestrial Ecosystem Carbon Modeling
International Nuclear Information System (INIS)
Wang, Dali; Post, Wilfred M.; Ricciuto, Daniel M.; Berry, Michael
2011-01-01
Terrestrial ecosystems are a primary component of research on global environmental change. Observational and modeling research on terrestrial ecosystems at the global scale, however, has lagged behind its counterparts for oceanic and atmospheric systems, largely because of the unique challenges associated with the tremendous diversity and complexity of terrestrial ecosystems. There are 8 major types of terrestrial ecosystem: tropical rain forest, savannas, deserts, temperate grassland, deciduous forest, coniferous forest, tundra, and chaparral. The carbon cycle is an important mechanism in the coupling of terrestrial ecosystems with climate through biological fluxes of CO2. The influence of terrestrial ecosystems on atmospheric CO2 can be modeled via several means at different timescales. Important processes include plant dynamics, change in land use, as well as ecosystem biogeography. Over the past several decades, many terrestrial ecosystem models (see the 'Model developments' section) have been developed to understand the interactions between terrestrial carbon storage and CO2 concentration in the atmosphere, as well as the consequences of these interactions. Early TECMs generally adapted simple box-flow exchange models, in which photosynthetic CO2 uptake and respiratory CO2 release are simulated in an empirical manner with a small number of vegetation and soil carbon pools. Demands on kinds and amount of information required from global TECMs have grown. Recently, along with the rapid development of parallel computing, spatially explicit TECMs with detailed process based representations of carbon dynamics have become attractive, because those models can readily incorporate a variety of additional ecosystem processes (such as dispersal, establishment, growth, mortality etc.) and environmental factors (such as landscape position, pest populations, disturbances, resource manipulations, etc.), and provide information to frame policy options for climate change
Fast parallel computation of polynomials using few processors
DEFF Research Database (Denmark)
Valiant, Leslie; Skyum, Sven
1981-01-01
It is shown that any multivariate polynomial that can be computed sequentially in C steps and has degree d can be computed in parallel in O((log d)(log C + log d)) steps using only (Cd)^O(1) processors.
Fast Parallel Computation of Polynomials Using Few Processors
DEFF Research Database (Denmark)
Valiant, Leslie G.; Skyum, Sven; Berkowitz, S.
1983-01-01
It is shown that any multivariate polynomial of degree $d$ that can be computed sequentially in $C$ steps can be computed in parallel in $O((\log d)(\log C + \log d))$ steps using only $(Cd)^{O(1)}$ processors.
Evaluation of DEC's GIGAswitch for distributed parallel computing
Energy Technology Data Exchange (ETDEWEB)
Chen, H.; Hutchins, J.; Brandt, J.
1993-10-01
One of Sandia's research efforts is to reduce the end-to-end communication delay in a parallel-distributed computing environment. GIGAswitch is DEC's implementation of a gigabit local area network based on switched FDDI technology. Using the GIGAswitch, the authors intend to minimize the medium access latency suffered by shared-medium FDDI technology. Experimental results show that the GIGAswitch adds 16.5 microseconds of switching and bridging delay to an end-to-end communication. Although the added latency causes a 1.8% throughput degradation and a 5% line efficiency degradation, the availability of dedicated bandwidth is much more than what is available to a workstation on a shared medium. For example, ten directly connected workstations each would have a dedicated bandwidth of 95 Mbps, but if they were sharing the FDDI bandwidth, each would have 10% of the total bandwidth, i.e., less than 10 Mbps. In addition, they have found that when there is no output port contention, the switch's aggregate bandwidth will scale up to multiples of its port bandwidth. However, with output port contention, the throughput and latency performance suffered significantly. Their mathematical and simulation models indicate that the GIGAswitch line efficiency could be as low as 63% when there are nine input ports contending for the same output port. The data indicate that the delay introduced by contention at the server workstation is 50 times that introduced by the GIGAswitch. The authors conclude that the GIGAswitch meets the performance requirements of today's high-end workstations and that the switched FDDI technology provides an alternative that utilizes existing workstation interfaces while increasing the aggregate bandwidth. However, because the speed of workstations is increasing by a factor of 2 every 1.5 years, the switched FDDI technology is only good as an interim solution.
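The bandwidth comparison and the growth argument above reduce to simple arithmetic, sketched here. The 95 Mbps dedicated figure, the ten workstations, and the doubling period of 1.5 years come from the abstract; the 100 Mbps nominal FDDI rate is the standard figure for FDDI, and the function names are hypothetical.

```python
FDDI_NOMINAL_MBPS = 100.0  # nominal FDDI data rate
DEDICATED_MBPS = 95.0      # per-workstation figure quoted in the abstract


def shared_bandwidth_per_station(total_mbps, stations):
    """Upper bound on each station's share of a shared medium."""
    return total_mbps / stations


def workstation_speed_factor(years, doubling_period_years=1.5):
    """Relative workstation speed after `years`, doubling every period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    shared = shared_bandwidth_per_station(FDDI_NOMINAL_MBPS, 10)
    print(f"shared: at most {shared} Mbps per workstation")
    print(f"dedicated is about {DEDICATED_MBPS / shared} times larger")
    print(f"speed after 3 years: x{workstation_speed_factor(3.0)}")
```

The shared figure is an upper bound, which matches the abstract's "less than 10 Mbps"; the growth factor shows how quickly a fixed port bandwidth is outpaced by workstation speed.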
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-10-29
Data communications in a parallel active messaging interface ('PAMI') of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.
The ongoing investigation of high performance parallel computing in HEP
Peach, Kenneth J; Böck, R K; Dobinson, Robert W; Hansroul, M; Norton, Alan Robert; Willers, Ian Malcolm; Baud, J P; Carminati, F; Gagliardi, F; McIntosh, E; Metcalf, M; Robertson, L; CERN. Geneva. Detector Research and Development Committee
1993-01-01
Past and current exploitation of parallel computing in High Energy Physics is summarized and a list of R & D projects in this area is presented. The applicability of new parallel hardware and software to physics problems is investigated, in the light of the requirements for computing power of LHC experiments and the current trends in the computer industry. Four main themes are discussed (possibilities for a finer grain of parallelism; fine-grain communication mechanism; usable parallel programming environment; different programming models and architectures, using standard commercial products). Parallel computing technology is potentially of interest for offline and vital for real time applications in LHC. A substantial investment in applications development and evaluation of state of the art hardware and software products is needed. A solid development environment is required at an early stage, before mainline LHC program development begins.
A scalable parallel black oil simulator on distributed memory parallel computers
Wang, Kun; Liu, Hui; Chen, Zhangxin
2015-11-01
This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
Directory of Open Access Journals (Sweden)
Dan ZHANG
2010-10-01
Accuracy is paramount for the further development of parallel mechanisms in the real world, especially in industry. Previous research focused on the improvement of rigidity and load capacity, which is related to the stiffness matrix of the closed loop kinematic structure. However, if the mechanical structure has been predefined or fabricated, the stiffness matrix can only be used to search for the optimal configuration in the workspace; it fails to enhance the accuracy at a given pose. In this research, the concept of optimal robust calibration is developed as an effective approach to largely reduce various errors of the predefined parallel mechanism. A novel coevolutionary radial basis function (RBF) neural network based soft sensor is proposed to implement the optimal robust calibration procedure. A six-degrees-of-freedom parallel kinematics manipulator is selected as the case study to verify the proposed methodology. The results demonstrate that the coevolutionary neural network performs excellently as a smart soft sensor for the calibration of closed loop kinematic structures.
Improvements on binary coding using parallel computing
Fuentes, P A; Quintas, D G
2011-01-01
The error-correcting codes have many applications in fields related to communications. This paper tackles some partition algorithms to optimize the data encoding. These algorithms are based on sliding windows and allow for a parallel implementation. We analyse them and then expound a comparative study between the different partition methods that we propose.
Dynamic traffic assignment on parallel computers
Energy Technology Data Exchange (ETDEWEB)
Nagel, K.; Frye, R.; Jakob, R.; Rickert, M.; Stretz, P.
1998-12-01
The authors describe part of the current framework of the TRANSIMS traffic research project at the Los Alamos National Laboratory. It includes parallel implementations of a route planner and a microscopic traffic simulation model. They present performance figures and results of an offline load-balancing scheme used in one of the iterative re-planning runs required for dynamic route assignment.
Heuristic framework for parallel sorting computations | Nwanze ...
African Journals Online (AJOL)
Parallel sorting techniques have become of practical interest with the advent of new multiprocessor architectures. The decreasing cost of these processors will probably make the solutions derived from them more appealing in the future. Efficient algorithms for sorting schemes that are encountered in a number of ...
Algorithms for parallel and vector computations
Ortega, James M.
1995-01-01
This is a final report on work performed under NASA grant NAG-1-1112-FOP during the period March 1990 through February 1995. Four major topics are covered: (1) solution of nonlinear Poisson-type equations; (2) parallel reduced system conjugate gradient method; (3) orderings for conjugate gradient preconditioners; and (4) SOR as a preconditioner.
Vector and parallel processors in computational science
International Nuclear Information System (INIS)
Duff, I.S.; Reid, J.K.
1985-01-01
This book presents the papers given at a conference which reviewed the new developments in parallel and vector processing. Topics considered at the conference included hardware (array processors, supercomputers), programming languages, software aids, numerical methods (e.g., Monte Carlo algorithms, iterative methods, finite elements, optimization), and applications (e.g., neutron transport theory, meteorology, image processing)
Parallel computing, failure recovery, and extreme values
DEFF Research Database (Denmark)
Andersen, Lars Nørvang; Asmussen, Søren
A task of random size T is split into M subtasks of lengths T1, . . . , TM, each of which is sent to one out of M parallel processors. Each processor may fail at a random time before completing its allocated task, and then has to restart it from the beginning. If X1, . . . ,XM are the total task ...
Parallel algorithms and architecture for computation of manipulator forward dynamics
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n^2), and the O(n^3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class NC and that the time and processor bounds are O((log_2 n)^2) and O(n^4), respectively. However, the fastest stable parallel algorithms achieve a computation time of O(n) and can be derived by parallelization of the O(n^3) serial algorithms. Parallel computation of the O(n^3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders of magnitude in the computation of the forward dynamics.
Parallel computing in genomic research: advances and applications
Directory of Open Access Journals (Sweden)
Ocaña K
2015-11-01
Kary Ocaña (National Laboratory of Scientific Computing, Petrópolis, Rio de Janeiro, Brazil); Daniel de Oliveira (Institute of Computing, Fluminense Federal University, Niterói, Brazil). Abstract: Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of terabytes and petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied to reduce the total processing time and to ease the management, treatment, and analysis of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing units requires expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists to process their genomic experiments using HPC capabilities and parallelism techniques. This article presents a systematic literature review that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. Keywords: high-performance computing, genomic research, cloud computing, grid computing, cluster computing, parallel computing
Basic design of parallel computational program for probabilistic structural analysis
International Nuclear Information System (INIS)
Kaji, Yoshiyuki; Arai, Taketoshi; Gu, Wenwei; Nakamura, Hitoshi
1999-06-01
In our laboratory, as part of the project 'Development of damage evaluation method of structural brittle materials by microscopic fracture mechanics and probabilistic theory' (nuclear computational science cross-over research), we are examining computational methods for a super-parallel computation system. These methods are coupled with a material strength theory based on microscopic fracture mechanics for latent cracks and with a continuum structural model, with the aim of developing new structural reliability evaluation methods for ceramic structures. This technical report reviews probabilistic structural mechanics theory, the basic formulas, and the parallel programming methods that are relevant to the basic design of the computational mechanics program. (author)
Computing Optimal Cycle Mean in Parallel on CUDA
Directory of Open Access Journals (Sweden)
Jiří Barnat
2011-10-01
Computation of optimal cycle mean in a directed weighted graph has many applications in program analysis, performance verification in particular. In this paper we propose a data-parallel algorithmic solution to the problem and show how the computation of optimal cycle mean can be efficiently accelerated by means of CUDA technology. We show how the problem of computation of optimal cycle mean is decomposed into a sequence of data-parallel graph computation primitives and show how these primitives can be implemented and optimized for CUDA computation. Finally, we report a fivefold experimental speedup on graphs representing models of distributed systems when compared to the best sequential algorithms.
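For orientation, a sequential reference for the minimum cycle mean can be sketched with Karp's classical dynamic program. This is not the data-parallel CUDA method of the paper; the function name `min_cycle_mean`, the edge-list format, and the restriction to integer weights are assumptions made for this illustration.

```python
from fractions import Fraction
from math import inf


def min_cycle_mean(num_nodes, edges):
    """Return the minimum mean edge weight over all directed cycles.

    edges is a list of (u, v, w) triples with integer weights w and node ids
    in range(num_nodes). Returns None if the graph has no cycle.
    """
    n = num_nodes
    # d[k][v] = minimum weight of a walk with exactly k edges ending at v.
    # Starting every node at 0 acts like a virtual source linked to all nodes.
    d = [[inf] * n for _ in range(n + 1)]
    d[0] = [0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w

    best = None
    for v in range(n):
        if d[n][v] == inf:
            continue  # no walk of length n ends here, so v gives no bound
        # Karp's formula: maximize over k, then minimize over v.
        worst = max(
            Fraction(d[n][v] - d[k][v], n - k)
            for k in range(n)
            if d[k][v] != inf
        )
        if best is None or worst < best:
            best = worst
    return best
```

The relaxation over all edges inside each round `k` is the part that maps naturally onto data-parallel primitives, since every edge can be examined independently before a per-node minimum is taken.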
Performance of Air Pollution Models on Massively Parallel Computers
DEFF Research Database (Denmark)
Brown, John; Hansen, Per Christian; Wasniewski, Jerzy
1996-01-01
To compare the performance and use of three massively parallel SIMD computers, we implemented a large air pollution model on the computers. Using a realistic large-scale model, we gain detailed insight about the performance of the three computers when used to solve large-scale scientific problems...... that involve several types of numerical computations. The computers considered in our study are the Connection Machines CM-200 and CM-5, and the MasPar MP-2216...
Software metrics for green parallel computing of big data systems
Gurbuz, Havva Gulay; Tekinerdogan, Bedir
2016-01-01
Big Data is typically organized around a distributed file system on top of which the parallel algorithms can be executed for realizing the Big Data analytics. In general, the parallel algorithms can be mapped in different alternative ways to the computing platform. Hereby each alternative will
Research in Parallel Algorithms and Software for Computational Aerosciences
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics (CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Combined Scheduling and Mapping for Scalable Computing with Parallel Tasks
Directory of Open Access Journals (Sweden)
Jörg Dümmler
2012-01-01
Recent and future parallel clusters and supercomputers use symmetric multiprocessors (SMPs) and multi-core processors as basic nodes, providing a huge amount of parallel resources. These systems often have hierarchically structured interconnection networks combining computing resources at different levels, starting with the interconnect within multi-core processors up to the interconnection network combining nodes of the cluster or supercomputer. The challenge for the programmer is that these computing resources should be utilized efficiently by exploiting the available degree of parallelism of the application program and by structuring the application in a way which is sensitive to the heterogeneous interconnect. In this article, we pursue a parallel programming method using parallel tasks to structure parallel implementations. A parallel task can be executed by multiple processors or cores and, for each activation of a parallel task, the actual number of executing cores can be adapted to the specific execution situation. In particular, we propose a new combined scheduling and mapping technique for parallel tasks with dependencies that takes the hierarchical structure of modern multi-core clusters into account. An experimental evaluation shows that the presented programming approach can lead to a significantly higher performance compared to standard data parallel implementations.
Parallel computing in plasma physics: Nonlinear instabilities
International Nuclear Information System (INIS)
Pohn, E.; Kamelander, G.; Shoucri, M.
2000-01-01
A Vlasov-Poisson system is used for studying the time evolution of the charge separation at a spatially one-dimensional as well as a two-dimensional plasma edge. Ions are advanced in time using the Vlasov equation. The whole three-dimensional velocity space is considered, leading to very time-consuming four- and five-dimensional fully kinetic simulations, respectively. In the 1D simulations electrons are assumed to behave adiabatically, i.e. they are Boltzmann-distributed, leading to a nonlinear Poisson equation. In the 2D simulations a gyro-kinetic approximation is used for the electrons. The plasma is assumed to be initially neutral. The simulations are performed on an equidistant grid. A constant time-step is used for advancing the density-distribution function in time. The time evolution of the distribution function is performed using a splitting scheme. Each dimension (x, y, v_x, v_y, v_z) of the phase space is advanced in time separately. The value of the distribution function for the next time is calculated from the value at a point of the present time that is, in general, interstitial (fractional shift). One-dimensional cubic-spline interpolation is used for calculating the interstitial function values. After the fractional shifts are performed for each dimension of the phase space, a whole time-step for advancing the distribution function is finished. Afterwards the charge density is calculated, the Poisson equation is solved and the electric field is calculated before the next time-step is performed. The fractional shift method sketched above was parallelized for p processors as follows. Considering first the shifts in y-direction, a proper parallelization strategy is to split the grid into p disjoint v_z-slices, which are sub-grids, each containing a different 1/p-th part of the v_z range but the whole range of all other dimensions. Each processor is responsible for performing the y-shifts on a different slice, which can be done in parallel without any communication between
Parallel computing solution of Boltzmann neutron transport equation
International Nuclear Information System (INIS)
Ansah-Narh, T.
2010-01-01
The focus of the research was on developing a parallel computing algorithm for solving the eigenvalues of the Boltzmann Neutron Transport Equation (BNTE) in a slab geometry using a multi-grid approach. In response to the problem of slow execution of serial computing when solving large problems, such as the BNTE, the study focused on the design of parallel computing systems, an evolution of serial computing that uses multiple processing elements simultaneously to solve complex physical and mathematical problems. The finite element method (FEM) was used for the spatial discretization scheme, while angular discretization was accomplished by expanding the angular dependence in terms of Legendre polynomials. The eigenvalues representing the multiplication factors in the BNTE were determined by the power method. MATLAB Compiler Version 4.1 (R2009a) was used to compile the MATLAB codes of the BNTE. The implemented parallel algorithms were enabled with matlabpool, a Parallel Computing Toolbox function. The option UseParallel was set to 'always' (its default value was 'never'). When those conditions held, the solvers computed estimated gradients in parallel. The parallel computing system was used to handle all the bottlenecks in the matrix generated from the finite element scheme and each domain of the power method generated. The parallel algorithm was implemented on a Symmetric Multi Processor (SMP) cluster machine, which had 32-bit quad-core Intel x86 processors. Convergence rates and timings for the algorithm on the SMP cluster machine were obtained. Numerical experiments indicated the designed parallel algorithm could reach perfect speedup and had good stability and scalability. (au)
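The power method mentioned in this abstract can be illustrated with a minimal sketch. The small matrix passed to it is a toy stand-in, not the discretized BNTE operator, and the function name `power_method` and its parameters are assumptions chosen for this example rather than anything taken from the thesis code.

```python
def power_method(matrix, max_iterations=500, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix.

    matrix is a list of rows. The eigenvalue estimate is the Rayleigh
    quotient of the current iterate.
    """
    n = len(matrix)
    x = [1.0] * n
    eigenvalue = 0.0
    for _ in range(max_iterations):
        # Matrix-vector product; each row is independent of the others,
        # which is where a parallel implementation could split the work.
        y = [sum(a * b for a, b in zip(row, x)) for row in matrix]
        new_eigenvalue = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        norm = max(abs(v) for v in y)
        if norm == 0.0:
            return 0.0, x  # the iterate was mapped to zero
        x = [v / norm for v in y]  # rescale to avoid overflow or underflow
        if abs(new_eigenvalue - eigenvalue) < tol:
            return new_eigenvalue, x
        eigenvalue = new_eigenvalue
    return eigenvalue, x
```

For instance, `power_method([[2.0, 0.0], [0.0, 1.0]])` converges toward the dominant eigenvalue 2 with an eigenvector along the first axis.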
Fluid dynamics parallel computer development at NASA Langley Research Center
Townsend, James C.; Zang, Thomas A.; Dwoyer, Douglas L.
1987-01-01
To accomplish more detailed simulations of highly complex flows, such as the transition to turbulence, fluid dynamics research requires computers much more powerful than any available today. Only parallel processing on multiple-processor computers offers hope for achieving the required effective speeds. Looking ahead to the use of these machines, the fluid dynamicist faces three issues: algorithm development for near-term parallel computers, architecture development for future computer power increases, and assessment of possible advantages of special purpose designs. Two projects at NASA Langley address these issues. Software development and algorithm exploration is being done on the FLEX/32 Parallel Processing Research Computer. New architecture features are being explored in the special purpose hardware design of the Navier-Stokes Computer. These projects are complementary and are producing promising results.
USE OF PARALLEL COMPUTING IN MASS PROCESSING OF LASER DATA
Directory of Open Access Journals (Sweden)
Będkowski Janusz
2015-12-01
The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.
History Matching in Parallel Computational Environments
Energy Technology Data Exchange (ETDEWEB)
Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav
2004-08-31
In the probabilistic approach for history matching, the information from the dynamic data is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, fluid properties, well configuration, flow constraints on wells, etc. This implies that the probabilistic approach should update different regions of the reservoir in different ways, which necessitates delineation of multiple reservoir domains in order to increase the accuracy of the approach. The research focuses on a probabilistic approach to integrate dynamic data that ensures consistency between reservoir models developed from one stage to the next. The algorithm relies on efficient parameterization of the dynamic data integration problem and permits rapid assessment of the updated reservoir model at each stage. The report also outlines various domain decomposition schemes from the perspective of increasing the accuracy of the probabilistic approach to history matching. Research progress in three important areas of the project is discussed: (1) validation and testing of the probabilistic approach to incorporating production data in reservoir models; (2) development of a robust scheme for identifying reservoir regions that will result in a more robust parameterization of the history matching process; (3) testing of commercial simulators for parallel capability and development of a parallel algorithm for history matching.
Parallel ray tracing for one-dimensional discrete ordinate computations
International Nuclear Information System (INIS)
Jarvis, R.D.; Nelson, P.
1996-01-01
The ray-tracing sweep in discrete-ordinates, spatially discrete numerical approximation methods applied to the linear, steady-state, plane-parallel, mono-energetic, azimuthally symmetric, neutral-particle transport equation can be reduced to a parallel prefix computation. In so doing, the often severe penalty in convergence rate of the source iteration, suffered by most current parallel algorithms using spatial domain decomposition, can be avoided while attaining parallelism in the spatial domain to whatever extent desired. In addition, the reduction implies parallel algorithm complexity limits for the ray-tracing sweep. The reduction applies to all closed, linear, one-cell functional (CLOF) spatial approximation methods, which encompasses most in current popular use. Scalability test results of an implementation of the algorithm on a 64-node nCube-2S hypercube-connected, message-passing, multi-computer are described. (author)
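The reduction to a parallel prefix computation can be illustrated with a small sketch. It assumes, purely for illustration, that each cell's update of the outgoing flux is an affine map `psi_out = a * psi_in + b`; the coefficients, function names, and the simulated log-step scan below are hypothetical and are not taken from the paper. Because composing affine maps is associative, the whole sweep becomes a prefix scan over the cells.

```python
def compose(first, second):
    """Affine map equivalent to applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def prefix_scan(maps):
    """Inclusive scan over affine maps in about log2(n) rounds.

    Within one round every entry is computed from the previous round only,
    so the entries of a round could be assigned to different processors.
    """
    result = list(maps)
    step = 1
    while step < len(result):
        result = [
            compose(result[i - step], result[i]) if i >= step else result[i]
            for i in range(len(result))
        ]
        step *= 2
    return result


def sweep_sequential(maps, inflow):
    """Ordinary cell-by-cell sweep, used here as the reference."""
    values = []
    psi = inflow
    for a, b in maps:
        psi = a * psi + b
        values.append(psi)
    return values


def sweep_by_scan(maps, inflow):
    """Same sweep, obtained by applying each prefix map to the inflow."""
    return [a * inflow + b for a, b in prefix_scan(maps)]
```

With exact integer coefficients both routes give identical results, for example `sweep_by_scan([(2, 1), (3, 0), (1, 5)], 1)` and `sweep_sequential([(2, 1), (3, 0), (1, 5)], 1)` both yield `[3, 9, 14]`.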
Performing a global barrier operation in a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2014-12-09
Executing computing tasks on a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier.
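The two-level barrier described here can be mimicked with threads. The sketch below is only an analogy under stated assumptions: threads stand in for tasks, each "node" is a group of threads, rank 0 is taken to be the master, and a semaphore approximates "has joined the local barrier". None of the names come from the patent.

```python
import threading


def run_barrier_demo(num_nodes=3, tasks_per_node=4):
    """Run one hierarchical barrier and return the ordered event log."""
    global_barrier = threading.Barrier(num_nodes)  # one party per master
    local_barriers = [threading.Barrier(tasks_per_node) for _ in range(num_nodes)]
    arrived = [threading.Semaphore(0) for _ in range(num_nodes)]
    log = []
    log_lock = threading.Lock()

    def record(event):
        with log_lock:
            log.append(event)

    def task(node, rank):
        if rank != 0:
            # Non-master: announce arrival, then wait in the local barrier.
            arrived[node].release()
            local_barriers[node].wait()
        else:
            # Master: wait until every other task on this node has arrived.
            for _ in range(tasks_per_node - 1):
                arrived[node].acquire()
            record(("global-enter", node))
            global_barrier.wait()        # synchronize with the other masters
            local_barriers[node].wait()  # releases the tasks on this node
        record(("released", node, rank))

    threads = [
        threading.Thread(target=task, args=(node, rank))
        for node in range(num_nodes)
        for rank in range(tasks_per_node)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log, every `global-enter` event precedes every `released` event, because no local barrier can open until its master has passed the global barrier.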
Parallel algorithms and architectures for computational structural mechanics
Patrick, Merrell; Ma, Shing; Mahajan, Umesh
1989-01-01
The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallel computation.
Integrated computer network high-speed parallel interface
International Nuclear Information System (INIS)
Frank, R.B.
1979-03-01
As the number and variety of computers within Los Alamos Scientific Laboratory's Central Computer Facility grows, the need for a standard, high-speed intercomputer interface has become more apparent. This report details the development of a High-Speed Parallel Interface from conceptual through implementation stages to meet current and future needs for large-scale network computing within the Integrated Computer Network. 4 figures
Parallel structures in human and computer memory
Kanerva, Pentti
1986-08-01
If we think of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library: We recognize a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. This paper is about how to construct a computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do. Such a memory is remarkably similar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. The paper concludes that the frame problem of artificial intelligence could be solved by the use of such a memory if we were able to encode information about the world properly.
Parallel computer calculation of quantum spin lattices
International Nuclear Information System (INIS)
Lamarcq, J.
1998-01-01
Numerical simulation allows theorists to convince themselves of the validity of the models they use. In particular, by simulating spin lattices one can judge the validity of a conjecture. Simulating a system defined by a large number of degrees of freedom requires highly sophisticated machines. This study deals with modelling the magnetic interactions between the ions of a crystal. Many exact results have been found for spin-1/2 systems, but not for systems of other spins, for which many simulations have been carried out. Interest in simulations has been renewed by Haldane's conjecture, which stipulates the existence of an energy gap between the ground state and the first excited states of a spin-1 lattice. The existence of this gap has been experimentally demonstrated. This report contains the following four chapters: 1. Spin systems; 2. Calculation of eigenvalues; 3. Programming; 4. Parallel calculation
Global seismic tomography and modern parallel computers
Directory of Open Access Journals (Sweden)
A. Piersanti
2006-06-01
Rapid technological progress is providing seismic tomographers with computers of rapidly increasing speed and RAM, which are not always properly taken advantage of. Large computers with both shared-memory and distributed-memory architectures have made it possible to approach the tomographic inverse problem more accurately. For example, resolution can be quantified from the resolution matrix rather than checkerboard tests; the covariance matrix can be calculated to evaluate the propagation of errors from data to model parameters; the L-curve method can be applied to determine a range of acceptable regularization schemes. We show how these exercises can be implemented efficiently on different hardware architectures.
Misleading Performance Claims in Parallel Computations
Energy Technology Data Exchange (ETDEWEB)
Bailey, David H.
2009-05-29
In a previous humorous note entitled 'Twelve Ways to Fool the Masses,' I outlined twelve common ways in which performance figures for technical computer systems can be distorted. In this paper and accompanying conference talk, I give a reprise of these twelve 'methods' and give some actual examples that have appeared in peer-reviewed literature in years past. I then propose guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion, not only in the world of device simulation but also in the larger arena of technical computing.
Finite element electromagnetic field computation on the Sequent Symmetry 81 parallel computer
International Nuclear Information System (INIS)
Ratnajeevan, S.; Hoole, H.
1990-01-01
Finite element field analysis algorithms lend themselves to parallelization and this fact is exploited in this paper to implement a finite element analysis program for electromagnetic field computation on the Sequent Symmetry 81 parallel computer with three processors. In terms of waiting time, the maximum gains are to be made in matrix solution and therefore this paper concentrates on the gains in parallelizing the solution part of finite element analysis. An outline of how parallelization could be exploited in most finite element operations is given in this paper, although the actual implementation of parallelism on the Sequent Symmetry 81 parallel computer was in the sparsity computation, matrix assembly and matrix solution areas. In all cases, the algorithms were modified to suit the parallel programming application rather than allowing the compiler to parallelize existing algorithms.
Identifying failure in a tree network of a parallel computer
Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.
2010-08-24
Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.
A compositional reservoir simulator on distributed memory parallel computers
International Nuclear Information System (INIS)
Rame, M.; Delshad, M.
1995-01-01
This paper presents the application of distributed memory parallel computers to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/860 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes porting to new parallel platforms straightforward. Results of the distributed memory computing performance of the parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for the same problems on a vector supercomputer is also presented.
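The idea of assigning near-equal subdomains and extending each with shared boundary data can be sketched in one dimension. This is a toy model under stated assumptions: a 1D list replaces the reservoir grid, a three-point average replaces the simulator's stencil, and all function names are hypothetical rather than UTCHEM routines.

```python
def split_evenly(n_cells, n_parts):
    """Return (start, stop) ranges whose sizes differ by at most one cell."""
    base, extra = divmod(n_cells, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        stop = start + base + (1 if part < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def stencil(values):
    """Three-point average on interior points; the end values stay fixed."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def stencil_decomposed(values, n_parts):
    """Apply the stencil subdomain by subdomain using one ghost cell per side."""
    n = len(values)
    result = []
    for start, stop in split_evenly(n, n_parts):
        lo = max(start - 1, 0)  # ghost cell copied from the left neighbour
        hi = min(stop + 1, n)   # ghost cell copied from the right neighbour
        local = stencil(values[lo:hi])
        # Keep only the cells this subdomain owns; ghost cells are discarded.
        result.extend(local[start - lo:stop - lo])
    return result
```

Each pass over a subdomain touches only its own cells plus the ghost copies, so the subdomains could be processed on separate processors, and `stencil_decomposed(values, p)` reproduces `stencil(values)` for any number of parts `p`.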
A microeconomic scheduler for parallel computers
Stoica, Ion; Abdel-Wahab, Hussein; Pothen, Alex
1995-01-01
We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.
Mighell, Kenneth John
2011-11-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called CRBLASTER, which does cosmic-ray rejection of CCD (charge-coupled device) images using the embarrassingly-parallel L.A.COSMIC algorithm. CRBLASTER is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly-parallel algorithms. The CRBLASTER source code is freely available at the official application website at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800x800 pixel Hubble Space Telescope WFPC2 image takes 44 seconds with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors. CRBLASTER is 7.4 times faster processing the same image on a single core on the same machine. Processing the same image with CRBLASTER simultaneously on all 8 cores of the same machine takes 0.875 seconds - which is a speedup factor of 50.3 times faster than the IRAF script. A detailed analysis is presented of the performance of CRBLASTER using between 1 and 57 processors on a low-power Tilera 700-MHz 64-core TILE64 processor.
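The timings quoted in this abstract can be cross-checked with a few lines of arithmetic. Only the three input numbers come from the abstract; the derived single-core time, parallel speedup, and efficiency are computed here and are approximate because the factor 7.4 is itself a rounded figure.

```python
iraf_seconds = 44.0          # IRAF script lacos_im.cl on one core
single_core_factor = 7.4     # CRBLASTER on one core relative to the IRAF script
eight_core_seconds = 0.875   # CRBLASTER on all 8 cores

# Derived quantities (not stated in the abstract, so treat them as estimates).
crblaster_one_core_seconds = iraf_seconds / single_core_factor
speedup_vs_iraf = iraf_seconds / eight_core_seconds
parallel_speedup = crblaster_one_core_seconds / eight_core_seconds
parallel_efficiency = parallel_speedup / 8

print(f"CRBLASTER on one core: about {crblaster_one_core_seconds:.1f} s")
print(f"Speedup over the IRAF script on 8 cores: {speedup_vs_iraf:.1f}x")
print(f"Speedup from 1 to 8 cores: about {parallel_speedup:.1f}x")
print(f"Parallel efficiency on 8 cores: about {parallel_efficiency:.0%}")
```

The second printed figure reproduces the factor of 50.3 reported above; the remaining values indicate that most of that factor comes from the faster single-core code, with the eight cores contributing a further speedup of a little under seven.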
History Matching in Parallel Computational Environments
Energy Technology Data Exchange (ETDEWEB)
Steven Bryant; Sanjay Srinivasan; Alvaro Barrera; Sharad Yadav
2005-10-01
A novel methodology for delineating multiple reservoir domains for the purpose of history matching in a distributed computing environment has been proposed. A fully probabilistic approach to perturb permeability within the delineated zones is implemented. The combination of robust schemes for identifying reservoir zones and distributed computing significantly increases the accuracy and efficiency of the probabilistic approach. The information pertaining to the permeability variations in the reservoir that is contained in dynamic data is calibrated in terms of a deformation parameter rD. This information is merged with the prior geologic information in order to generate permeability models consistent with the observed dynamic data as well as the prior geology. The relationship between dynamic response data and reservoir attributes may vary in different regions of the reservoir due to spatial variations in reservoir attributes, well configuration, flow constraints, etc. The probabilistic approach then has to account for multiple rD values in different regions of the reservoir. In order to delineate reservoir domains that can be characterized with different rD parameters, principal component analysis (PCA) of the Hessian matrix has been done. The Hessian matrix summarizes the sensitivity of the objective function at a given step of the history matching to model parameters. It also measures the interaction of the parameters in affecting the objective function. The basic premise of PC analysis is to isolate the most sensitive and least correlated regions. The eigenvectors obtained during the PCA are suitably scaled and appropriate grid block volume cut-offs are defined such that the resultant domains are neither too large (which increases interactions between domains) nor too small (implying ineffective history matching). The delineation of domains requires calculation of the Hessian, which could be computationally costly and also restricts the current approach to
Secure grid-based computing with social-network based trust management in the semantic web
Czech Academy of Sciences Publication Activity Database
Špánek, Roman; Tůma, Miroslav
2006-01-01
Vol. 16, No. 6 (2006), pp. 475-488 ISSN 1210-0552 R&D Projects: GA AV ČR 1ET100300419; GA MŠk 1M0554 Institutional research plan: CEZ:AV0Z10300504 Keywords: semantic web * grid computing * trust management * reconfigurable networks * security * hypergraph model * hypergraph algorithms Subject RIV: IN - Informatics, Computer Science
Parallel computation of geometry control in adaptive truss structures
Ramesh, A. V.; Utku, S.; Wada, B. K.
1992-01-01
The fast computation of geometry control in adaptive truss structures involves two distinct parts: the efficient integration of the inverse kinematic differential equations that govern the geometry control and the fast computation of the Jacobian, which appears on the right-hand side of the inverse kinematic equations. This paper presents an efficient parallel implementation of the Jacobian computation on an MIMD machine. Large speedup from the parallel implementation is obtained, which reduces the Jacobian computation to an O(M-squared/n) procedure on an n-processor machine, where M is the number of members in the adaptive truss. The parallel algorithm given here is a good candidate for on-line geometry control of adaptive structures using attached processors.
Tutorial: Parallel Computing of Simulation Models for Risk Analysis.
Reilly, Allison C; Staid, Andrea; Gao, Michael; Guikema, Seth D
2016-10-01
Simulation models are widely used in risk analysis to study the effects of uncertainties on outcomes of interest in complex problems. Often, these models are computationally complex and time consuming to run. This latter point may be at odds with time-sensitive evaluations or may limit the number of parameters that are considered. In this article, we give an introductory tutorial focused on parallelizing simulation code to better leverage modern computing hardware, enabling risk analysts to better utilize simulation-based methods for quantifying uncertainty in practice. This article is aimed primarily at risk analysts who use simulation methods but do not yet utilize parallelization to decrease the computational burden of these models. The discussion is focused on conceptual aspects of embarrassingly parallel computer code and software considerations. Two complementary examples are shown using the languages MATLAB and R. A brief discussion of hardware considerations is located in the Appendix. © 2016 Society for Risk Analysis.
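The embarrassingly parallel pattern that this tutorial abstract refers to can be sketched as follows. The abstract names MATLAB and R as its example languages; Python is used here only for illustration, and the toy loss model, the function names and the seeding scheme are all assumptions, not taken from the article. The key property is that each replication depends only on its own seed, so replications can be handed to separate worker processes without any communication between them.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def simulate_once(seed):
    """One independent replication of a hypothetical risk model.

    The total of 100 exponentially distributed losses stands in for
    whatever simulation the analyst actually runs.
    """
    rng = random.Random(seed)  # private generator: no shared state between tasks
    return sum(rng.expovariate(1.0) for _ in range(100))


def run_replications(n_runs, workers=4):
    """Run n_runs independent replications, one seed per replication."""
    seeds = range(n_runs)
    if workers <= 1:
        # Serial fallback: same results, useful for checking the parallel run.
        return [simulate_once(s) for s in seeds]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with seeds.
        return list(pool.map(simulate_once, seeds))
```

When the process pool is used, the call to run_replications should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Because each replication is seeded independently, the serial and parallel paths are expected to return the same list.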
Fast computing global structural balance in signed networks based on memetic algorithm
Sun, Yixiang; Du, Haifeng; Gong, Maoguo; Ma, Lijia; Wang, Shanfeng
2014-12-01
Structural balance is a large area of study in signed networks, and it is intrinsically a global property of the whole network. Computing global structural balance in signed networks, which has attracted some attention in recent years, is to measure how unbalanced a signed network is and it is a nondeterministic polynomial-time hard problem. Many approaches are developed to compute global balance. However, the results obtained by them are partial and unsatisfactory. In this study, the computation of global structural balance is solved as an optimization problem by using the Memetic Algorithm. The optimization algorithm, named Meme-SB, is proposed to optimize an evaluation function, energy function, which is used to compute a distance to exact balance. Our proposed algorithm combines Genetic Algorithm and a greedy strategy as the local search procedure. Experiments on social and biological networks show the excellent effectiveness and efficiency of the proposed method.
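As an illustration of what an energy function for structural balance can look like, the sketch below counts frustrated edges, that is, edges whose sign disagrees with the two-camp assignment of their endpoints. This particular form, the edge-list representation and the single-node greedy descent are assumptions made for the example; the abstract does not give the exact function or local search used in Meme-SB.

```python
def energy(edges, s):
    """Number of frustrated edges for a camp assignment s (node -> +1 or -1).

    edges is a list of (i, j, sign) with sign in {+1, -1}. An edge is
    satisfied when sign * s[i] * s[j] == +1, so each term is 0 or 1.
    """
    return sum((1 - sign * s[i] * s[j]) // 2 for i, j, sign in edges)


def greedy_descent(edges, s):
    """Flip single nodes while a flip strictly lowers the energy."""
    s = dict(s)
    improved = True
    while improved:
        improved = False
        for node in s:
            before = energy(edges, s)
            s[node] = -s[node]          # try the flip
            if energy(edges, s) < before:
                improved = True         # keep it
            else:
                s[node] = -s[node]      # undo it
    return s


# Hypothetical three-node examples.
unbalanced = [(0, 1, +1), (1, 2, +1), (0, 2, -1)]   # one negative edge
balanced = [(0, 1, +1), (1, 2, -1), (0, 2, -1)]     # two negative edges
start = {0: 1, 1: 1, 2: 1}
```

A network is exactly balanced when some assignment reaches energy zero. In a memetic scheme, a procedure like greedy_descent would play the role of the local search applied to individuals produced by the genetic operators.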
Dynamic grid refinement for partial differential equations on parallel computers
International Nuclear Information System (INIS)
Mccormick, S.; Quinlan, D.
1989-01-01
The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems. 6 refs
Parallel computation of seismic analysis of high arch dam
Chen, Houqun; Ma, Huaifa; Tu, Jin; Cheng, Guangqing; Tang, Juzhen
2008-03-01
Parallel computation programs are developed for three-dimensional meso-mechanics analysis of fully-graded dam concrete and seismic response analysis of high arch dams (ADs), based on the Parallel Finite Element Program Generator (PFEPG). The computational algorithms of the numerical simulation of the meso-structure of concrete specimens were studied. Taking into account damage evolution, static preload, strain rate effect, and the heterogeneity of the meso-structure of dam concrete, the fracture processes of damage evolution and configuration of the cracks can be directly simulated. In the seismic response analysis of ADs, all the following factors are involved, such as the nonlinear contact due to the opening and slipping of the contraction joints, energy dispersion of the far-field foundation, dynamic interactions of the dam-foundation-reservoir system, and the combining effects of seismic action with all static loads. The correctness, reliability and efficiency of the two parallel computational programs are verified with practical illustrations.
Parallel grid generation algorithm for distributed memory computers
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levels of parallelism on multiple instruction multiple data machines is indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations based on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
IPython: components for interactive and parallel computing across disciplines. (Invited)
Perez, F.; Bussonnier, M.; Frederic, J. D.; Froehle, B. M.; Granger, B. E.; Ivanov, P.; Kluyver, T.; Patterson, E.; Ragan-Kelley, B.; Sailer, Z.
2013-12-01
Scientific computing is an inherently exploratory activity that requires constantly cycling between code, data and results, each time adjusting the computations as new insights and questions arise. To support such a workflow, good interactive environments are critical. The IPython project (http://ipython.org) provides a rich architecture for interactive computing with: 1. Terminal-based and graphical interactive consoles. 2. A web-based Notebook system with support for code, text, mathematical expressions, inline plots and other rich media. 3. Easy to use, high performance tools for parallel computing. Despite its roots in Python, the IPython architecture is designed in a language-agnostic way to facilitate interactive computing in any language. This allows users to mix Python with Julia, R, Octave, Ruby, Perl, Bash and more, as well as to develop native clients in other languages that reuse the IPython clients. In this talk, I will show how IPython supports all stages in the lifecycle of a scientific idea: 1. Individual exploration. 2. Collaborative development. 3. Production runs with parallel resources. 4. Publication. 5. Education. In particular, the IPython Notebook provides an environment for "literate computing" with a tight integration of narrative and computation (including parallel computing). These Notebooks are stored in a JSON-based document format that provides an "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching.
State of the Art in Parallel Computing with R
Directory of Open Access Journals (Sweden)
Markus Schmidberger
2009-06-01
R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly suited to general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems five different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.
Computational acceleration for MR image reconstruction in partially parallel imaging.
Ye, Xiaojing; Chen, Yunmei; Huang, Feng
2011-05-01
In this paper, we present a fast numerical algorithm for solving total variation and ℓ1 (TVL1) based image reconstruction with application in partially parallel magnetic resonance imaging. Our algorithm uses a variable splitting method to reduce computational cost. Moreover, the Barzilai-Borwein step size selection method is adopted in our algorithm for much faster convergence. Experimental results on clinical partially parallel imaging data demonstrate that the proposed algorithm requires far fewer iterations and/or less computational cost than recently developed operator splitting and Bregman operator splitting methods, which can deal with a general sensing matrix in the reconstruction framework, to get similar or even better quality of reconstructed images.
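The Barzilai-Borwein step size mentioned in this abstract can be illustrated on a much simpler problem than TVL1 reconstruction. The sketch below applies the first Barzilai-Borwein formula, alpha = (s.s)/(s.y) with s the change in the iterate and y the change in the gradient, to a two-variable convex quadratic. The test function, starting step and stopping rule are assumptions for the example; the variable splitting and the TVL1 terms of the paper are not reproduced.

```python
def grad(x):
    """Gradient of the toy objective f(x) = 0.5*(x0**2 + 10*x1**2) - x0 - x1."""
    return [x[0] - 1.0, 10.0 * x[1] - 1.0]


def bb_descent(x, iters=200, alpha=0.1, tol=1e-12):
    """Gradient descent whose step length is chosen by the BB1 rule."""
    g = grad(x)
    for _ in range(iters):
        x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]   # change in iterate
        y = [a - b for a, b in zip(g_new, g)]   # change in gradient
        sy = sum(si * yi for si, yi in zip(s, y))
        x, g = x_new, g_new
        # Stop when the gradient is tiny or the curvature estimate is unusable.
        if sum(gi * gi for gi in g) < tol or sy <= 0.0:
            break
        alpha = sum(si * si for si in s) / sy   # BB1 step length
    return x


x_min = bb_descent([0.0, 0.0])
```

The minimizer of this toy objective is (1, 0.1). The point of the rule is that the step length adapts to local curvature using only quantities already available from two successive iterations, with no line search.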
Small file aggregation in a parallel computing system
Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Zhang, Jingwang
2014-09-02
Techniques are provided for small file aggregation in a parallel computing system. An exemplary method for storing a plurality of files generated by a plurality of processes in a parallel computing system comprises aggregating the plurality of files into a single aggregated file; and generating metadata for the single aggregated file. The metadata comprises an offset and a length of each of the plurality of files in the single aggregated file. The metadata can be used to unpack one or more of the files from the single aggregated file.
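The offset-and-length metadata described in this abstract can be sketched in a few lines. The in-memory representation below (a dictionary of names to byte strings, a single bytes object as the aggregated file, and a dictionary as the metadata) is an assumption made for illustration; the patent abstract does not specify a concrete layout.

```python
import io


def aggregate(files):
    """Concatenate many small files into one blob and record where each landed.

    files maps a name to its bytes. Returns (blob, index), where index maps
    each name to an (offset, length) pair inside the blob.
    """
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))  # offset before writing, then length
        blob.write(data)
    return blob.getvalue(), index


def unpack(blob, index, name):
    """Recover one original file from the aggregated blob using the metadata."""
    offset, length = index[name]
    return blob[offset:offset + length]


# Hypothetical per-process output files.
small_files = {"rank0.out": b"hello", "rank1.out": b"", "rank2.out": b"xyz"}
blob, index = aggregate(small_files)
```

Only the index is needed to extract a single file, so one file can be unpacked without reading the rest of the aggregate.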
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2016-03-15
Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
Requirements for supercomputing in energy research: The transition to massively parallel computing
Energy Technology Data Exchange (ETDEWEB)
1993-02-01
This report discusses: The emergence of a practical path to TeraFlop computing and beyond; requirements of energy research programs at DOE; implementation: supercomputer production computing environment on massively parallel computers; and implementation: user transition to massively parallel computing.
A design methodology for portable software on parallel computers
Nicol, David M.; Miller, Keith W.; Chrisman, Dan A.
1993-01-01
This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. Trying to keep one's software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks is specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured
Algorithms for computational fluid dynamics on parallel processors
International Nuclear Information System (INIS)
Van de Velde, E.F.
1986-01-01
A study of parallel algorithms for the numerical solution of partial differential equations arising in computational fluid dynamics is presented. The actual implementation on parallel processors of shared and nonshared memory design is discussed. The performance of these algorithms is analyzed in terms of machine efficiency, communication time, bottlenecks and software development costs. For elliptic equations, a parallel preconditioned conjugate gradient method is described, which has been used to solve pressure equations discretized with high order finite elements on irregular grids. A parallel full multigrid method and a parallel fast Poisson solver are also presented. Hyperbolic conservation laws were discretized with parallel versions of finite difference methods like the Lax-Wendroff scheme and with the Random Choice method. Techniques are developed for comparing the behavior of an algorithm on different architectures as a function of problem size and local computational effort. Effective use of these advanced architecture machines requires the use of machine dependent programming. It is shown that the portability problems can be minimized by introducing high level operations on vectors and matrices structured into program libraries
Mighell, Kenneth John
2010-10-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the
Analysis of multigrid methods on massively parallel computers: Architectural implications
Matheson, Lesley R.; Tarjan, Robert E.
1993-01-01
We study the potential performance of multigrid algorithms running on massively parallel computers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10^6 and 10^9, respectively. Our models of parallel computation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages (up to 1000 words), or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.
A Novel Parallel Algorithm for Edit Distance Computation
Directory of Open Access Journals (Sweden)
Muhammad Murtaza Yousaf
2018-01-01
The edit distance between two sequences is the minimum number of weighted transformation-operations that are required to transform one string into the other. The weighted transformation-operations are insert, remove, and substitute. A dynamic programming solution to find the edit distance exists, but it becomes computationally intensive when the lengths of the strings become very large. This work presents a novel parallel algorithm to solve the edit distance problem of string matching. The algorithm is based on resolving dependencies in the dynamic programming solution of the problem and it is able to compute each row of the edit distance table in parallel. In this way, it becomes possible to compute the complete table in min(m,n) iterations for strings of size m and n, whereas the state-of-the-art parallel algorithm solves the problem in max(m,n) iterations. The proposed algorithm also increases the amount of parallelism in each of its iterations. The algorithm is also capable of exploiting spatial locality in its implementation. Additionally, the algorithm works in a load-balanced way that further improves its performance. The algorithm is implemented for multicore systems having shared memory. Implementation of the algorithm in OpenMP shows linear speedup and better execution time as compared to the state-of-the-art parallel approach. The efficiency of the algorithm is also shown to be better than that of its competitor.
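For orientation, the sequential dynamic programming solution that this abstract takes as its starting point can be sketched as below. This is the textbook row-by-row recurrence with unit weights for insert, remove and substitute (the unit weights are an assumption), and it is not the parallel algorithm proposed in the paper. It does show the dependency that any parallel version has to resolve: each cell needs its left neighbor in the same row as well as two cells from the previous row.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between sequences a and b, keeping two rows."""
    prev = list(range(len(b) + 1))          # row 0: build a prefix of b from nothing
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # column 0: delete the first i items of a
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else 1)
            remove = prev[j] + 1            # drop ca
            insert = curr[j - 1] + 1        # depends on the cell just computed
            curr.append(min(substitute, remove, insert))
        prev = curr
    return prev[-1]
```

The `curr[j - 1]` term is the within-row dependency; removing or restructuring it is what allows a whole row to be computed at once.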
The science of computing - The evolution of parallel processing
Denning, P. J.
1985-01-01
The present paper is concerned with the approaches to be employed to overcome the set of limitations in software technology which currently impedes the effective use of parallel hardware technology. The process required to solve the arising problems is found to involve four different stages. At the present time, Stage One is nearly finished, while Stage Two is under way. Tentative explorations are beginning on Stage Three, and Stage Four is more distant. In Stage One, parallelism is introduced into the hardware of a single computer, which consists of one or more processors, a main storage system, a secondary storage system, and various peripheral devices. In Stage Two, parallel execution of cooperating programs on different machines becomes explicit, while in Stage Three, new languages will make parallelism implicit. In Stage Four, there will be very high level user interfaces capable of interacting with scientists at the same level of abstraction as scientists do with each other.
General-purpose parallel simulator for quantum computing
International Nuclear Information System (INIS)
Niwa, Jumpei; Matsumoto, Keiji; Imai, Hiroshi
2002-01-01
With current technologies, it seems to be very difficult to implement quantum computers with many qubits. It is therefore of importance to simulate quantum algorithms and circuits on the existing computers. However, for a large-size problem, the simulation often requires more computational power than is available from sequential processing. Therefore, simulation methods for parallel processors are required. We have developed a general-purpose simulator for quantum algorithms/circuits on the parallel computer (Sun Enterprise4500). It can simulate algorithms/circuits with up to 30 qubits. In order to test efficiency of our proposed methods, we have simulated Shor's factorization algorithm and Grover's database search, and we have analyzed robustness of the corresponding quantum circuits in the presence of both decoherence and operational errors. The corresponding results, statistics, and analyses are presented in this paper
Parallel algorithms for computation of the manipulator inertia matrix
Amin-Javaheri, Masoud; Orin, David E.
1989-01-01
The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.
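Recursive doubling, the technique named in this abstract, can be illustrated on a scalar first-order linear recurrence x[i] = a[i]*x[i-1] + b[i]. The sketch below is an assumption-laden toy: the manipulator algorithm works on spatial quantities rather than scalars, and the function name and data layout are invented for the example. Each step of the recurrence is treated as an affine map, and maps are composed pairwise at distances 1, 2, 4, and so on, so that about log2(N) passes suffice.

```python
def recurrence_by_doubling(a, b, x_init):
    """Solve x[i] = a[i] * x[i-1] + b[i], with x[-1] = x_init, by recursive doubling.

    coef[i] = (A, B) represents the affine map x -> A*x + B. After the pass
    with distance d, coef[i] is the composition of up to 2*d consecutive steps
    ending at step i.
    """
    n = len(a)
    coef = list(zip(a, b))
    d = 1
    while d < n:
        # Every entry of the new list reads only the old list, so on a
        # parallel machine all i could be updated simultaneously.
        coef = [
            (coef[i][0] * coef[i - d][0],
             coef[i][0] * coef[i - d][1] + coef[i][1]) if i >= d else coef[i]
            for i in range(n)
        ]
        d *= 2
    return [A * x_init + B for A, B in coef]


# Hypothetical coefficients.
xs = recurrence_by_doubling([2, 3, 4, 5, 6], [1, 1, 1, 1, 1], 1)
```

The serial loop needs N dependent steps; here the while loop runs roughly log2(N) times, which is where the logarithmic depth comes from, at the cost of more total arithmetic.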
Quantum neural network-based EEG filtering for a brain-computer interface.
Gandhi, Vaibhav; Prasad, Girijesh; Coyle, Damien; Behera, Laxmidhar; McGinnity, Thomas Martin
2014-02-01
A novel neural information processing architecture inspired by quantum mechanics and incorporating the well-known Schrödinger wave equation is proposed in this paper. The proposed architecture, referred to as recurrent quantum neural network (RQNN), can characterize a nonstationary stochastic signal as time-varying wave packets. A robust unsupervised learning algorithm enables the RQNN to effectively capture the statistical behavior of the input signal and facilitates the estimation of signal embedded in noise with unknown characteristics. The results from a number of benchmark tests show that simple signals such as dc, staircase dc, and sinusoidal signals embedded within high noise can be accurately filtered and particle swarm optimization can be employed to select model parameters. The RQNN filtering procedure is applied in a two-class motor imagery-based brain-computer interface where the objective was to filter electroencephalogram (EEG) signals before feature extraction and classification to increase signal separability. A two-step inner-outer fivefold cross-validation approach is utilized to select the algorithm parameters subject-specifically for nine subjects. It is shown that the subject-specific RQNN EEG filtering significantly improves brain-computer interface performance compared to using only the raw EEG or Savitzky-Golay filtered EEG across multiple sessions.
Hao, Stephanie; Subramanian, Sandya; Jordan, Austin; Santaniello, Sabato; Yaffe, Robert; Jouny, Christophe C; Bergey, Gregory K; Anderson, William S; Sarma, Sridevi V
2014-01-01
The surgical resection of the epileptogenic zone (EZ) is the only effective treatment for many drug-resistant epilepsy (DRE) patients, but the pre-surgical identification of the EZ is challenging. This study investigates whether the EZ exhibits a computationally identifiable signature during seizures. In particular, we compute statistics of the brain network from intracranial EEG (iEEG) recordings and track the evolution of network connectivity before, during, and after seizures. We define each node in the network as an electrode and weight each edge connecting a pair of nodes by the gamma band cross power of the corresponding iEEG signals. The eigenvector centrality (EVC) of each node is tracked over two seizures per patient and the electrodes are ranked according to the corresponding EVC value. We hypothesize that electrodes covering the EZ have a signature EVC rank evolution during seizure that differs from electrodes outside the EZ. We tested this hypothesis on multi-channel iEEG recordings from 2 DRE patients who had successful surgery (i.e., seizures were under control with or without medications) and 1 patient who had unsuccessful surgery. In the successful cases, we assumed that the resected region contained the EZ and found that the EVC rank evolution of the electrodes within the resected region had a distinct "arc" signature, i.e., the EZ ranks first rose together shortly after seizure onset and then fell later during seizure.
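The eigenvector centrality ranking described in this abstract can be sketched with a plain power iteration. The weight matrix below is a made-up four-electrode example standing in for gamma band cross power values, and the iteration on W plus the identity (a shift that keeps the same eigenvectors while avoiding oscillation) is an implementation choice of this sketch, not something stated in the abstract.

```python
def eigenvector_centrality(W, iters=500, tol=1e-12):
    """Leading eigenvector of a symmetric, nonnegative weight matrix W.

    Uses power iteration on W + I and normalizes the result to sum to 1.
    """
    n = len(W)
    v = [1.0 / n] * n
    for _ in range(iters):
        w = [v[i] + sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
        if max(abs(x - y) for x, y in zip(w, v)) < tol:
            return w
        v = w
    return v


def rank_nodes(centrality):
    """Node indices ordered from most to least central."""
    return sorted(range(len(centrality)), key=lambda i: -centrality[i])


# Hypothetical symmetric connectivity between four electrodes.
W = [
    [0.0, 3.0, 2.0, 1.0],
    [3.0, 0.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
evc = eigenvector_centrality(W)
ranking = rank_nodes(evc)
```

In the study's setting this computation would be repeated for each time window, and the rank of each electrode would be tracked over the course of the seizure.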
Measuring performance of parallel computers. Progress report, 1989
Energy Technology Data Exchange (ETDEWEB)
Sullivan, F.
1994-07-01
Performance Measurement - the authors have developed a taxonomy of parallel algorithms based on data motion and example applications have been coded for each class of the taxonomy. Computational benchmark kernels have been extracted for several applications, and detailed measurements have been performed. Algorithms for Massively Parallel SIMD machines - measurement results and computational experiences indicate that top performance will be achieved by `iteration` type algorithms running on massively parallel SIMD machines. Reformulation as iteration may entail unorthodox approaches based on probabilistic methods. The authors have developed such methods for some applications. Here they discuss their approach to performance measurement, describe the taxonomy and measurements which have been made, and report on some general conclusions which can be drawn from the results of the measurements.
Image processing with massively parallel computer Quadrics Q1
International Nuclear Information System (INIS)
Della Rocca, A.B.; La Porta, L.; Ferriani, S.
1995-05-01
To evaluate the image processing capabilities of the massively parallel computer Quadrics Q1, a convolution algorithm was implemented and is described in this report. First, the mathematical definition of discrete convolution is recalled, together with the main Q1 hardware and software features. Then the different coded forms of the algorithm are described, and the performance of the Q1 is compared with that obtained on other computers. Finally, the conclusions report the main results and suggestions.
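The discrete convolution that this report takes as its benchmark can be written directly from the definition. The sketch below is a generic two-dimensional version restricted to the region where the kernel fits entirely inside the image; it is an illustration only, with assumed names and a made-up test image, and says nothing about how the computation was mapped onto the Q1.

```python
def convolve2d(image, kernel):
    """'Valid' 2-D discrete convolution of nested-list arrays.

    out[r][c] = sum over i, j of kernel[i][j] * image[r + kh-1-i][c + kw-1-j],
    i.e. the kernel is flipped in both directions, as the definition requires.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + kh - 1 - i][c + kw - 1 - j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 3x3 image and 2x2 kernel.
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ker = [[1, 0], [0, -1]]
result = convolve2d(img, ker)
```

Every output pixel is computed from the input alone and never from another output pixel, which is why convolution is a natural fit for massively parallel machines.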
WEKA-G: Parallel data mining on computational grids
Directory of Open Access Journals (Sweden)
PIMENTA, A.
2009-12-01
Data mining is a technology that can extract useful information from large amounts of data. However, mining a database often requires high computational power. To address this problem, this paper presents a tool (Weka-G) that runs in parallel the algorithms used in the data mining process. As the environment for doing so, we use a computational grid by adding several features within a WAN.
Fencing data transfers in a parallel active messaging interface of a parallel computer
Blocksome, Michael A.; Mamidala, Amith R.
2015-06-02
Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
On the efficient parallel computation of Legendre transforms
Inda, M.A.; Bisseling, R.H.; Maslen, D.K.
2001-01-01
In this article, we discuss a parallel implementation of efficient algorithms for computation of Legendre polynomial transforms and other orthogonal polynomial transforms. We develop an approach to the Driscoll-Healy algorithm using polynomial arithmetic and present experimental results on the
On the efficient parallel computation of Legendre transforms
Inda, M.A.; Bisseling, R.H.; Maslen, D.K.
1999-01-01
In this article we discuss a parallel implementation of efficient algorithms for computation of Legendre polynomial transforms and other orthogonal polynomial transforms. We develop an approach to the Driscoll-Healy algorithm using polynomial arithmetic and present experimental results on the
Hardware packet pacing using a DMA in a parallel computer
Chen, Dong; Heidelberger, Phillip; Vranas, Pavlos
2013-08-13
Method and system for hardware packet pacing using a direct memory access controller in a parallel computer which, in one aspect, keeps track of a total number of bytes put on the network as a result of a remote get operation, using a hardware token counter.
Computationally efficient implementation of combustion chemistry in parallel PDF calculations
International Nuclear Information System (INIS)
Lu Liuyan; Lantz, Steven R.; Ren Zhuyin; Pope, Stephen B.
2009-01-01
In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel
Computationally efficient implementation of combustion chemistry in parallel PDF calculations
Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.
2009-08-01
In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel
Abstract Computation in Schizophrenia Detection through Artificial Neural Network Based Systems
Directory of Open Access Journals (Sweden)
L. Cardoso
2015-01-01
Schizophrenia stands for a long-lasting state of mental uncertainty that may bring to an end the relation among behavior, thought, and emotion; that is, it may lead to unreliable perception, not suitable actions and feelings, and a sense of mental fragmentation. Indeed, its diagnosis is done over a large period of time; continuous signs of the disturbance persist for at least 6 (six) months. Once detected, the psychiatrist diagnosis is made through the clinical interview and a series of psychic tests, addressed mainly to avoid the diagnosis of other mental states or diseases. Undeniably, the main problem with identifying schizophrenia is the difficulty to distinguish its symptoms from those associated to different untidiness or roles. Therefore, this work will focus on the development of a diagnostic support system, in terms of its knowledge representation and reasoning procedures, based on a blend of Logic Programming and Artificial Neural Networks approaches to computing, taking advantage of a novel approach to knowledge representation and reasoning, which aims to solve the problems associated in the handling (i.e., to stand for and reason) of defective information.
Abstract computation in schizophrenia detection through artificial neural network based systems.
Cardoso, L; Marins, F; Magalhães, R; Marins, N; Oliveira, T; Vicente, H; Abelha, A; Machado, J; Neves, J
2015-01-01
Schizophrenia stands for a long-lasting state of mental uncertainty that may bring to an end the relation among behavior, thought, and emotion; that is, it may lead to unreliable perception, not suitable actions and feelings, and a sense of mental fragmentation. Indeed, its diagnosis is done over a large period of time; continuous signs of the disturbance persist for at least 6 (six) months. Once detected, the psychiatrist diagnosis is made through the clinical interview and a series of psychic tests, addressed mainly to avoid the diagnosis of other mental states or diseases. Undeniably, the main problem with identifying schizophrenia is the difficulty to distinguish its symptoms from those associated to different untidiness or roles. Therefore, this work will focus on the development of a diagnostic support system, in terms of its knowledge representation and reasoning procedures, based on a blend of Logic Programming and Artificial Neural Networks approaches to computing, taking advantage of a novel approach to knowledge representation and reasoning, which aims to solve the problems associated in the handling (i.e., to stand for and reason) of defective information.
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Mundy, Michael B.
2015-07-21
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregating each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.
Directory of Open Access Journals (Sweden)
Allan R. Larrabee
1993-01-01
The first digital computers consisted of a single processor acting on a single stream of data. In this so-called "von Neumann" architecture, computation speed is limited mainly by the time required to transfer data between the processor and memory. This limiting factor has been referred to as the "von Neumann bottleneck". The concern that the miniaturization of silicon-based integrated circuits will soon reach theoretical limits of size and gate times has led to increased interest in parallel architectures and also spurred research into alternatives to silicon-based implementations of processors. Meanwhile, sequential processors continue to be produced that have increased clock rates and an increase in memory locally available to a processor, and an increase in the rate at which data can be transferred to and from memories, networks, and remote storage. The efficiency of compilers and operating systems is also improving over time. Although such characteristics limit maximum performance, a large improvement in the speed of scientific computations can often be achieved by utilizing more efficient algorithms, particularly those that support parallel computation. This work discusses experiences with two tools for large grain (or "macro task") parallelism.
Parallelization of ITS on an IBM SP2 computer
International Nuclear Information System (INIS)
Monte Carlo simulation of complex phenomena that occur in coupled electron-photon radiation transport can tax even the most powerful serial computers. Depending on the number of materials and the required resolution, there is a need to use up to millions of histories to adequately represent three-dimensional transport. The greatest limitation of most Monte Carlo simulations is the large computational time and computing memory required for very large scale problems. Emergence of new parallel computers, such as the IBM SP2, may lessen this difficulty for radiation transport simulation. This study presents the implementation of the Integrated Tiger Series (ITS) of codes on the IBM SP2 using the PVMe message-passing library to achieve parallel efficiencies as high as 98% on eight computer nodes. ITS is a general purpose state-of-the-art Monte Carlo method for solving linear, time-independent, coupled electron-photon radiation transport problems with or without the presence of external electric and magnetic fields. A batch decomposition approach has been taken toward parallelization of the code. In ITS, the particle histories are divided into "batches" of equal size and the evaluation of the estimated quantities is performed using batch-averaged sample statistics.
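The batch decomposition described in this abstract can be illustrated with a minimal Python sketch. It is not the ITS implementation: the function name `batch_estimate`, the per-batch seeding scheme, and the `score_history` callback are hypothetical, and the batches are evaluated in a plain loop where a parallel code would hand each batch to a different node.

```python
import random
import statistics


def batch_estimate(score_history, n_histories, n_batches, seed=0):
    """Estimate a mean score and its standard error from equal-size batches.

    score_history(rng) is a hypothetical callback that simulates one particle
    history and returns its score. n_histories is assumed to be divisible by
    n_batches so that all batches have equal size.
    """
    per_batch = n_histories // n_batches
    batch_means = []
    for batch in range(n_batches):
        # Each batch gets its own random stream, so batches are independent
        # and could be evaluated on separate nodes.
        rng = random.Random(seed + batch)
        total = sum(score_history(rng) for _ in range(per_batch))
        batch_means.append(total / per_batch)
    # Batch-averaged sample statistics: mean of the batch means and the
    # standard error of that mean.
    mean = statistics.fmean(batch_means)
    std_err = statistics.stdev(batch_means) / n_batches ** 0.5
    return mean, std_err
```

Because only the batch means are combined at the end, the communication per batch is a single number, which is one plausible reason such a decomposition can reach high parallel efficiency.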
Parallel peak pruning for scalable SMP contour tree computation
Energy Technology Data Exchange (ETDEWEB)
Carr, Hamish A. [Univ. of Leeds (United Kingdom); Weber, Gunther H. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Davis, CA (United States); Sewell, Christopher M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Ahrens, James P. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2017-03-09
As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. Here in this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.
Computing NLTE Opacities -- Node Level Parallel
Energy Technology Data Exchange (ETDEWEB)
Holladay, Daniel [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2015-09-11
Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
The skyline operator returns records in a dataset that provide optimal trade-offs of multiple dimensions. State-of-the-art skyline computation involves complex tree traversals, data-ordering, and conditional branching to minimize the number of point-to-point comparisons. Meanwhile, GPGPU computing...... offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses...
Parallel computing in genomic research: advances and applications.
Ocaña, Kary; de Oliveira, Daniel
2015-01-01
Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities.
High spatial resolution CT image reconstruction using parallel computing
International Nuclear Information System (INIS)
Yin Yin; Liu Li; Sun Gongxing
2003-01-01
Using a PC cluster system with 16 dual-CPU nodes, we accelerate the FBP and OR-OSEM reconstruction of high spatial resolution images (2048 x 2048). Based on the number of projections, we rewrite the reconstruction algorithms into parallel format and dispatch the tasks to each CPU. With parallel computing, the speedup factor is roughly equal to the number of CPUs, and can reach about 25 times when 25 CPUs are used. This technique is very suitable for real-time high spatial resolution CT image reconstruction. (authors)
Parallelization of the NASA Goddard Cumulus Ensemble Model for Massively Parallel Computing
Directory of Open Access Journals (Sweden)
Hann-Ming Henry Juang
2007-01-01
Massively parallel computing, using a message passing interface (MPI), has been implemented into a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The implementation uses the domain-resemble concept to design a code structure for both the whole domain and sub-domains after decomposition. Instead of inserting a group of MPI related statements into the model routine, these statements are packed into a single routine. In other words, only a single call statement to the model code is utilized once in a place, thus there is minimal impact on the original code. Therefore, the model is easily modified and/or managed by the model developers and/or users, who have little knowledge of massively parallel computing.
The 2nd Symposium on the Frontiers of Massively Parallel Computations
Mills, Ronnie (Editor)
1988-01-01
Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.
Distributed parallel computing in stochastic modeling of groundwater systems.
Dong, Yanhui; Li, Guomin; Xu, Haizhen
2013-03-01
Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling. © 2012, The Author(s). Groundwater © 2012, National Ground Water Association.
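The figures quoted in this abstract can be turned into a speedup and a parallel efficiency with a short calculation. The sketch below assumes that "reduced to 3%" means the parallel run takes 3% of the serial execution time; the function name `speedup_and_efficiency` is hypothetical.

```python
def speedup_and_efficiency(parallel_time_fraction, workers):
    """Return (speedup, efficiency) for a parallel run.

    parallel_time_fraction is the parallel wall time divided by the serial
    wall time (assumed here to be 0.03 for the reported 3%).
    """
    speedup = 1.0 / parallel_time_fraction
    efficiency = speedup / workers
    return speedup, efficiency


# 50 processing threads, parallel time assumed to be 3% of serial time.
speedup, efficiency = speedup_and_efficiency(0.03, 50)
print(f"speedup ~ {speedup:.1f}x, efficiency ~ {efficiency:.0%}")
```

Under that reading, 50 threads give a speedup of about 33x, which corresponds to roughly two-thirds parallel efficiency.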
A discrete ordinate response matrix method for massively parallel computers
International Nuclear Information System (INIS)
Hanebutte, U.R.; Lewis, E.E.
1991-01-01
A discrete ordinate response matrix method is formulated for the solution of neutron transport problems on massively parallel computers. The response matrix formulation eliminates iteration on the scattering source. The nodal matrices which result from the diamond-differenced equations are utilized in a factored form which minimizes memory requirements and significantly reduces the required number of arithmetic operations. The algorithm utilizes massive parallelism by assigning each spatial node to a processor. The algorithm is accelerated effectively by a synthetic method in which the low-order diffusion equations are also solved by massively parallel red/black iterations. The method has been implemented on a 16k Connection Machine-2, and S8 and S16 solutions have been obtained for fixed-source benchmark problems in X-Y geometry.
Implementation of PHENIX trigger algorithms on massively parallel computers
International Nuclear Information System (INIS)
Petridis, A.N.; Wohn, F.K.
1995-01-01
The event selection requirements of contemporary high energy and nuclear physics experiments are met by the introduction of on-line trigger algorithms which identify potentially interesting events and reduce the data acquisition rate to levels that are manageable by the electronics. Such algorithms, being parallel in nature, can be simulated off-line using massively parallel computers. The PHENIX experiment intends to investigate the possible existence of a new phase of matter called the quark gluon plasma, which has been theorized to have existed in very early stages of the evolution of the universe, by studying collisions of heavy nuclei at ultra-relativistic energies. Such interactions can also reveal important information regarding the structure of the nucleus and mandate a thorough investigation of the simpler proton-nucleus collisions at the same energies. The complexity of PHENIX events and the need to analyze and also simulate them at rates similar to the data collection ones impose enormous computation demands. This work is a first effort to implement PHENIX trigger algorithms on parallel computers and to study the feasibility of using such machines to run the complex programs necessary for the simulation of the PHENIX detector response. Fine and coarse grain approaches have been studied and evaluated. Depending on the application, the performance of a massively parallel computer can be much better or much worse than that of a serial workstation. A comparison between single instruction and multiple instruction computers is also made and possible applications of the single instruction machines to high energy and nuclear physics experiments are outlined. © 1995 American Institute of Physics.
Executing a gather operation on a parallel computer
Archer, Charles J [Rochester, MN; Ratterman, Joseph D [Rochester, MN
2012-03-20
Methods, apparatus, and computer program products are disclosed for executing a gather operation on a parallel computer according to embodiments of the present invention. Embodiments include configuring, by the logical root, a result buffer on the logical root, the result buffer having positions, each position corresponding to a ranked node in the operational group and for storing contribution data gathered from that ranked node. Embodiments also include repeatedly for each position in the result buffer: determining, by each compute node of an operational group, whether the current position in the result buffer corresponds with the rank of the compute node; if the current position in the result buffer corresponds with the rank of the compute node, contributing, by that compute node, the compute node's contribution data; if the current position in the result buffer does not correspond with the rank of the compute node, contributing, by that compute node, a value of zero for the contribution data; and storing, by the logical root in the current position in the result buffer, results of a bitwise OR operation of all the contribution data by all compute nodes of the operational group for the current position, the results received through the global combining network.
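The gather-by-OR idea in this abstract can be simulated on a single machine with a minimal Python sketch. It is an illustration only: the function name `gather_by_or` is hypothetical, contribution data are modeled as non-negative integers, and `functools.reduce` stands in for the global combining network.

```python
from functools import reduce
from operator import or_


def gather_by_or(contributions):
    """Simulate a gather built from bitwise OR reductions.

    contributions[rank] is the non-negative integer held by the compute node
    of that rank. The return value models the logical root's result buffer.
    """
    n_nodes = len(contributions)
    result_buffer = [0] * n_nodes
    for position in range(n_nodes):
        # Every node takes part in every step: the node whose rank matches
        # the current position offers its data, all others offer zero.
        offered = [
            data if rank == position else 0
            for rank, data in enumerate(contributions)
        ]
        # OR-ing with zero leaves a value unchanged, so the reduction
        # delivers exactly the matching node's data to the root.
        result_buffer[position] = reduce(or_, offered, 0)
    return result_buffer


print(gather_by_or([5, 0, 9, 12]))  # [5, 0, 9, 12]
```

The design relies on zero being the identity element of bitwise OR, which is why a reduction network can be reused to move one node's data per step.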
A Parallel Computational Model for Multichannel Phase Unwrapping Problem
Imperatore, Pasquale; Pepe, Antonio; Lanari, Riccardo
2015-05-01
In this paper, a parallel model for the solution of the computationally intensive multichannel phase unwrapping (MCh-PhU) problem is proposed. Firstly, the Extended Minimum Cost Flow (EMCF) algorithm for solving the MCh-PhU problem is revised within the rigorous mathematical framework of the discrete calculus, which makes it possible to capture its topological structure in terms of meaningful discrete differential operators. Secondly, emphasis is placed on those methodological and practical aspects which lead to a parallel reformulation of the EMCF algorithm. Thus, a novel dual-level parallel computational model, in which the parallelism is hierarchically implemented at two different (i.e., process and thread) levels, is presented. The validity of our approach has been demonstrated through a series of experiments that have revealed a significant speedup. Therefore, the attained high-performance prototype is suitable for the solution of large-scale phase unwrapping problems in reasonable time frames, with a significant impact on the systematic exploitation of the existing, and rapidly growing, large archives of SAR data.
Establishing a group of endpoints in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.; Xue, Hanhong
2016-02-02
A parallel computer executes a number of tasks; each task includes a number of endpoints, and the endpoints are configured to support collective operations. In such a parallel computer, establishing a group of endpoints includes: receiving a user specification of a set of endpoints included in a global collection of endpoints, where the user specification defines the set in accordance with a predefined virtual representation of the endpoints, the predefined virtual representation is a data structure setting forth an organization of tasks and endpoints included in the global collection of endpoints and the user specification defines the set of endpoints without a user specification of a particular endpoint; and defining a group of endpoints in dependence upon the predefined virtual representation of the endpoints and the user specification.
Performing a local reduction operation on a parallel computer
Blocksome, Michael A.; Faraj, Daniel A.
2012-12-11
A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.
Final Report: Center for Programming Models for Scalable Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Mellor-Crummey, John [William Marsh Rice University
2011-09-13
As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.
Development of unstructured mesh generator on parallel computers
International Nuclear Information System (INIS)
Muramatsu, Kazuhiro; Shimada, Akio; Murakami, Hiroyuki; Higashida, Akihiro; Wakatsuki, Shigeto
2000-01-01
A general-purpose unstructured mesh generator, 'GRID3D/UNST', has been developed on parallel computers. High-speed operations and large-scale memory capacity of parallel computers enable the system to generate a large-scale mesh at high speed. As a matter of fact, the system generates a large-scale mesh composed of 2,400,000 nodes and 14,000,000 elements in about 1.5 hours on a HITACHI SR2201 with 64 PEs (Processing Elements), after a 2.5-hour pre-process on a SUN. The system is also built on standard FORTRAN, C and Motif, and therefore has high portability. The system enables us to solve large-scale problems that were previously impossible to solve, and to break new ground in the field of science and engineering. (author)
Improved algorithms for mapping pipelined and parallel computations
Nicol, David M.; O'Hallaron, David R.
1991-01-01
Recent work on the problem of mapping pipelined or parallel computations onto linear array, shared memory, and host-satellite systems is extended. It is shown how these problems can be solved even more efficiently when computation module execution times are bounded from below, intermodule communication times are bounded from above, and the processors satisfy certain homogeneity constraints. The improved algorithms have significantly lower time and space complexities than the more general algorithms: in one case, an O(nm^3) time algorithm for mapping m modules onto n processors is replaced with an O(nm log m) time algorithm, and the space requirements are reduced from O(nm^2) to O(m). Run-time complexity is reduced further with parallel mapping algorithms based on these improvements, which run on the architectures for which they create mappings.
Fast Evaluation of Segmentation Quality with Parallel Computing
Directory of Open Access Journals (Sweden)
Henry Cruz
2017-01-01
In digital image processing and computer vision, a fairly frequent task is the performance comparison of different algorithms on enormous image databases. This task is usually time-consuming and tedious, such that any kind of tool to simplify this work is welcome. To achieve an efficient and more practical handling of a normally tedious evaluation, we implemented the automatic detection system with the help of MATLAB®’s Parallel Computing Toolbox™. The key parts of the system have been parallelized to achieve simultaneous execution and analysis of segmentation algorithms on the one hand and the evaluation of detection accuracy for the nonforested regions, as a case study, on the other hand. As a positive side effect, CPU usage was reduced and processing time was significantly decreased by 68.54% compared to sequential processing (i.e., executing the system with each algorithm one by one).
Parallel fast multipole boundary element method applied to computational homogenization
Ptaszny, Jacek
2018-01-01
In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for 3D elasticity problems are developed and applied to the computational homogenization of a solid containing spherical voids. The system of equations is solved by using the GMRES iterative solver. The boundary of the body is discretized by using quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of number of threads is examined.
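The task assignment assumed in this abstract, partitioning tree nodes into disjoint sets of equal or approximately equal size, can be sketched in Python as follows. This is an illustration of the partitioning step only, not the author's OpenMP code; the function name `partition_evenly` and the use of integer node identifiers are hypothetical.

```python
def partition_evenly(items, n_parts):
    """Split items into n_parts disjoint, contiguous chunks.

    Chunk sizes differ by at most one: the first len(items) % n_parts
    chunks receive one extra item.
    """
    base, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


# Ten hypothetical tree-node identifiers shared among four threads.
chunks = partition_evenly(list(range(10)), 4)
print([len(chunk) for chunk in chunks])  # [3, 3, 2, 2]
```

Such a static split balances the load only if each tree node carries a similar amount of work, which is the assumption the abstract states.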
Mechatronic Model Based Computed Torque Control of a Parallel Manipulator
Directory of Open Access Journals (Sweden)
Zhiyong Yang
2008-03-01
With their high speed and accuracy, parallel manipulators have wide application in industry, but many difficulties remain in the actual control process because of time-varying and coupling effects. Unfortunately, present-day commercial controllers cannot provide satisfactory performance because they offer single-axis linear control only. Therefore, for a novel 2-DOF (Degree of Freedom) parallel manipulator called Diamond 600, a control scheme based on a motor-mechanism coupling dynamic model and employing the computed torque control algorithm is presented in this paper. First, the integrated dynamic coupling model is derived from the equivalent torques between the mechanical structure and the PM (Permanent Magnet) servomotor. Second, the computed torque controller is described in detail for the proposed model. Finally, a series of numerical simulations and experiments are carried out to test the effectiveness of the system, and the results verify the favourable tracking ability and robustness.
Mechatronic Model Based Computed Torque Control of a Parallel Manipulator
Directory of Open Access Journals (Sweden)
Zhiyong Yang
2008-11-01
With their high speed and accuracy, parallel manipulators have wide application in industry, but many difficulties remain in the actual control process because of time-varying and coupling effects. Unfortunately, present-day commercial controllers cannot provide satisfactory performance because they offer single-axis linear control only. Therefore, for a novel 2-DOF (Degree of Freedom) parallel manipulator called Diamond 600, a control scheme based on a motor-mechanism coupling dynamic model and employing the computed torque control algorithm is presented in this paper. First, the integrated dynamic coupling model is derived from the equivalent torques between the mechanical structure and the PM (Permanent Magnet) servomotor. Second, the computed torque controller is described in detail for the proposed model. Finally, a series of numerical simulations and experiments are carried out to test the effectiveness of the system, and the results verify the favourable tracking ability and robustness.
Noise simulation in cone beam CT imaging with parallel computing
International Nuclear Information System (INIS)
Tu, S.-J.; Shaw, Chris C; Chen, Lingyun
2006-01-01
We developed a computer noise simulation model for cone beam computed tomography imaging using a general purpose PC cluster. This model uses a mono-energetic x-ray approximation and allows us to investigate three primary performance components, specifically quantum noise, detector blurring and additive system noise. A parallel random number generator based on the Weyl sequence was implemented in the noise simulation and a visualization technique was accordingly developed to validate the quality of the parallel random number generator. In our computer simulation model, three-dimensional (3D) phantoms were mathematically modelled and used to create 450 analytical projections, which were then sampled into digital image data. Quantum noise was simulated and added to the analytical projection image data, which were then filtered to incorporate flat panel detector blurring. Additive system noise was generated and added to form the final projection images. The Feldkamp algorithm was implemented and used to reconstruct the 3D images of the phantoms. A 24 dual-Xeon PC cluster was used to compute the projections and reconstructed images in parallel with each CPU processing 10 projection views for a total of 450 views. Based on this computer simulation system, simulated cone beam CT images were generated for various phantoms and technique settings. Noise power spectra for the flat panel x-ray detector and reconstructed images were then computed to characterize the noise properties. As an example among the potential applications of our noise simulation model, we showed that images of low contrast objects can be produced and used for image quality evaluation
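As a rough illustration of how a Weyl sequence can be split across processes, the sketch below uses the textbook form x_n = frac(n * alpha) with an irrational alpha and a leapfrog distribution of indices. The multiplier `ALPHA`, the leapfrog scheme and all function names are assumptions for illustration; the abstract does not specify the generator's construction, and a bare Weyl sequence like this one is not a statistically strong generator on its own.

```python
import math

# Irrational multiplier; an arbitrary illustrative choice.
ALPHA = math.sqrt(2.0) - 1.0


def weyl(n):
    """n-th term of the Weyl sequence: fractional part of n * ALPHA."""
    return (n * ALPHA) % 1.0


def leapfrog_stream(rank, num_procs, count):
    """First `count` values owned by process `rank` out of `num_procs`.

    Process `rank` takes indices rank, rank + num_procs, ... so the
    per-process streams are disjoint and together cover the sequence.
    """
    return [weyl(rank + k * num_procs) for k in range(count)]


if __name__ == "__main__":
    for rank in range(3):
        print(rank, leapfrog_stream(rank, 3, 4))
```

Because each process computes its values directly from the index, no communication is needed to keep the streams from overlapping.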
An Alternative Algorithm for Computing Watersheds on Shared Memory Parallel Computers
Meijster, A.; Roerdink, J.B.T.M.
1995-01-01
In this paper a parallel implementation of a watershed algorithm is proposed. The algorithm can easily be implemented on shared memory parallel computers. The watershed transform is generally considered to be inherently sequential since the discrete watershed of an image is defined using recursion.
A Parallel Iterative Method for Computing Molecular Absorption Spectra.
Koval, Peter; Foerster, Dietrich; Coulaud, Olivier
2010-09-14
We describe a fast parallel iterative method for computing molecular absorption spectra within TDDFT linear response and using the LCAO method. We use a local basis of "dominant products" to parametrize the space of orbital products that occur in the LCAO approach. In this basis, the dynamic polarizability is computed iteratively within an appropriate Krylov subspace. The iterative procedure uses a matrix-free GMRES method to determine the (interacting) density response. The resulting code is about 1 order of magnitude faster than our previous full-matrix method. This acceleration makes the speed of our TDDFT code comparable with codes based on Casida's equation. The implementation of our method uses hybrid MPI and OpenMP parallelization in which load balancing and memory access are optimized. To validate our approach and to establish benchmarks, we compute spectra of large molecules on various types of parallel machines. The methods developed here are fairly general, and we believe they will find useful applications in molecular physics/chemistry, even for problems that are beyond TDDFT, such as organic semiconductors, particularly in photovoltaics.
Local rollback for fault-tolerance in parallel computing systems
Blumrich, Matthias A [Yorktown Heights, NY; Chen, Dong [Yorktown Heights, NY; Gara, Alan [Yorktown Heights, NY; Giampapa, Mark E [Yorktown Heights, NY; Heidelberger, Philip [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Steinmacher-Burow, Burkhard [Boeblingen, DE; Sugavanam, Krishnan [Yorktown Heights, NY
2012-01-24
A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.
Performing an allreduce operation on a plurality of compute nodes of a parallel computer
Faraj, Ahmad [Rochester, MN
2012-04-17
Methods, apparatus, and products are disclosed for performing an allreduce operation on a plurality of compute nodes of a parallel computer. Each compute node includes at least two processing cores. Each processing core has contribution data for the allreduce operation. Performing an allreduce operation on a plurality of compute nodes of a parallel computer includes: establishing one or more logical rings among the compute nodes, each logical ring including at least one processing core from each compute node; performing, for each logical ring, a global allreduce operation using the contribution data for the processing cores included in that logical ring, yielding a global allreduce result for each processing core included in that logical ring; and performing, for each compute node, a local allreduce operation using the global allreduce results for each processing core on that compute node.
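The two-stage scheme described above can be simulated serially in a few lines of Python. This is a sketch under stated assumptions: the reduction is a sum, ring `c` is assumed to link core `c` of every node, and the name `hierarchical_allreduce` is hypothetical; the patent does not prescribe these details.

```python
def hierarchical_allreduce(contrib):
    """Simulate a ring-then-local allreduce (sum) on nested lists.

    contrib[n][c] is the contribution of core c on node n.
    Returns (ring_results, final), where final[n][c] is the value
    every core holds after both stages.
    """
    num_nodes = len(contrib)
    cores_per_node = len(contrib[0])

    # Stage 1: one global allreduce per logical ring. Ring c is
    # assumed to contain core c of every compute node.
    ring_results = [
        sum(contrib[n][c] for n in range(num_nodes))
        for c in range(cores_per_node)
    ]

    # Stage 2: on each node, a local allreduce combines the ring
    # results held by that node's cores.
    node_total = sum(ring_results)
    final = [[node_total] * cores_per_node for _ in range(num_nodes)]
    return ring_results, final


if __name__ == "__main__":
    rings, final = hierarchical_allreduce([[1, 2], [3, 4], [5, 6]])
    print(rings, final)
```

Splitting the work this way means each ring carries only one core's share of the data between nodes, and the final combination stays inside a node.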
Runtime optimization of an application executing on a parallel computer
Faraj, Daniel A.; Smith, Brian E.
2013-01-29
Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.
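The branching described above reduces to a small predicate. The following Python sketch is only an illustration of that decision logic; the function name and the representation of call sites as a list of per-node identifiers are assumptions, not part of the patent.

```python
def should_establish_tuning_session(root_based, call_sites):
    """Decide whether a collective operation runs in a tuning session.

    root_based: True if the collective operation is root-based.
    call_sites: the call site identified by each compute node
                (hypothetical representation: one entry per node).
    """
    if not root_based:
        # Non-root-based collectives are always tuned.
        return True
    # Root-based collectives are tuned only if every compute node
    # identified the operation at the same call site.
    return len(set(call_sites)) == 1


if __name__ == "__main__":
    print(should_establish_tuning_session(False, ["a.c:10", "b.c:20"]))
    print(should_establish_tuning_session(True, ["a.c:10", "a.c:10"]))
    print(should_establish_tuning_session(True, ["a.c:10", "b.c:20"]))
```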
Development of real-time visualization system for Computational Fluid Dynamics on parallel computers
International Nuclear Information System (INIS)
Muramatsu, Kazuhiro; Otani, Takayuki; Matsumoto, Hideki; Takei, Toshifumi; Doi, Shun
1998-03-01
A real-time visualization system for computational fluid dynamics was developed for a network connecting a parallel computing server and a client terminal. Using the system, a user can visualize the results of a CFD (Computational Fluid Dynamics) simulation running on the parallel computer from a client terminal during the actual computation on the server. Using a GUI (Graphical User Interface) on the client terminal, the user is also able to change parameters of the analysis and visualization in real time during the calculation. The system carries out both the CFD simulation and the generation of pixel image data on the parallel computer, and compresses the data. Therefore, the amount of data sent from the parallel computer to the client is so small compared with the uncompressed case that images appear quickly on the client. Parallelization of image data generation is based on the Owner Computation Rule. The GUI on the client is built as a Java applet. Real-time visualization is thus possible on the client PC provided that a Web browser is installed on it. (author)
Event parallelism: Distributed memory parallel computing for high energy physics experiments
International Nuclear Information System (INIS)
Nash, T.
1989-05-01
This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. 6 figs
Blocksome, Michael A.; Mamidala, Amith R.
2013-09-03
Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
Faraj, Daniel A
2013-07-16
Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Data communications in a parallel active messaging interface of a parallel computer
Davis, Kristan D; Faraj, Daniel A
2013-07-09
Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
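Selecting an algorithm from non-overlapping message-size ranges can be sketched as a sorted-boundary lookup. The boundaries and algorithm names below are hypothetical placeholders; the abstract does not name specific ranges or algorithms, and this is not the PAMI API.

```python
import bisect

# Hypothetical boundaries (in bytes) separating the size ranges:
# [0, 1024), [1024, 65536), [65536, infinity).
RANGE_BOUNDARIES = [1024, 65536]

# One hypothetical algorithm name per range, in the same order.
ALGORITHMS = ["short", "medium", "large"]


def select_algorithm(message_size):
    """Return the algorithm associated with the range containing
    `message_size`; each algorithm owns exactly one range."""
    index = bisect.bisect_right(RANGE_BOUNDARIES, message_size)
    return ALGORITHMS[index]


if __name__ == "__main__":
    for size in (0, 1023, 1024, 65535, 65536):
        print(size, select_algorithm(size))
```

Keeping the boundaries sorted makes the lookup logarithmic in the number of ranges and guarantees that every message size maps to exactly one algorithm.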
Centaure: an heterogeneous parallel architecture for computer vision
International Nuclear Information System (INIS)
Peythieux, Marc
1997-01-01
This dissertation deals with the architecture of parallel computers dedicated to computer vision. In the first chapter, the problem to be solved is presented, as well as the architecture of the Sympati and Symphonie computers, on which this work is based. The second chapter covers the state of the art of computers and integrated processors that can execute computer vision and image processing codes. The third chapter contains a description of the architecture of Centaure. It has a heterogeneous structure: it is composed of a multiprocessor system based on the Analog Devices ADSP21060 Sharc digital signal processor, and of a set of Symphonie computers working in a multi-SIMD fashion. Centaure also has a modular structure. Its basic node is composed of one Symphonie computer, tightly coupled to a Sharc through a dual-ported memory. The nodes of Centaure are linked together by the Sharc communication links. The last chapter deals with a performance validation of Centaure. The execution times on Symphonie and on Centaure of a benchmark typical of industrial vision are presented and compared. First, these results show that the basic node of Centaure allows faster execution than Symphonie, and that increasing the size of the tested computer leads to a better speed-up with Centaure than with Symphonie. Second, these results validate the choice of running the low-level structure of Centaure in a multi-SIMD fashion. (author)
Semi-coarsening multigrid methods for parallel computing
Energy Technology Data Exchange (ETDEWEB)
Jones, J.E.
1996-12-31
Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.
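The difference between full coarsening and semi-coarsening can be made concrete by looking at grid sizes alone. The sketch below assumes vertex-centered grids with 2^k + 1 points per direction, so that halving the spacing leaves (n + 1) // 2 points; the function name `coarsen` is a hypothetical helper and is not taken from the paper.

```python
def coarsen(shape, directions):
    """Return the coarse-grid shape after one coarsening step.

    shape:      number of grid points in each coordinate direction.
    directions: set of axis indices to coarsen. Full coarsening
                passes every axis; semi-coarsening passes only one.
    Assumes vertex-centered grids with 2**k + 1 points per axis.
    """
    return tuple(
        (n + 1) // 2 if axis in directions else n
        for axis, n in enumerate(shape)
    )


if __name__ == "__main__":
    fine = (65, 65)
    print("full:", coarsen(fine, {0, 1}))
    print("semi:", coarsen(fine, {1}))
```

With semi-coarsening the coarse grid keeps about half of the fine-grid points rather than a quarter (in 2D), which is one reason its parallel cost profile differs from that of full coarsening.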
International Nuclear Information System (INIS)
Kimura, Toshiya.
1997-03-01
A two-dimensional explicit Euler solver has been implemented on five MIMD parallel computers of different machine architectures at the Center for Promotion of Computational Science and Engineering of the Japan Atomic Energy Research Institute. These parallel computers are the Fujitsu VPP300, NEC SX-4, CRAY T94, IBM SP2, and Hitachi SR2201. The code was parallelized by several parallelization methods, and a typical compressible flow problem was calculated for different grid sizes while changing the number of processors. The effective performance of the parallel calculations, such as calculation speed, speed-up ratio and parallel efficiency, has been investigated and evaluated. The communication time among processors has also been measured and evaluated. As a result, the differences in performance and characteristics between vector-parallel and scalar-parallel computers can be pointed out, and the results provide basic data for the efficient use of parallel computers and for large-scale CFD simulations on parallel computers. (author)
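For reference, the speed-up ratio and parallel efficiency mentioned above are conventionally defined as S = T1 / Tp and E = S / p, where T1 is the single-processor time and Tp the time on p processors. The short Python sketch below applies these standard definitions; the timings in the usage example are made-up numbers, not measurements from the report.

```python
def speedup(t_serial, t_parallel):
    """Speed-up ratio S = T1 / Tp."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_procs):
    """Parallel efficiency E = S / p."""
    return speedup(t_serial, t_parallel) / num_procs


if __name__ == "__main__":
    # Hypothetical timings: 120 s on 1 processor, 20 s on 8.
    print(speedup(120.0, 20.0))
    print(efficiency(120.0, 20.0, 8))
```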
International Nuclear Information System (INIS)
Nash, T.; Areti, H.; Atac, R.
1988-08-01
Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors, each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter- and intra-crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop system is under construction. 10 refs., 7 figs
Overview of Parallel Platforms for Common High Performance Computing
Directory of Open Access Journals (Sweden)
T. Fryza
2012-04-01
The paper deals with various parallel platforms used for high performance computing in the signal processing domain. More precisely, methods exploiting multicore central processing units, such as the Message Passing Interface and OpenMP, are taken into account. The properties of the programming methods are experimentally evaluated in the application of a fast Fourier transform and a discrete cosine transform, and they are compared with the capabilities of MATLAB's built-in functions and Texas Instruments digital signal processors with very long instruction word architectures. New FFT and DCT implementations were proposed and tested. The implementations were compared with CPU-based computing methods and with the capabilities of the Texas Instruments digital signal processing library on C6747 floating-point DSPs. The optimal combination of computing methods in the signal processing domain and the implementation of new, fast routines is proposed as well.
Final Report: Super Instruction Architecture for Scalable Parallel Computations
Energy Technology Data Exchange (ETDEWEB)
Sanders, Beverly Ann [Univ. of Florida, Gainesville, FL (United States)
2013-12-02
The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACES III software package and used it to study interesting scientific problems.
DMA shared byte counters in a parallel computer
Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos
2010-04-06
A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts
Energy Technology Data Exchange (ETDEWEB)
NONE
1997-12-31
This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.
Parallel computation of automatic differentiation applied to magnetic field calculations
International Nuclear Information System (INIS)
Hinkins, R.L.; Lawrence Berkeley Lab., CA
1994-09-01
The author presents a parallelization of an accelerator physics application to simulate magnetic field in three dimensions. The problem involves the evaluation of high order derivatives with respect to two variables of a multivariate function. Automatic differentiation software had been used with some success, but the computation time was prohibitive. The implementation runs on several platforms, including a network of workstations using PVM, a MasPar using MPFortran, and a CM-5 using CMFortran. A careful examination of the code led to several optimizations that improved its serial performance by a factor of 8.7. The parallelization produced further improvements, especially on the MasPar with a speedup factor of 620. As a result a problem that took six days on a SPARC 10/41 now runs in minutes on the MasPar, making it feasible for physicists at Lawrence Berkeley Laboratory to simulate larger magnets
Center for Programming Models for Scalable Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
John Mellor-Crummey
2008-02-29
Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implementation of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.
A scalable implementation of RI-SCF on parallel computers
International Nuclear Information System (INIS)
Fruechtl, H.A.; Kendall, R.A.; Harrison, R.J.
1996-01-01
In order to avoid the integral bottleneck of conventional SCF calculations, the Resolution of the Identity (RI) method is used to obtain an approximate solution to the Hartree-Fock equations. In this approximation only three-center integrals are needed to build the Fock matrix. It has been implemented as part of the NWChem package of portable and scalable ab initio programs for parallel computers. Utilizing the V-approximation, both the Coulomb and exchange contribution to the Fock matrix can be calculated from a transformed set of three-center integrals which have to be precalculated and stored. A distributed in-core method as well as a disk based implementation have been programmed. Details of the implementation as well as the parallel programming tools used are described. We also give results and timings from benchmark calculations
Simulation of partially coherent light propagation using parallel computing devices
Magalhães, Tiago C.; Rebordão, José M.
2017-08-01
Light acquires or loses coherence, and coherence is one of the few optical observables. Spectra can be derived from coherence functions, and understanding any interferometric experiment also relies upon coherence functions. Beyond the two limiting cases (full coherence or incoherence), the coherence of light is always partial and it changes with propagation. We have implemented a code to compute the propagation of partially coherent light from the source plane to the observation plane using parallel computing devices (PCDs). In this paper, we restrict the propagation to free space only. To this end, we used the Open Computing Language (OpenCL) and the open-source toolkit PyOpenCL, which gives access to OpenCL parallel computation through Python. To test our code, we chose two coherence source models: an incoherent source and a Gaussian Schell-model source. In the former case, we considered two different source shapes: circular and rectangular. The results were compared to the theoretical values. Our implemented code allows one to choose between the PyOpenCL implementation and a standard one, i.e., using the CPU only. To test the computation time for each implementation (PyOpenCL and standard), we used several computer systems with different CPUs and GPUs. We used powers of two for the dimensions of the cross-spectral density matrix (e.g. 32^4, 64^4), and a significant speed increase is observed in the PyOpenCL implementation when compared to the standard one. This can be an important tool for studying new source models.
Cluster implementation for parallel computation within MATLAB software environment
International Nuclear Information System (INIS)
Santana, Antonio O. de; Dantas, Carlos C.; Charamba, Luiz G. da R.; Souza Neto, Wilson F. de; Melo, Silvio B. Melo; Lima, Emerson A. de O.
2013-01-01
A cluster for parallel computation with MATLAB software, the COCGT (Cluster for Optimizing Computing in Gamma ray Transmission methods), is implemented. The implementation corresponds to the creation of a local network of computers, the installation and configuration of software, and the execution of cluster tests to determine and optimize data processing performance. The COCGT implementation was required for computation on data from gamma transmission measurements applied to fluid dynamics and tomographic reconstruction in an FCC (Fluid Catalytic Cracking) cold pilot unit, as well as on simulation data. As an initial test, the determination of the SVD (Singular Value Decomposition) of a random matrix with dimension (n, n), n=1000, using the modified Girko's law, revealed that COCGT was faster than the cluster in the literature [1], which is similar and operates under the same conditions. The solution of a system of linear equations provided a further test of COCGT performance: for a square matrix with n=10000 the computing time was 27 s, and for a square matrix with n=12000 the computation time was 45 s. To determine the cluster behavior in relation to 'parfor' (parallel for-loop) and 'spmd' (single program multiple data), two codes were used containing those two commands and the same problem: determination of the SVD of a square matrix with n=1000. The execution of the codes on COCGT showed that: 1) for the code with 'parfor', the performance improved as the number of labs increased from 1 to 8; 2) for the code with 'spmd', just 1 lab (core) was enough to process and give results in less than 1 s. A similar test followed, with the difference that the SVD was determined for a square matrix with n=1500 for the code with 'parfor', and n=7000 for the code with 'spmd'. Those results lead to the conclusions: 1) for the code with 'parfor', the behavior was the same as described above; 2) for code with 'spmd', the same besides having produced a larger performance, it supports a
Review of An Introduction to Parallel and Vector Scientific Computing
Energy Technology Data Exchange (ETDEWEB)
Bailey, David H.; Lefton, Lew
2006-06-30
On one hand, the field of high-performance scientific computing is thriving beyond measure. Performance of leading-edge systems on scientific calculations, as measured say by the Top500 list, has increased by an astounding factor of 8000 during the 15-year period from 1993 to 2008, which is slightly faster even than Moore's Law. Even more importantly, remarkable advances in numerical algorithms, numerical libraries and parallel programming environments have led to improvements in the scope of what can be computed that are entirely on a par with the advances in computing hardware. And these successes have spread far beyond the confines of large government-operated laboratories: many universities, modest-sized research institutes and private firms now operate clusters that differ only in scale from the behemoth systems at the large-scale facilities. In the wake of these recent successes, researchers from fields that heretofore have not been part of the scientific computing world have been drawn into the arena. For example, at the recent SC07 conference, the exhibit hall, which long has hosted displays from leading computer systems vendors and government laboratories, featured some 70 exhibitors who had not previously participated. In spite of all these exciting developments, and in spite of the clear need to present these concepts to a much broader technical audience, there is a perplexing dearth of training material and textbooks in the field, particularly at the introductory level. Only a handful of universities offer coursework in the specific area of highly parallel scientific computing, and instructors of such courses typically rely on custom-assembled material. For example, the present reviewer and Robert F. Lucas relied on materials assembled in a somewhat ad hoc fashion from colleagues and personal resources when presenting a course on parallel scientific computing at the University of California, Berkeley, a few years ago. Thus it is indeed refreshing
A hybrid method for the parallel computation of Green's functions
International Nuclear Information System (INIS)
Petersen, Dan Erik; Li Song; Stokbro, Kurt; Sorensen, Hans Henrik B.; Hansen, Per Christian; Skelboe, Stig; Darve, Eric
2009-01-01
Quantum transport models for nanodevices using the non-equilibrium Green's function method require the repeated calculation of the block tridiagonal part of the Green's and lesser Green's function matrices. This problem is related to the calculation of the inverse of a sparse matrix. Because of the large number of times this calculation needs to be performed, this is computationally very expensive even on supercomputers. The classical approach is based on recurrence formulas which cannot be efficiently parallelized. This practically prevents the solution of large problems with hundreds of thousands of atoms. We propose new recurrences for a general class of sparse matrices to calculate Green's and lesser Green's function matrices which extend formulas derived by Takahashi and others. We show that these recurrences may lead to a dramatically reduced computational cost because they only require computing a small number of entries of the inverse matrix. Then, we propose a parallelization strategy for block tridiagonal matrices which involves a combination of Schur complement calculations and cyclic reduction. It achieves good scalability even on problems of modest size.
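To make the "classical approach based on recurrence formulas" concrete, the following is a minimal sketch for the scalar tridiagonal case (the block tridiagonal case replaces divisions by small matrix inverses). It is not the authors' new method; the function name and argument layout are hypothetical. Each step of both sweeps depends on the previous one, which illustrates why such recurrences are hard to parallelize.

```python
def inverse_diagonal_tridiagonal(diag, lower, upper):
    """Return the diagonal of A^{-1} for a scalar tridiagonal matrix A.

    Hypothetical layout (an assumption of this sketch):
      diag[i]  = A[i][i]
      lower[i] = A[i+1][i]
      upper[i] = A[i][i+1]
    """
    n = len(diag)

    # Forward sweep: g[i] is the last diagonal entry of the inverse of the
    # leading (i+1) x (i+1) sub-matrix. Step i needs the result of step i-1.
    g = [0.0] * n
    g[0] = 1.0 / diag[0]
    for i in range(1, n):
        g[i] = 1.0 / (diag[i] - lower[i - 1] * g[i - 1] * upper[i - 1])

    # Backward sweep: correct each g[i] with the coupling to the rows below.
    # Step i needs the result of step i+1.
    inv_diag = [0.0] * n
    inv_diag[n - 1] = g[n - 1]
    for i in range(n - 2, -1, -1):
        inv_diag[i] = g[i] * (1.0 + upper[i] * inv_diag[i + 1] * lower[i] * g[i])
    return inv_diag


# Usage: the 3 x 3 second-difference matrix with 2 on the diagonal, -1 off it.
print(inverse_diagonal_tridiagonal([2.0, 2.0, 2.0], [-1.0, -1.0], [-1.0, -1.0]))
```

Only the diagonal of the inverse is produced, without forming the full inverse, which mirrors the abstract's point that only a small number of entries of the inverse matrix are needed.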
SPINET: A Parallel Computing Approach to Spine Simulations
Directory of Open Access Journals (Sweden)
Peter G. Kropf
1996-01-01
Research in scientific programming enables us to realize more and more complex applications, and on the other hand, application-driven demands on computing methods and power are continuously growing. Therefore, interdisciplinary approaches become more widely used. The interdisciplinary SPINET project presented in this article applies modern scientific computing tools to biomechanical simulations: parallel computing and symbolic and modern functional programming. The target application is the human spine. Simulations of the spine help us to investigate and better understand the mechanisms of back pain and spinal injury. Two approaches have been used: the first uses the finite element method for high-performance simulations of static biomechanical models, and the second generates a simulation development tool for experimenting with different dynamic models. A finite element program for static analysis has been parallelized for the MUSIC machine. To solve the sparse system of linear equations, a conjugate gradient solver (iterative method) and a frontal solver (direct method) have been implemented. The preprocessor required for the frontal solver is written in the modern functional programming language SML, the solver itself in C, thus exploiting the characteristic advantages of both functional and imperative programming. The speedup analysis of both solvers shows very satisfactory results for this irregular problem. A mixed symbolic-numeric environment for rigid body system simulations is presented. It automatically generates C code from a problem specification expressed by the Lagrange formalism using Maple.
Large-scale parallel genome assembler over cloud computing environment.
Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong
2017-06-01
The size of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications have started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay-as-you-go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, both developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance need further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) on a traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure instead of a traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of a traditional HPC cluster.
Moon, Hongsik
What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited from the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared by their performance on benchmark software, and the metric was FLoating-point Operations Per Second (FLOPS), which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore systems? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to the type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPS and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the
Particle orbit tracking on a parallel computer: Hypertrack
International Nuclear Information System (INIS)
Cole, B.; Bourianoff, G.; Pilat, F.; Talman, R.
1991-05-01
A program has been written which performs particle orbit tracking on the Intel iPSC/860 distributed memory parallel computer. The tracking is performed using a thin element approach. A brief description of the structure and performance of the code is presented, along with applications of the code to the analysis of accelerator lattices for the SSC. The concept of "ensemble tracking", i.e. the tracking of ensemble averages of noninteracting particles, such as the emittance, is presented. Preliminary results of such studies will be presented. 2 refs., 6 figs.
Routing performance analysis and optimization within a massively parallel computer
Archer, Charles Jens; Peters, Amanda; Pinnow, Kurt Walter; Swartz, Brent Allen
2013-04-16
An apparatus, program product and method optimize the operation of a massively parallel computer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.
Parallel Computation of Persistent Homology using the Blowup Complex
Energy Technology Data Exchange (ETDEWEB)
Lewis, Ryan [Stanford Univ., CA (United States)]; Morozov, Dmitriy [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)]
2015-04-27
We describe a parallel algorithm that computes persistent homology, an algebraic descriptor of a filtered topological space. Our algorithm is distinguished by operating on a spatial decomposition of the domain, as opposed to a decomposition with respect to the filtration. We rely on a classical construction, called the Mayer-Vietoris blowup complex, to glue global topological information about a space from its disjoint subsets. We introduce an efficient algorithm to perform this gluing operation, which may be of independent interest, and describe how to process the domain hierarchically. We report on a set of experiments that help assess the strengths and identify the limitations of our method.
Representing and computing regular languages on massively parallel networks.
Miller, M I; Roysam, B; Smith, K R; O'Sullivan, J A
1991-01-01
A general method is proposed for incorporating rule-based constraints corresponding to regular languages into stochastic inference problems, thereby allowing for a unified representation of stochastic and syntactic pattern constraints. The authors' approach establishes the formal connection of rules to Chomsky grammars and generalizes the original work of Shannon on the encoding of rule-based channel sequences to Markov chains of maximum entropy. This maximum entropy probabilistic view leads to Gibbs representations with potentials whose number of minima grows at precisely the exponential rate at which the language of deterministically constrained sequences grows. These representations are coupled to stochastic diffusion algorithms, which sample the language-constrained sequences by visiting the energy minima according to the underlying Gibbs probability law. This coupling yields the result that fully parallel stochastic cellular automata can be derived to generate samples from the rule-based constraint sets. The production rules and neighborhood state structure of the language of sequences directly determine the necessary connection structures of the required parallel computing surface. Representations of this type have been mapped to the DAP-510 massively parallel processor consisting of 1024 mesh-connected bit-serial processing elements for performing automated segmentation of electron-micrograph images.
Representing and computing regular languages on massively parallel networks
Energy Technology Data Exchange (ETDEWEB)
Miller, M.I.; O'Sullivan, J.A. (Electronic Systems and Research Lab., of Electrical Engineering, Washington Univ., St. Louis, MO (US)); Roysam, B. (Dept. of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Inst., Troy, NY (US)); Smith, K.R. (Dept. of Electrical Engineering, Southern Illinois Univ., Edwardsville, IL (US))
1991-01-01
This paper proposes a general method for incorporating rule-based constraints corresponding to regular languages into stochastic inference problems, thereby allowing for a unified representation of stochastic and syntactic pattern constraints. The authors' approach first establishes the formal connection of rules to Chomsky grammars, and generalizes the original work of Shannon on the encoding of rule-based channel sequences to Markov chains of maximum entropy. This maximum entropy probabilistic view leads to Gibbs representations with potentials whose number of minima grows at precisely the exponential rate at which the language of deterministically constrained sequences grows. These representations are coupled to stochastic diffusion algorithms, which sample the language-constrained sequences by visiting the energy minima according to the underlying Gibbs probability law. The coupling to stochastic search methods yields the all-important practical result that fully parallel stochastic cellular automata may be derived to generate samples from the rule-based constraint sets. The production rules and neighborhood state structure of the language of sequences directly determine the necessary connection structures of the required parallel computing surface. Representations of this type have been mapped to the DAP-510 massively parallel processor consisting of 1024 mesh-connected bit-serial processing elements for performing automated segmentation of electron-micrograph images.
Parallel Computer System for 3D Visualization Stereo on GPU
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on Graphics Processing Units (GPUs) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of the intersections of tracing rays with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. A generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration of the multithreaded implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of the size and configuration of the Compute Unified Device Architecture (CUDA) computational network on the computational speed shows the importance of their correct selection. The obtained experimental estimates can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as by an optimized configuration of the CUDA computational network.
Signal processing applications of massively parallel charge domain computing devices
Fijany, Amir (Inventor); Barhen, Jacob (Inventor); Toomarian, Nikzad (Inventor)
1999-01-01
The present invention is embodied in a charge coupled device (CCD)/charge injection device (CID) architecture capable of performing a Fourier transform by simultaneous matrix vector multiplication (MVM) operations in respective plural CCD/CID arrays in parallel in O(1) steps. For example, in one embodiment, a first CCD/CID array stores charge packets representing a first matrix operator based upon permutations of a Hartley transform and computes the Fourier transform of an incoming vector. A second CCD/CID array stores charge packets representing a second matrix operator based upon different permutations of a Hartley transform and computes the Fourier transform of an incoming vector. The incoming vector is applied to the inputs of the two CCD/CID arrays simultaneously, and the real and imaginary parts of the Fourier transform are produced simultaneously in the time required to perform a single MVM operation in a CCD/CID array.
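The idea of obtaining the real and imaginary parts of a Fourier transform from two matrix-vector products can be illustrated numerically. The sketch below is an assumption about what "permutations of a Hartley transform" amounts to: it uses the textbook identity relating the discrete Hartley and Fourier transforms of a real input vector, with the row permutation k -> (-k) mod n. The function names are hypothetical, and nothing here models the CCD/CID hardware.

```python
import math


def hartley_matrix(n):
    """Discrete Hartley matrix H[k][j] = cos(t) + sin(t), t = 2*pi*k*j/n."""
    return [[math.cos(2 * math.pi * k * j / n) + math.sin(2 * math.pi * k * j / n)
             for j in range(n)] for k in range(n)]


def fourier_operators(n):
    """Build two real matrices from H and its row-permuted copy.

    real_op @ x gives Re(DFT(x)) and imag_op @ x gives Im(DFT(x)) for real x,
    using the convention X[k] = sum_j x[j] * exp(-2*pi*i*k*j/n).
    """
    h = hartley_matrix(n)
    real_op = [[0.5 * (h[k][j] + h[(-k) % n][j]) for j in range(n)]
               for k in range(n)]
    imag_op = [[0.5 * (h[(-k) % n][j] - h[k][j]) for j in range(n)]
               for k in range(n)]
    return real_op, imag_op


def matvec(matrix, vector):
    """Plain matrix-vector product (one MVM operation)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Usage: the two products are independent, so they could run side by side.
x = [1.0, 2.0, 0.0, -1.0]
real_op, imag_op = fourier_operators(len(x))
real_part = matvec(real_op, x)
imag_part = matvec(imag_op, x)
print(real_part, imag_part)
```

Because the two operators are fixed once n is chosen, they can be stored ahead of time, and the input vector is applied to both at once, which is the property the abstract relies on.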
Parallel computing-based sclera recognition for human identification
Lin, Yong; Du, Eliza Y.; Zhou, Zhi
2012-06-01
Compared to iris recognition, sclera recognition, which uses a line descriptor, can achieve comparable recognition accuracy in visible wavelengths. However, this method is too time-consuming to be implemented in a real-time system. In this paper, we propose a GPU-based parallel computing approach to reduce the sclera recognition time. We define a new descriptor to which information about the KD tree structure and the sclera edge is added. The registration and matching task is divided into subtasks of various sizes according to their computational complexity. All affine transform parameters are generated by searching the KD tree. Texture memory, constant memory, and shared memory are used to store templates and transform matrices. The experimental results show that the proposed method executed on a GPU can improve the sclera matching speed by hundreds of times without decreasing accuracy.
Progress in Parallel Schur Complement Preconditioning for Computational Fluid Dynamics
Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
We consider preconditioning methods for nonself-adjoint advective-diffusive systems based on a non-overlapping Schur complement procedure for arbitrary triangulated domains. The ultimate goal of this research is to develop scalable preconditioning algorithms for fluid flow discretizations on parallel computing architectures. In our implementation of the Schur complement preconditioning technique, the triangulation is first partitioned into a number of subdomains using the METIS multi-level k-way partitioning code. This partitioning induces a natural 2X2 partitioning of the p.d.e. discretization matrix. By considering various inverse approximations of the 2X2 system, we have developed a family of robust preconditioning techniques. A computer code based on these ideas has been developed and tested on the IBM SP2 and the SGI Power Challenge array using MPI message passing protocol. A number of example CFD calculations will be presented to illustrate and assess various Schur complement approximations.
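The "natural 2X2 partitioning" can be made concrete with a small exact example. In the sketch below the unknowns are split into interior (I) and interface (G) sets, and the system is solved through the Schur complement S = A_GG - A_GI A_II^{-1} A_IG. This is only an illustration under stated assumptions: it uses tiny dense matrices and exact solves, whereas the abstract concerns approximate inverses used as preconditioners on METIS-partitioned meshes. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small, dense)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def matvec(a, x):
    return [sum(p * q for p, q in zip(row, x)) for row in a]


def schur_solve(a_ii, a_ig, a_gi, a_gg, f_i, f_g):
    """Solve [[A_II, A_IG], [A_GI, A_GG]] [x_I; x_G] = [f_I; f_G]."""
    n_i, n_g = len(a_ii), len(a_gg)
    # Columns of W = A_II^{-1} A_IG: one interior solve per interface unknown.
    w_cols = [solve(a_ii, [row[j] for row in a_ig]) for j in range(n_g)]
    # Schur complement S = A_GG - A_GI W.
    s = [[a_gg[r][j] - sum(a_gi[r][k] * w_cols[j][k] for k in range(n_i))
          for j in range(n_g)] for r in range(n_g)]
    # Reduced interface system: S x_G = f_G - A_GI A_II^{-1} f_I.
    y = solve(a_ii, f_i)
    rhs_g = [f_g[r] - v for r, v in enumerate(matvec(a_gi, y))]
    x_g = solve(s, rhs_g)
    # Back-substitute for the interior unknowns.
    rhs_i = [f_i[r] - v for r, v in enumerate(matvec(a_ig, x_g))]
    x_i = solve(a_ii, rhs_i)
    return x_i, x_g


# Usage: two interior unknowns, one interface unknown.
x_i, x_g = schur_solve([[4.0, 1.0], [1.0, 3.0]], [[1.0], [0.0]],
                       [[1.0, 0.0]], [[2.0]], [9.0, 7.0], [7.0])
print(x_i, x_g)
```

With several subdomains, A_II is block diagonal, so the interior solves are independent across subdomains; replacing the exact interior solves or S with cheaper approximations is what turns this procedure into a preconditioner.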
Application Specific Performance Technology for Productive Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Malony, Allen D. [Univ. of Oregon, Eugene, OR (United States); Shende, Sameer [Univ. of Oregon, Eugene, OR (United States)
2008-09-30
Our accomplishments over the last three years of the DOE project Application-Specific Performance Technology for Productive Parallel Computing (DOE Agreement: DE-FG02-05ER25680) are described below. The project will have met all of its objectives by the time of its completion at the end of September 2008. Two extensive yearly progress reports were produced in March 2006 and 2007 and were previously submitted to the DOE Office of Advanced Scientific Computing Research (OASCR). Following an overview of the objectives of the project, we summarize for each of the project areas the achievements in the first two years, and then describe in some more detail the project accomplishments this past year. At the end, we discuss the relationship of the proposed renewal application to the work done on the current project.
Energy Proportionality and Performance in Data Parallel Computing Clusters
Energy Technology Data Exchange (ETDEWEB)
Kim, Jinoh; Chou, Jerry; Rotem, Doron
2011-02-14
Energy consumption in datacenters has recently become a major concern due to rising operational costs and scalability issues. Recent solutions to this problem propose the principle of energy proportionality, i.e., the amount of energy consumed by the server nodes must be proportional to the amount of work performed. For data parallelism and fault tolerance purposes, most common file systems used in MapReduce-type clusters maintain a set of replicas for each data block. A covering set is a group of nodes that together contain at least one replica of the data blocks needed for performing computing tasks. In this work, we develop and analyze algorithms to maintain energy proportionality by discovering a covering set that minimizes energy consumption while placing the remaining nodes in low-power standby mode. Our algorithms can also discover covering sets in heterogeneous computing environments. In order to allow more data parallelism, we generalize our algorithms so that they can discover k-covering sets, i.e., sets of nodes that contain at least k replicas of the data blocks. Our experimental results show that we can achieve substantial energy savings without significant performance loss in diverse cluster configurations and working environments.
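A covering set can be illustrated with a small greedy heuristic. The sketch below is not the authors' algorithm: it is the standard greedy approach to weighted set cover, used here as a stand-in, and the node names, block ids and power figures are invented for illustration.

```python
def greedy_covering_set(replicas, power):
    """Pick nodes until every data block has at least one replica on a chosen node.

    replicas: dict mapping node name -> set of block ids stored on that node
    power:    dict mapping node name -> power cost of keeping the node active
    Greedy rule (an assumption of this sketch): repeatedly take the node that
    covers the most still-uncovered blocks per unit of power.
    """
    uncovered = set().union(*replicas.values())
    chosen = []
    while uncovered:
        candidates = [n for n in replicas
                      if n not in chosen and replicas[n] & uncovered]
        best = max(candidates,
                   key=lambda n: (len(replicas[n] & uncovered) / power[n], n))
        chosen.append(best)
        uncovered -= replicas[best]
    return chosen


# Usage with hypothetical data: four nodes, four blocks, equal power cost.
replicas = {"n1": {1, 2, 3}, "n2": {3, 4}, "n3": {1, 4}, "n4": {2}}
power = {"n1": 100.0, "n2": 100.0, "n3": 100.0, "n4": 100.0}
active = greedy_covering_set(replicas, power)
standby = sorted(set(replicas) - set(active))
print(active, standby)
```

Nodes outside the returned set are the candidates for standby mode. A k-covering variant would keep a per-block counter and stop only once every block has been covered k times.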
Gaudiani, Adriana; Carusela, Florencia; Soba, Alejandro
2013-01-01
A great challenge for scientists is to execute their computational applications efficiently. Nowadays, parallel programming has become fundamental to achieving this goal. High-performance computing provides a way to exploit parallel architectures in order to get optimal performance. The parallel programming model and the system architecture maximize the benefits if, together, they suit the inherent parallelism of the problem. We compared three parallelized versions ...
Concurrent computation of attribute filters on shared memory parallel machines
Wilkinson, Michael H.F.; Gao, Hui; Hesselink, Wim H.; Jonker, Jan-Eppo; Meijster, Arnold
2008-01-01
Morphological attribute filters have not previously been parallelized mainly because they are both global and nonseparable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings, and thickenings,
Parallel Computation of Unsteady Flows on a Network of Workstations
1997-01-01
Parallel computation of unsteady flows requires significant computational resources. The utilization of a network of workstations seems an efficient solution to the problem, where large problems can be treated at a reasonable cost. This approach requires the solution of several problems: 1) the partitioning and distribution of the problem over a network of workstations, 2) efficient communication tools, 3) managing the system efficiently for a given problem. Of course, there is also the question of how efficiently any given numerical algorithm maps onto such a computing system. The NPARC code was chosen as a sample application. For the explicit version of the NPARC code, both two- and three-dimensional problems were studied. Again, both steady and unsteady problems were investigated. The issues studied as a part of the research program were: 1) how to distribute the data between the workstations, 2) how to compute and how to communicate at each node efficiently, 3) how to balance the load distribution. In the following, a summary of these activities is presented. Details of the work have been presented and published as referenced.
Simple, parallel, high-performance virtual machines for extreme computations
International Nuclear Information System (INIS)
Chokoufe Nejad, Bijan; Ohl, Thorsten; Reuter, Jurgen
2014-11-01
We introduce a high-performance virtual machine (VM) written in a numerically fast language like Fortran or C to evaluate very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte code from the optimal matrix element generator, O'Mega. Furthermore, this approach allows one to formulate the parallel computation of a single phase space point in a simple and obvious way. We analyze the scaling behaviour with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and in general has runtimes of the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.
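The general concept of evaluating a large expression through a VM, rather than compiling it, can be sketched with a toy stack machine. The instruction set below is hypothetical and far simpler than the byte code produced by O'Mega; it only shows that a fixed interpreter loop plus a data file of instructions replaces the compile and link steps.

```python
# Hypothetical opcodes for a toy expression VM (not the O'Mega byte code).
PUSH_CONST, LOAD_VAR, ADD, MUL = range(4)


def run(bytecode, variables):
    """Evaluate a list of (opcode, argument) pairs on an operand stack."""
    stack = []
    for opcode, argument in bytecode:
        if opcode == PUSH_CONST:
            stack.append(argument)
        elif opcode == LOAD_VAR:
            stack.append(variables[argument])
        elif opcode == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return stack.pop()


# Usage: byte code for the expression (x + 2) * y.
program = [
    (LOAD_VAR, "x"),
    (PUSH_CONST, 2.0),
    (ADD, None),
    (LOAD_VAR, "y"),
    (MUL, None),
]
print(run(program, {"x": 3.0, "y": 4.0}))
```

The interpreter is compiled once; a new expression only needs a new instruction list, which can be read from a file of any size.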
Parallel processing using an optical delay-based reservoir computer
Van der Sande, Guy; Nguimdo, Romain Modeste; Verschaffelt, Guy
2016-04-01
Delay systems subject to delayed optical feedback have recently shown great potential in solving computationally hard tasks. By implementing a neuro-inspired computational scheme relying on the transient response to optical data injection, high processing speeds have been demonstrated. However, reservoir computing systems based on delay dynamics discussed in the literature are designed by coupling many different stand-alone components, which leads to bulky, non-monolithic systems that lack long-term stability. Here we numerically investigate the possibility of implementing reservoir computing schemes based on semiconductor ring lasers (SRLs). Semiconductor ring lasers are semiconductor lasers where the laser cavity consists of a ring-shaped waveguide. SRLs are highly integrable and scalable, making them ideal candidates for key components in photonic integrated circuits. SRLs can generate light in two counterpropagating directions between which bistability has been demonstrated. We demonstrate that two independent machine learning tasks, even with inputs of different nature and different input data signals, can be computed simultaneously using a single photonic nonlinear node, relying on the parallelism offered by photonics. We illustrate the performance on simultaneous chaotic time series prediction and classification for nonlinear channel equalization. We take advantage of different directional modes to process individual tasks. Each directional mode processes one individual task to mitigate possible crosstalk between the tasks. Our results indicate that prediction/classification with errors comparable to the state-of-the-art performance can be obtained even with noise, despite the two tasks being computed simultaneously. We also find that good performance is obtained for both tasks over a broad range of the parameters. The results are discussed in detail in [Nguimdo et al., IEEE Trans. Neural Netw. Learn. Syst. 26, pp. 3301-3307, 2015].
Cloud identification using genetic algorithms and massively parallel computation
Buckles, Bill P.; Petry, Frederick E.
1996-01-01
As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkably consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMs) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user
Low latency, high bandwidth data communications between compute nodes in a parallel computer
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.
2010-11-02
Methods, parallel computers, and computer program products are disclosed for low latency, high bandwidth data communications between compute nodes in a parallel computer. Embodiments include receiving, by an origin direct memory access (`DMA`) engine of an origin compute node, data for transfer to a target compute node; sending, by the origin DMA engine of the origin compute node to a target DMA engine on the target compute node, a request to send (`RTS`) message; transferring, by the origin DMA engine, a predetermined portion of the data to the target compute node using a memory FIFO operation; determining, by the origin DMA engine, whether an acknowledgement of the RTS message has been received from the target DMA engine; if the acknowledgement of the RTS message has not been received, transferring, by the origin DMA engine, another predetermined portion of the data to the target compute node using a memory FIFO operation; and if the acknowledgement of the RTS message has been received by the origin DMA engine, transferring, by the origin DMA engine, any remaining portion of the data to the target compute node using a direct put operation.
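The sequence of steps in this abstract can be traced with a small simulation. The sketch below is purely illustrative: the function name, the step labels and the `ack_after_chunks` parameter (how many memory FIFO chunks go out before the acknowledgement is observed) are assumptions, and no real DMA hardware or messaging library is involved.

```python
def plan_transfer(data, chunk_size, ack_after_chunks):
    """Return the list of (operation, payload) steps the origin would issue.

    data:             bytes to transfer
    chunk_size:       size of each "predetermined portion" (must be > 0)
    ack_after_chunks: simulated number of memory FIFO chunks sent before the
                      RTS acknowledgement is seen by the origin
    """
    steps = [("rts", b"")]  # request-to-send goes out first
    offset = 0
    fifo_chunks = 0
    while offset < len(data):
        # No acknowledgement seen yet: keep the pipe busy with a FIFO chunk.
        chunk = data[offset:offset + chunk_size]
        steps.append(("memory_fifo", chunk))
        offset += len(chunk)
        fifo_chunks += 1
        # Acknowledgement seen: move everything that is left in one direct put.
        if fifo_chunks >= ack_after_chunks and offset < len(data):
            steps.append(("direct_put", data[offset:]))
            offset = len(data)
    return steps


# Usage: the acknowledgement shows up after two 2-byte chunks.
for operation, payload in plan_transfer(b"abcdefghij", 2, 2):
    print(operation, payload)
```

Sending small portions while waiting hides the round-trip latency of the handshake, and switching to the direct put once the target is ready gives the bulk of the message the high-bandwidth path.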
Pacing a data transfer operation between compute nodes on a parallel computer
Blocksome, Michael A [Rochester, MN]
2011-09-13
Methods, systems, and products are disclosed for pacing a data transfer between compute nodes on a parallel computer that include: transferring, by an origin compute node, a chunk of an application message to a target compute node; sending, by the origin compute node, a pacing request to a target direct memory access (`DMA`) engine on the target compute node using a remote get DMA operation; determining, by the origin compute node, whether a pacing response to the pacing request has been received from the target DMA engine; and transferring, by the origin compute node, a next chunk of the application message if the pacing response to the pacing request has been received from the target DMA engine.
Nordic Summer School on Parallel Computing in Optimization
Pardalos, Panos; Storøy, Sverre
1997-01-01
During the last three decades, breakthroughs in computer technology have made a tremendous impact on optimization. In particular, parallel computing has made it possible to solve larger and computationally more difficult problems. This volume contains mainly lecture notes from a Nordic Summer School held at the Linköping Institute of Technology, Sweden in August 1995. In order to make the book more complete, a few authors were invited to contribute chapters that were not part of the course on this first occasion. The purpose of this Nordic course in advanced studies was three-fold. One goal was to introduce the students to the new achievements in a new and very active field, bring them close to world leading researchers, and strengthen their competence in an area with internationally explosive rate of growth. A second goal was to strengthen the bonds between students from different Nordic countries, and to encourage collaboration and joint research ventures over the borders. In this respect, the course bui...
Exploiting optical waveguides in general-purpose parallel computing
Davis, Martin H., Jr.; Ramachandran, Umakishore
1993-02-01
We motivate our interest in examining how optics can be used in a class of general purpose parallel computer architectures called Distributed Shared Memory (DSM). We describe an abstract DSM architecture called Beehive that incorporates a weak memory model called Buffered Consistency. We propose a specific optical implementation of Beehive called OBee. This optical implementation uses optical waveguides to implement an interconnection network called Optical Broadcast Rings (OBRs). The OBRs are used in OBee as part of the hybrid electronic/optical hardware support for cache coherency and three types of synchronization (locks, barriers, and combining F&OPs). We also use the OBRs to propose purely optical hardware support for the locks and barriers.
Optimized collectives using a DMA on a parallel computer
Energy Technology Data Exchange (ETDEWEB)
Chen, Dong [Croton On Hudson, NY]; Gabor, Dozsa [Ardsley, NY]; Giampapa, Mark E [Irvington, NY]; Heidelberger, Phillip [Cortlandt Manor, NY]
2011-02-08
Optimizing collective operations using direct memory access controller on a parallel computer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. The byte counter includes at least a base address of memory and a byte count associated with a submessage. A byte counter associated with a submessage is monitored to determine whether at least a block of data of the submessage has been received. The block of data has a predetermined size, for example, a number of bytes. The block is processed when the block has been fully received, for example, when the byte count indicates all bytes of the block have been received. The monitoring and processing may continue for all blocks in all submessages in the message.
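The monitoring loop described in this abstract can be sketched as follows. This is a minimal software illustration, not the patented hardware mechanism: the `ByteCounter` class, the `poll` function, the counting-up convention, and the simulated arrival pattern in `demo` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ByteCounter:
    """Hypothetical per-submessage counter: a base address plus bytes received."""
    base_address: int
    byte_count: int = 0   # bytes the (simulated) DMA engine has written so far
    processed: int = 0    # bytes already handed to the processing step


def poll(counter, block_size, total_size, process):
    """Process every block that is fully received but not yet processed."""
    while counter.processed < total_size:
        end = min(counter.processed + block_size, total_size)
        if counter.byte_count < end:
            break  # the next block is still incomplete
        process(counter.base_address + counter.processed, end - counter.processed)
        counter.processed = end


def demo():
    """Simulate 10 bytes arriving in uneven chunks, processed in 4-byte blocks."""
    counter = ByteCounter(base_address=0x1000)
    seen = []
    for arrived in (3, 2, 4, 1):  # bytes delivered per simulated DMA step
        counter.byte_count += arrived
        poll(counter, block_size=4, total_size=10,
             process=lambda addr, n: seen.append((addr, n)))
    return seen
```

Calling `demo()` returns the (address, length) pairs of the blocks in the order they became complete; the last block is shorter because the submessage length is not a multiple of the block size.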
Design of a Parallel Robotic Manipulator Using Evolutionary Computing
Directory of Open Access Journals (Sweden)
António M. Lopes
2012-05-01
In this paper the kinematic design of a 6-dof parallel robotic manipulator is analysed. Firstly, the condition number of the inverse kinematic Jacobian is considered as the objective function, measuring the manipulator's dexterity, and a genetic algorithm is used to solve the optimization problem. In a second approach, a neural network model of the analytical objective function is developed and subsequently used as the objective function in the genetic algorithm optimization search process. It is shown that the neuro-genetic algorithm can find close to optimal solutions for maximum dexterity, significantly reducing the computational burden. The sensitivity of the condition number in the robot's workspace is analysed and used to guide the designer in choosing the best structural configuration. Finally, a global optimization problem is also addressed.
Ford, Eric B.; Dindar, Saleh; Peters, Jorg
2015-08-01
The realism of astrophysical simulations and statistical analyses of astronomical data are set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism, rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and Intel Xeon Phi. Successfully harnessing these new architectures requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities and network characteristics. I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than an order of magnitude speed-up and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs. I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer
Low cost, highly effective parallel computing achieved through a Beowulf cluster.
Bitner, Marc; Skelton, Gordon
2003-01-01
A Beowulf cluster is a means of bringing together several computers and using software and network components to make this cluster of computers appear and function as one computer with multiple parallel computing processors. A cluster of computers can provide computing power comparable to that usually found only in very expensive supercomputers or servers.
An Accurate liver segmentation method using parallel computing algorithm
International Nuclear Information System (INIS)
Elbasher, Eiman Mohammed Khalied
2014-12-01
Computed Tomography (CT or CAT scan) is a noninvasive diagnostic imaging procedure that uses a combination of X-rays and computer technology to produce horizontal, or axial, images (often called slices) of the body. A CT scan shows detailed images of any part of the body, including the bones, muscles, fat and organs. CT scans are more detailed than standard X-rays. CT scans may be done with or without contrast. Contrast refers to a substance taken by mouth and/or injected into an intravenous (IV) line that causes the particular organ or tissue under study to be seen more clearly. CT scans of the liver and biliary tract are used in the diagnosis of many diseases of the abdominal structures, particularly when another type of examination, such as X-rays, physical examination, or ultrasound, is not conclusive. Unfortunately, the presence of noise and artifacts in the edges and fine details of CT images limits the contrast resolution and makes the diagnostic procedure more difficult. This experimental study was conducted at the College of Medical Radiological Science, Sudan University of Science and Technology and Fidel Specialist Hospital. The study sample included 50 patients. The main objective of this research was to study an accurate liver segmentation method using a parallel computing algorithm, and to segment the liver and adjacent organs using image processing techniques. The main segmentation technique used in this study was the watershed transform. The scope of image processing and analysis applied to medical applications is to improve the quality of the acquired image and extract quantitative information from medical image data in an efficient and accurate way. The results of this technique agreed with the results of Jarritt et al. (2010), Kratchwil et al. (2010), Jover et al. (2011), Yomamoto et al. (1996), Cai et al. (1999), and Saudha and Jayashree (2010), who used different segmentation filtering based on the methods of enhancing the computed tomography images. Anther
Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong
2010-10-01
Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
2009-01-01
At the 19th Annual Conference on Parallel Computational Fluid Dynamics held in Antalya, Turkey, in May 2007, the most recent developments and implementations of large-scale and grid computing were presented. This book, comprised of the invited and selected papers of this conference, details those advances, which are of particular interest to CFD and CFD-related communities. It also offers the results related to applications of various scientific and engineering problems involving flows and flow-related topics. Intended for CFD researchers and graduate students, this book is a state-of-the-art presentation of the relevant methodology and implementation techniques of large-scale computing.
Iterative algorithms for large sparse linear systems on parallel computers
Adams, L. M.
1982-01-01
Algorithms are developed for assembling in parallel the sparse systems of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
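As an illustration of a linear stationary iterative method that parallelizes naturally, the sketch below uses Jacobi iteration on a sparse system. Jacobi is chosen here as a familiar representative of the class; the abstract does not say which stationary methods were used, and the function names and the 1-D Poisson test matrix are assumptions of this example.

```python
def jacobi(rows, b, iterations=200):
    """Jacobi iteration for A x = b, with A stored sparsely.

    rows[i] is a dict {column: value} holding the nonzeros of row i.
    Every new x[i] depends only on the previous iterate, so the loop over i
    could be split across processors without changing the result.
    """
    x = [0.0] * len(b)
    for _ in range(iterations):
        x = [
            (b[i] - sum(v * x[j] for j, v in rows[i].items() if j != i)) / rows[i][i]
            for i in range(len(b))
        ]
    return x


def poisson_1d(n):
    """Sparse rows of the 1-D finite-difference Laplacian (2 on the diagonal, -1 beside it)."""
    rows = []
    for i in range(n):
        row = {i: 2.0}
        if i > 0:
            row[i - 1] = -1.0
        if i < n - 1:
            row[i + 1] = -1.0
        rows.append(row)
    return rows
```

For example, `jacobi(poisson_1d(5), [1.0, 0.0, 0.0, 0.0, 1.0])` converges toward the all-ones vector, since that vector satisfies the system exactly.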
Quasi-monochromatic parallel radiography utilizing a computed radiography system
International Nuclear Information System (INIS)
Sato, E.; Hayasi, Y.; Germer, R.; Tanaka, E.; Mori, H.; Kawai, T.; Ichimaru, T.; Sato, S.; Takayama, K.; Ido, H.
2004-01-01
A fundamental study on quasi-monochromatic parallel radiography using a polycapillary plate and a copper-target X-ray tube is described. The X-ray generator consists of a negative high-voltage power supply, a filament (hot cathode) power supply, and an X-ray tube. The negative high voltage is applied to the cathode electrode, and the anode electrode is connected to the ground. In this experiment, the tube voltage was regulated from 12 to 25 kV, and the tube current was regulated within 3.0 mA by the filament temperature. The exposure time was controlled in order to obtain optimum X-ray intensity, and the maximum focal spot dimensions were approximately 2 mm × 1.5 mm. The polycapillary plate was J5022-21 (Hamamatsu Photonics Inc.), and the plate thickness was 1.0 mm. The outer, effective, and hole diameters were 87 mm, 77 mm, and 25 μm, respectively. Quasi-monochromatic X-rays were produced using a 10 μm-thick copper filter, these rays were formed into parallel beams by the polycapillary, and the radiogram was taken using a computed radiography system utilizing imaging plates. In the measurement of image resolution, the resolution fell as the distance between the chart and the imaging plate increased when using a polycapillary. We could observe a 50 μm tungsten wire clearly, and fine blood vessels of approximately 100 μm were visible in angiography.
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels ranging from the formulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tends to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, the object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present the basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show how to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show how to apply these methods i...
A class of parallel algorithms for computation of the manipulator inertia matrix
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Parallel In Situ Indexing for Data-intensive Computing
Energy Technology Data Exchange (ETDEWEB)
Kim, Jinoh; Abbasi, Hasan; Chacon, Luis; Docan, Ciprian; Klasky, Scott; Liu, Qing; Podhorszki, Norbert; Shoshani, Arie; Wu, Kesheng
2011-09-09
As computing power increases exponentially, vast amounts of data are created by many scientific research activities. However, the bandwidth for storing the data to disks and reading the data from disks has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing a relatively small fraction of data records. As data sets increase in size, more and more analysts need to use selective data access, which makes indexing even more important for improving data access. The challenge is that most implementations of indexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software, which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt the in situ processing technology to generate the indexes, thus removing the need to read data from disks and making it possible to build indexes in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve the data access time by up to 200 times depending on the fraction of data selected, and that using an in situ data processing system can effectively reduce the time needed to create the indexes, by up to 10 times with our in situ technique when using identical parallel settings.
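To make the idea of a bitmap index concrete, here is a minimal sketch of a binned bitmap index with a range query. It does not use the FastBit API and omits compression entirely; the function names, the fixed-width binning, and the use of Python integers as bitmaps are assumptions chosen for brevity.

```python
from collections import defaultdict


def build_bitmap_index(values, bin_width):
    """Map each bin number to an int whose bit r is set when record r falls in that bin."""
    index = defaultdict(int)
    for record, value in enumerate(values):
        index[int(value // bin_width)] |= 1 << record
    return dict(index)


def query_range(index, values, bin_width, low, high):
    """Return the record ids with low <= value < high.

    Bitmaps of all overlapping bins are OR-ed to get candidates; candidates are
    then checked against the raw values because edge bins may only partly
    overlap the requested range.
    """
    first, last = int(low // bin_width), int(high // bin_width)
    candidates = 0
    for bin_id in range(first, last + 1):
        candidates |= index.get(bin_id, 0)
    hits, record = [], 0
    while candidates:
        if candidates & 1 and low <= values[record] < high:
            hits.append(record)
        candidates >>= 1
        record += 1
    return hits
```

The query touches only the bitmaps of the bins that overlap the range, which is why such indexes pay off most when a small fraction of the records is selected.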
Parallel multiphysics algorithms and software for computational nuclear engineering
International Nuclear Information System (INIS)
Gaston, D; Hansen, G; Kadioglu, S; Knoll, D A; Newman, C; Park, H; Permann, C; Taitano, W
2009-01-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provides the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications will be presented: PRONGHORN, our multiphysics gas cooled reactor simulation tool, and BISON, our multiphysics, multiscale fuel performance simulation tool.
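The "Jacobian-free" part of JFNK rests on the fact that a Krylov solver only needs Jacobian-vector products, which can be approximated by differencing the residual. The sketch below shows that one step in isolation. It is not MOOSE code: the `residual` function is a hypothetical two-equation system, and the fixed step `eps` is a simplification (practical codes usually scale it with the sizes of `u` and `v`).

```python
import math


def residual(u):
    """Hypothetical nonlinear residual F(u), used for illustration only."""
    x, y = u
    return [x * x + y - 3.0, x + math.sin(y)]


def jacobian_vector_product(F, u, v, eps=1e-7):
    """Approximate J(u) v by a forward difference, without ever forming J."""
    base = F(u)
    shifted = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(s - b) / eps for s, b in zip(shifted, base)]
```

For this residual the exact Jacobian is [[2x, 1], [1, cos y]], so the approximation can be checked directly against the analytic product.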
I - Template Metaprogramming for Massively Parallel Scientific Computing - Expression Templates
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels ranging from the formulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tends to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, the object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present the basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show how to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show how to apply these methods i...
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB
Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.
2017-01-01
Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
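The reason Jacobian evaluation counts as embarrassingly parallel is that each finite-difference column depends only on the residual function and one perturbed input. The sketch below shows that structure in Python rather than MATLAB's spmd construct, which is what the study used; the function names and the three-equation test system are assumptions of this example.

```python
from concurrent.futures import ThreadPoolExecutor


def jacobian_column(F, u, j, eps=1e-7):
    """Column j of the Jacobian by forward difference: (F(u + eps*e_j) - F(u)) / eps."""
    base = F(u)
    shifted = list(u)
    shifted[j] += eps
    return [(s - b) / eps for s, b in zip(F(shifted), base)]


def parallel_jacobian(F, u, workers=4):
    """Compute all columns concurrently and assemble J as a list of rows."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        columns = list(pool.map(lambda j: jacobian_column(F, u, j), range(len(u))))
    return [list(row) for row in zip(*columns)]  # transpose columns into rows


def example_system(u):
    """Hypothetical test function: F(x, y, z) = (x*y, y*z, x + z**2)."""
    x, y, z = u
    return [x * y, y * z, x + z * z]
```

In CPython, threads do not speed up CPU-bound pure-Python work because of the global interpreter lock, so this shows the decomposition rather than a performance gain; a process pool is the usual alternative when the residual is expensive.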
Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R; Ratterman, Joseph D; Smith, Brian E
2014-11-11
Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.
Energy Technology Data Exchange (ETDEWEB)
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving's meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia's "tiling" dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
Applications of the Aurora parallel Prolog system to computational molecular biology
Energy Technology Data Exchange (ETDEWEB)
Lusk, E.L.; Overbeek, R. [Argonne National Lab., IL (United States); Mudambi, S. [Knox Coll., Galesburg, IL (United States); Szeredi, P. [IQSOFT, Budapest (Hungary)
1993-09-01
We describe an investigation into the use of the Aurora parallel Prolog system in two applications within the area of computational molecular biology. The computational requirements were large, due to the nature of the applications, and the computations were carried out on a scalable parallel computer, the BBN "Butterfly" TC-2000. Results include both a demonstration that logic programming can be effective in the context of demanding applications on large-scale parallel machines, and some insights into parallel programming in Prolog.
International Nuclear Information System (INIS)
Rajagopalan, S.; Jethra, A.; Khare, A.N.; Ghodgaonkar, M.D.; Srivenkateshan, R.; Menon, S.V.G.
1990-01-01
Issues relating to implementing iterative procedures for the numerical solution of elliptic partial differential equations on a distributed parallel computing system are discussed. Preliminary investigations show that a speed-up of about 3.85 is achievable on a four-transputer pipeline network. (author). 2 figs., 3 appendixes., 7 refs
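To put the reported speed-up in context, parallel efficiency is conventionally defined as speed-up divided by the number of processors. The short calculation below applies that standard definition to the figures quoted in the abstract; the function name is an assumption of this example.

```python
def parallel_efficiency(speedup, processors):
    """Efficiency = speed-up / processor count (1.0 would be ideal linear speed-up)."""
    return speedup / processors


# Figures quoted in the abstract: speed-up of about 3.85 on four transputers.
efficiency = parallel_efficiency(3.85, 4)
print(f"{efficiency:.4f}")  # prints 0.9625
```

An efficiency of roughly 96% means that very little of the four-processor capacity is lost to communication or idle time in this configuration.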
Lee, Jae H; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T; Seo, Youngho
2014-11-01
The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximization (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to-program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing the MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains, with the goal of eventually making it usable in a clinical setting.
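For readers unfamiliar with MLEM, the multiplicative update it repeats can be sketched in a few lines. This is a plain, serial Python illustration of the standard update, not the Spark/GraphX implementation from the paper; the dense list-of-rows system matrix and the function name are assumptions made to keep the example small.

```python
def mlem(A, y, iterations=50):
    """MLEM for measurements y modeled as Poisson with mean A x.

    A is a list of rows (one per measurement); y holds the measured counts.
    Each pass forward-projects the current image, compares it with the data,
    back-projects the ratio, and rescales the image.
    """
    n_meas, n_vox = len(A), len(A[0])
    sensitivity = [sum(A[i][j] for i in range(n_meas)) for j in range(n_vox)]
    x = [1.0] * n_vox  # strictly positive starting image
    for _ in range(iterations):
        forward = [sum(A[i][j] * x[j] for j in range(n_vox)) for i in range(n_meas)]
        ratio = [y[i] / forward[i] if forward[i] > 0 else 0.0 for i in range(n_meas)]
        back = [sum(A[i][j] * ratio[i] for i in range(n_meas)) for j in range(n_vox)]
        x = [x[j] * back[j] / sensitivity[j] for j in range(n_vox)]
    return x
```

The forward and back projections are sparse matrix-vector products, which is the part a platform such as GraphX distributes across workers.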
Parallel MOPEX: Computing Mosaics of Large-Area Spitzer Surveys on a Cluster Computer
Directory of Open Access Journals (Sweden)
Joseph C. Jacob
2007-01-01
The Spitzer Science Center's MOPEX software is a part of the Spitzer Space Telescope's operational pipeline that enables detection of cosmic ray collisions with the detector array, masking of the corrupted pixels due to these collisions, subsequent mosaicking of image fields, and extraction of point sources to create catalogs of celestial objects. This paper reports on our experiences in parallelizing the parts of MOPEX related to cosmic ray rejection and mosaicking on a 1,024-processor cluster computer at NASA's Jet Propulsion Laboratory. The architecture and performance of the new Parallel MOPEX software are described. This work was done in order to rapidly mosaic the IRAC shallow survey data, covering a region of the sky observed with one of Spitzer's infrared instruments for the study of galaxy clusters, large-scale structure, and brown dwarfs.
Two applications of parallel processing in power system computation
Energy Technology Data Exchange (ETDEWEB)
Lemaitre, C.; Thomas, B. [Electricite de France, 92 - Clamart (France). Research and Development Div.
1996-12-31
Performance improvements achieved in two power system software modules through the use of parallel processing techniques are discussed. The first software module, EVARISTE, outputs a voltage stability indicator for various power system situations. The second module, MEXICO, assesses power system reliability and operating costs by simulating a large number of contingencies for generation and transmission equipment. Both software modules are well-suited to coarse-grain parallel processing. The first module was parallelized on a distributed-memory machine and the second on a shared-memory machine. A description of the parallelization process used in these two cases is presented, and details on the performance levels achieved are discussed, including aspects of programming, parameter selection, and machine characteristics. (author) 7 refs.
The specification of Stampi, a message passing library for distributed parallel computing
International Nuclear Information System (INIS)
Imamura, Toshiyuki; Takemiya, Hiroshi; Koide, Hiroshi
2000-03-01
At CCSE, Center for Promotion of Computational Science and Engineering, a new message passing library for heterogeneous and distributed parallel computing, called Stampi, has been developed. Stampi enables us to communicate between any combination of parallel computers as well as workstations. Currently, a Stampi system is constructed from the Stampi library and Stampi/Java. It provides functions to connect a Stampi application with not only those on COMPACS, COMplex Parallel Computer System, but also applets which work on WWW browsers. This report summarizes the specifications of Stampi and details the development of its system. (author)
Jiang, Yuning; Kang, Jinfeng; Wang, Xinan
2017-03-24
Resistive switching memory (RRAM) is considered one of the most promising devices for parallel computing solutions that may overcome the von Neumann bottleneck of today's electronic systems. However, the existing RRAM-based parallel computing architectures suffer from practical problems such as device variations and extra computing circuits. In this work, we propose a novel parallel computing architecture for pattern recognition by implementing k-nearest neighbor classification on metal-oxide RRAM crossbar arrays. Metal-oxide RRAM with gradual RESET behaviors is chosen as both the storage and computing components. The proposed architecture is tested on the MNIST database. High speed (~100 ns per example) and high recognition accuracy (97.05%) are obtained. The influence of several non-ideal device properties is also discussed, and it turns out that the proposed architecture shows great tolerance to device variations. This work paves a new way to achieve RRAM-based parallel computing hardware systems with high performance.
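The classification rule that the crossbar hardware evaluates can be stated in software as follows. This is only a reference sketch of k-nearest neighbor voting, not a model of the RRAM devices; the squared Euclidean distance, the default `k`, and the function name are assumptions of this example.

```python
from collections import Counter


def knn_classify(train, labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    Distances to different training vectors are independent of one another,
    which is what makes the method attractive for parallel hardware.
    """
    order = sorted(
        range(len(train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], query)),
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

With two well-separated clusters of training points, a query near either cluster receives that cluster's label.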
High Performance Input/Output for Parallel Computer Systems
Ligon, W. B.
1996-01-01
The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, the Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX-like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications from levels 1, 2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.
Advances in Parallel Computing and Databases for Digital Pathology in Cancer Research
2016-11-13
Through a number of efforts such as the National Strategic Computing Initiative (NSCI), there has been a push to merge these “Big Data” and...potential for applications in cancer research. II. PARALLEL COMPUTING AND BIG DATA Parallel computing is the ability to take a given program and split it...in-database analytics, hardware accelerated DBMS operations, and data models that more closely resemble the type of data being stored. For example
Event Based Simulator for Parallel Computing over the Wide Area Network for Real Time Visualization
Sundararajan, Elankovan; Harwood, Aaron; Kotagiri, Ramamohanarao; Satria Prabuwono, Anton
As the computational requirements of applications in computational science continue to grow tremendously, the use of computational resources distributed across the Wide Area Network (WAN) becomes advantageous. However, not all applications can be executed over the WAN due to communication overhead that can drastically slow down the computation. In this paper, we introduce an event based simulator to investigate the performance of parallel algorithms executed over the WAN. The event based simulator, known as SIMPAR (SIMulator for PARallel computation), simulates the actual computations and communications involved in parallel computation over the WAN using time stamps. Visualization of real time applications requires a steady stream of processed data. Hence, SIMPAR may prove to be a valuable tool to investigate types of applications and computing resource requirements to provide an uninterrupted flow of processed data for real time visualization purposes. The results obtained from the simulation show concurrence with the expected performance using the L-BSP model.
Locating hardware faults in a data communications network of a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-01-12
Locating hardware faults in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute nodes as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.
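The decision rule in this abstract can be sketched as a small simulation. This is an illustration under stated assumptions, not the patented procedure: the tree is a plain dictionary, each test tree is taken to be the full subtree below its root, and `suite_passes` is a hypothetical stand-in for a real test suite that simply checks a known set of broken links.

```python
def subtree_links(tree, root):
    """All (parent, child) links in the subtree rooted at `root`."""
    links, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            links.append((node, child))
            stack.append(child)
    return links


def suite_passes(tree, root, broken_links):
    """Stand-in test suite: passes when no broken link lies in the subtree."""
    return not any(link in broken_links for link in subtree_links(tree, root))


def nodes_with_defective_child_link(tree, root, broken_links):
    """Flag a parent when its own test tree fails but every child test tree passes."""
    suspects = []
    for parent in [root] + [child for _, child in subtree_links(tree, root)]:
        children = tree.get(parent, [])
        if not children:
            continue  # leaves have no outgoing links to test
        parent_fails = not suite_passes(tree, parent, broken_links)
        children_pass = all(suite_passes(tree, c, broken_links) for c in children)
        if parent_fails and children_pass:
            suspects.append(parent)
    return suspects
```

The rule works because a failure that disappears when the parent is removed from the test tree can only come from a link that touches the parent.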
Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM
de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl
2002-03-01
We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 10^5-10^6) in reasonable CPU times. The performance of our implementation of the parallel code on an Origin 2000 supercomputer is presented and discussed. It exhibits very good speedup behavior and low load imbalance. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.
Computation of watersheds based on parallel graph algorithms
Meijster, A.; Roerdink, J.B.T.M.; Maragos, P; Schafer, RW; Butt, MA
1996-01-01
In this paper the implementation of a parallel watershed algorithm is described. The algorithm has been implemented on a Cray J932, which is a shared memory architecture with 32 processors. The watershed transform has generally been considered to be inherently sequential, but recently a few research
Weeks, Cindy Lou
1986-01-01
Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
Byun, Chansup; Guruswamy, Guru P.; Kutler, Paul (Technical Monitor)
1994-01-01
In recent years significant advances have been made for parallel computers in both hardware and software. Now parallel computers have become viable tools in computational mechanics. Many application codes developed on conventional computers have been modified to benefit from parallel computers. Significant speedups in some areas have been achieved by parallel computations. For single-discipline use of both fluid dynamics and structural dynamics, computations have been made on wing-body configurations using parallel computers. However, only a limited amount of work has been completed in combining these two disciplines for multidisciplinary applications. The prime reason is the increased level of complication associated with a multidisciplinary approach. In this work, procedures to compute aeroelasticity on parallel computers using direct coupling of fluid and structural equations will be investigated for wing-body configurations. The parallel computer selected for computations is an Intel iPSC/860 computer which is a distributed-memory, multiple-instruction, multiple data (MIMD) computer with 128 processors. In this study, the computational efficiency issues of parallel integration of both fluid and structural equations will be investigated in detail. The fluid and structural domains will be modeled using finite-difference and finite-element approaches, respectively. Results from the parallel computer will be compared with those from the conventional computers using a single processor. This study will provide an efficient computational tool for the aeroelastic analysis of wing-body structures on MIMD type parallel computers.
Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che
2014-01-16
To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high
2014-01-01
Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel
Yim, Keun Soo
This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of
A parallel finite-difference method for computational aerodynamics
International Nuclear Information System (INIS)
Swisshelm, J.M.
1989-01-01
A finite-difference scheme for solving complex three-dimensional aerodynamic flow on parallel-processing supercomputers is presented. The method consists of a basic flow solver with multigrid convergence acceleration, embedded grid refinements, and a zonal equation scheme. Multitasking and vectorization have been incorporated into the algorithm. Results obtained include multiprocessed flow simulations from the Cray X-MP and Cray-2. Speedups as high as 3.3 for the two-dimensional case and 3.5 for segments of the three-dimensional case have been achieved on the Cray-2. The entire solver attained a factor of 2.7 improvement over its unitasked version on the Cray-2. The performance of the parallel algorithm on each machine is analyzed. 14 refs
Directory of Open Access Journals (Sweden)
WenBo Xiao
In this article, we introduced an artificial neural network (ANN) based computational model to predict the output power of three types of photovoltaic cells: mono-crystalline (mono-), multi-crystalline (multi-), and amorphous (amor-) crystalline. The prediction results are very close to the experimental data, and were also influenced by the number of hidden neurons. The order of the solar generation power output influenced by the external conditions from smallest to biggest is: multi-, mono-, and amor- crystalline silicon cells. In addition, the dependence of power prediction on the number of hidden neurons was studied. For multi- and amorphous crystalline cells, three or four hidden layer units resulted in a high correlation coefficient and low MSEs. For the mono-crystalline cell, the best results were achieved with 8 hidden layer units.
Paging memory from random access memory to backing storage in a parallel computer
Archer, Charles J; Blocksome, Michael A; Inglett, Todd A; Ratterman, Joseph D; Smith, Brian E
2013-05-21
Paging memory from random access memory (`RAM`) to backing storage in a parallel computer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node.
National Research Council Canada - National Science Library
Hisley, Dixie
1999-01-01
.... The goals of this report are: (1) to investigate the performance of message passing and loop level parallelization techniques, as they were implemented in the computational fluid dynamics (CFD...
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].
Furuta, Takuya; Sato, Tatsuhiko
2015-01-01
Time-consuming Monte Carlo dose calculation has become feasible owing to the development of computer technology. However, the recent development is due to the emergence of multi-core high-performance computers. Therefore, parallel computing has become a key to achieving good performance in software programs. The Monte Carlo simulation code PHITS contains two parallel computing functions: distributed-memory parallelization using message passing interface (MPI) protocols and shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose between the two functions according to their needs. This paper explains the two functions along with their advantages and disadvantages. Some test applications are also provided to show their performance on a typical multi-core high-performance workstation.
Massively Parallel Computing at Sandia and Its Application to National Defense
National Research Council Canada - National Science Library
Dosanjh, Sudip
1991-01-01
Two years ago, researchers at Sandia National Laboratories showed that a massively parallel computer with 1024 processors could solve scientific problems more than 1000 times faster than a single processor...
Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda E [Cambridge, MA; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN
2012-04-17
Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Line-plane broadcasting in a data communications network of a parallel computer
Archer, Charles J.; Berg, Jeremy E.; Blocksome, Michael A.; Smith, Brian E.
2010-06-08
Methods, apparatus, and products are disclosed for line-plane broadcasting in a data communications network of a parallel computer, the parallel computer comprising a plurality of compute nodes connected together through the network, the network optimized for point to point data communications and characterized by at least a first dimension, a second dimension, and a third dimension, that include: initiating, by a broadcasting compute node, a broadcast operation, including sending a message to all of the compute nodes along an axis of the first dimension for the network; sending, by each compute node along the axis of the first dimension, the message to all of the compute nodes along an axis of the second dimension for the network; and sending, by each compute node along the axis of the second dimension, the message to all of the compute nodes along an axis of the third dimension for the network.
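The three phases described above (line, then plane, then volume) can be illustrated with a small Python simulation. It is a sketch under assumptions: the network is modelled as a plain X by Y by Z grid of coordinates, each "send" is recorded as a logical (source, destination) pair, and nothing is implied about how a message physically travels along an axis. The function name and return values are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int, int]


def line_plane_broadcast(dims: Node, root: Node) -> Tuple[Set[Node], List[Tuple[Node, Node]]]:
    """Simulate the three-phase broadcast; return the nodes reached and the logical sends."""
    size_x, size_y, size_z = dims
    _, root_y, root_z = root
    sends: List[Tuple[Node, Node]] = []

    # Phase 1: the root sends to every node on its line along the first dimension.
    line = [(x, root_y, root_z) for x in range(size_x)]
    sends += [(root, node) for node in line if node != root]

    # Phase 2: every node on that line sends along the second dimension, filling a plane.
    plane: List[Node] = []
    for (x, y, z) in line:
        for y2 in range(size_y):
            plane.append((x, y2, z))
            if y2 != y:
                sends.append(((x, y, z), (x, y2, z)))

    # Phase 3: every node on the plane sends along the third dimension, filling the volume.
    reached: Set[Node] = set()
    for (x, y, z) in plane:
        for z2 in range(size_z):
            reached.add((x, y, z2))
            if z2 != z:
                sends.append(((x, y, z), (x, y, z2)))

    return reached, sends


reached, sends = line_plane_broadcast(dims=(4, 3, 2), root=(1, 2, 0))
```

In this model every node other than the root is the destination of exactly one send, so a grid of X·Y·Z nodes needs X·Y·Z - 1 logical sends in total.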
Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda A [Rochester, MN; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN
2012-01-10
Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Applications of parallel computer architectures to the real-time simulation of nuclear power systems
International Nuclear Information System (INIS)
Doster, J.M.; Sills, E.D.
1988-01-01
In this paper the authors report on efforts to utilize parallel computer architectures for the thermal-hydraulic simulation of nuclear power systems and current research efforts toward the development of advanced reactor operator aids and control systems based on this new technology. Many aspects of reactor thermal-hydraulic calculations are inherently parallel, and the computationally intensive portions of these calculations can be effectively implemented on modern computers. Timing studies indicate faster-than-real-time, high-fidelity physics models can be developed when the computational algorithms are designed to take advantage of the computer's architecture. These capabilities allow for the development of novel control systems and advanced reactor operator aids. Coupled with an integral real-time data acquisition system, evolving parallel computer architectures can provide operators and control room designers improved control and protection capabilities. Research efforts are currently under way in this area.
Automatic Choice of Scheduling Heuristics for Parallel/Distributed Computing
Directory of Open Access Journals (Sweden)
Clayton S. Ferner
1999-01-01
Task mapping and scheduling are two very difficult problems that must be addressed when a sequential program is transformed into a parallel program. Since these problems are NP‐hard, compiler writers have opted to concentrate their efforts on optimizations that produce immediate gains in performance. As a result, current parallelizing compilers either use very simple methods to deal with task scheduling or they simply ignore it altogether. Unfortunately, the programmer does not have this luxury. The burden of repartitioning or rescheduling, should the compiler produce inefficient parallel code, lies entirely with the programmer. We were able to create an algorithm (called a metaheuristic), which automatically chooses a scheduling heuristic for each input program. The metaheuristic produces better schedules in general than the heuristics upon which it is based. This technique was tested on a suite of real scientific programs written in SISAL and simulated on four different network configurations. Averaged over all of the test cases, the metaheuristic out‐performed all eight underlying scheduling algorithms, beating the best one by 2%, 12%, 13%, and 3% on the four separate network configurations. It is able to do this, not always by picking the best heuristic, but rather by avoiding the heuristics when they would produce very poor schedules. For example, while the metaheuristic only picked the best algorithm about 50% of the time for the 100 Gbps Ethernet, its worst decision was only 49% away from optimal. In contrast, the best of the eight scheduling algorithms was optimal 30% of the time, but its worst decision was 844% away from optimal.
Computer model of a reverberant and parallel circuit coupling
Kalil, Camila de Andrade; de Castro, Maria Clícia Stelling; Cortez, Célia Martins
2017-11-01
The objective of the present study was to deepen the knowledge about the functioning of the neural circuits by implementing a signal transmission model using the Graph Theory in a small network of neurons composed of an interconnected reverberant and parallel circuit, in order to investigate the processing of the signals in each of them and the effects on the output of the network. For this, a program was developed in C language and simulations were done using neurophysiological data obtained in the literature.
Low latency, high bandwidth data communications between compute nodes in a parallel computer
Blocksome, Michael A
2013-07-02
Methods, systems, and products are disclosed for data transfers between nodes in a parallel computer that include: receiving, by an origin DMA on an origin node, a buffer identifier for a buffer containing data for transfer to a target node; sending, by the origin DMA to the target node, a RTS message; transferring, by the origin DMA, a data portion to the target node using a memory FIFO operation that specifies one end of the buffer from which to begin transferring the data; receiving, by the origin DMA, an acknowledgement of the RTS message from the target node; and transferring, by the origin DMA in response to receiving the acknowledgement, any remaining data portion to the target node using a direct put operation that specifies the other end of the buffer from which to begin transferring the data, including initiating the direct put operation without invoking an origin processing core.
Self-pacing direct memory access data transfer operations for compute nodes in a parallel computer
Blocksome, Michael A
2015-02-17
Methods, apparatus, and products are disclosed for self-pacing DMA data transfer operations for nodes in a parallel computer that include: transferring, by an origin DMA on an origin node, a RTS message to a target node, the RTS message specifying a message on the origin node for transfer to the target node; receiving, in an origin injection FIFO for the origin DMA from a target DMA on the target node in response to transferring the RTS message, a target RGET descriptor followed by a DMA transfer operation descriptor, the DMA descriptor for transmitting a message portion to the target node, the target RGET descriptor specifying an origin RGET descriptor on the origin node that specifies an additional DMA descriptor for transmitting an additional message portion to the target node; processing, by the origin DMA, the target RGET descriptor; and processing, by the origin DMA, the DMA transfer operation descriptor.
Non-parallel processing: Gendered attrition in academic computer science
Cohoon, Joanne Louise Mcgrath
2000-10-01
This dissertation addresses the issue of disproportionate female attrition from computer science as an instance of gender segregation in higher education. By adopting a theoretical framework from organizational sociology, it demonstrates that the characteristics and processes of computer science departments strongly influence female retention. The empirical data identifies conditions under which women are retained in the computer science major at comparable rates to men. The research for this dissertation began with interviews of students, faculty, and chairpersons from five computer science departments. These exploratory interviews led to a survey of faculty and chairpersons at computer science and biology departments in Virginia. The data from these surveys are used in comparisons of the computer science and biology disciplines, and for statistical analyses that identify which departmental characteristics promote equal attrition for male and female undergraduates in computer science. This three-pronged methodological approach of interviews, discipline comparisons, and statistical analyses shows that departmental variation in gendered attrition rates can be explained largely by access to opportunity, relative numbers, and other characteristics of the learning environment. Using these concepts, this research identifies nine factors that affect the differential attrition of women from CS departments. These factors are: (1) The gender composition of enrolled students and faculty; (2) Faculty turnover; (3) Institutional support for the department; (4) Preferential attitudes toward female students; (5) Mentoring and supervising by faculty; (6) The local job market, starting salaries, and competitiveness of graduates; (7) Emphasis on teaching; and (8) Joint efforts for student success. This work contributes to our understanding of the gender segregation process in higher education. In addition, it contributes information that can lead to effective solutions for an
Valasek, Lukas; Glasa, Jan
2017-12-01
Current fire simulation systems are capable of exploiting the high-performance computing (HPC) platforms available and of modeling fires efficiently in parallel. In this paper, the efficiency of a corridor fire simulation on an HPC cluster is discussed. The parallel MPI version of Fire Dynamics Simulator is used to test the efficiency of selected strategies for allocating the cluster's computational resources when a greater number of computational cores is used. Simulation results indicate that if the number of cores used is not equal to a multiple of the total number of cluster node cores, there are allocation strategies which provide more efficient calculations.
Parallel Computational Intelligence-Based Multi-Camera Surveillance System
Directory of Open Access Journals (Sweden)
Sergio Orts-Escolano
2014-04-01
In this work, we present a multi-camera surveillance system based on the use of self-organizing neural networks to represent events on video. The system processes several tasks in parallel using GPUs (graphics processing units). It addresses multiple vision tasks at various levels, such as segmentation, representation or characterization, analysis and monitoring of the movement. These features allow the construction of a robust representation of the environment and the interpretation of the behavior of mobile agents in the scene. It is also necessary to integrate the vision module into a global system that operates in a complex environment by receiving images from multiple acquisition devices at video frequency. To offer relevant information to higher level systems and to monitor and make decisions in real time, it must meet a set of requirements, such as time constraints, high availability, robustness, high processing speed and re-configurability. We have built a system able to represent and analyze the motion in video acquired by a multi-camera network and to process multi-source data in parallel on a multi-GPU architecture.
Chen, Ying; Balla, Apuroop; Rayford II, Cleveland E; Zhou, Weihua; Fang, Jian; Cong, Linlin
2010-01-01
Digital tomosynthesis is a novel technology that has been developed for various clinical applications. Parallel imaging configuration is utilised in a few tomosynthesis imaging areas such as digital chest tomosynthesis. Recently, parallel imaging configuration for breast tomosynthesis began to appear too. In this paper, we present the investigation on computational analysis of impulse response characterisation as the start point of our important research efforts to optimise the parallel imaging configurations. Results suggest that impulse response computational analysis is an effective method to compare and optimise imaging configurations.
A homotopy method for solving Riccati equations on a shared memory parallel computer
International Nuclear Information System (INIS)
Zigic, D.; Watson, L.T.; Collins, E.G. Jr.; Davis, L.D.
1993-01-01
Although there are numerous algorithms for solving Riccati equations, there still remains a need for algorithms which can operate efficiently on large problems and on parallel machines. This paper gives a new homotopy-based algorithm for solving Riccati equations on a shared memory parallel computer. The central part of the algorithm is the computation of the kernel of the Jacobian matrix, which is essential for the corrector iterations along the homotopy zero curve. Using a Schur decomposition the tensor product structure of various matrices can be efficiently exploited. The algorithm allows for efficient parallelization on shared memory machines
Managing internode data communications for an uninitialized process in a parallel computer
Archer, Charles J; Blocksome, Michael A; Miller, Douglas R; Parker, Jeffrey J; Ratterman, Joseph D; Smith, Brian E
2014-05-20
A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory.
Blocksome, Michael A.; Mamidala, Amith R.
2015-07-07
Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
Comparison of Three Different Parallel Computation Methods for a Two-Dimensional Dam-Break Model
Directory of Open Access Journals (Sweden)
Shanghong Zhang
2017-01-01
Three parallel methods (OpenMP, MPI, and OpenACC) are evaluated for the computation of a two-dimensional dam-break model using the explicit finite volume method. A dam-break event in the Pangtoupao flood storage area in China is selected as a case study to demonstrate the key technologies for implementing parallel computation. The subsequent acceleration of the methods is also evaluated. The simulation results show that the OpenMP and MPI parallel methods achieve a speedup factor of 9.8× and 5.1×, respectively, on a 32-core computer, whereas the OpenACC parallel method achieves a speedup factor of 20.7× on an NVIDIA Tesla K20c graphics card. The results show that if the memory required by the dam-break simulation does not exceed the memory capacity of a single computer, the OpenMP parallel method is a good choice. Moreover, if GPU acceleration is used, the acceleration of the OpenACC parallel method is the best. Finally, the MPI parallel method is suitable for a model that requires little data exchange and large-scale calculation. This study compares the efficiency and methodology of accelerating algorithms for a dam-break model and can also be used as a reference for selecting the best acceleration method for a similar hydrodynamic model.
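To put the reported CPU speedups in context, the standard parallel efficiency (speedup divided by the number of workers) can be computed as below. This is a small worked calculation, not part of the study: it assumes that all 32 cores were in use when the 9.8× and 5.1× figures were measured, which the abstract does not state explicitly, and the function and variable names are illustrative only. The OpenACC figure is left out because a GPU speedup has no comparable core count.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Standard definition: fraction of ideal linear speedup actually achieved."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


CORES = 32  # assumption: every core of the 32-core computer was used

openmp_efficiency = parallel_efficiency(9.8, CORES)
mpi_efficiency = parallel_efficiency(5.1, CORES)

print(f"OpenMP efficiency: {openmp_efficiency:.3f}")
print(f"MPI efficiency:    {mpi_efficiency:.3f}")
```

Under that assumption, both CPU methods run well below ideal linear scaling, with OpenMP achieving roughly twice the efficiency of MPI on this machine.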
Massively Parallel Computing at the Large Hadron Collider up to the HL-LHC
AUTHOR|(CDS)2080997; Halyo, Valerie
2015-01-01
As the Large Hadron Collider (LHC) continues its upward progression in energy and luminosity towards the planned High-Luminosity LHC (HL-LHC) in 2025, the challenges of the experiments in processing increasingly complex events will also continue to increase. Improvements in computing technologies and algorithms will be a key part of the advances necessary to meet this challenge. Parallel computing techniques, especially those using massively parallel computing (MPC), promise to be a significant part of this effort. In these proceedings, we discuss these algorithms in the specific context of a particularly important problem: the reconstruction of charged particle tracks in the trigger algorithms in an experiment, in which high computing performance is critical for executing the track reconstruction in the available time. We discuss some areas where parallel computing has already shown benefits to the LHC experiments, and also demonstrate how an MPC-based trigger at the CMS experiment could not only improve perf...
A Parallel and Distributed Surrogate Model Implementation for Computational Steering
Butnaru, Daniel
2012-06-01
Understanding the influence of multiple parameters in a complex simulation setting is a difficult task. In the ideal case, the scientist can freely steer such a simulation and is immediately presented with the results for a certain configuration of the input parameters. Such an exploration process is however not possible if the simulation is computationally too expensive. For these cases we present in this paper a scalable computational steering approach utilizing a fast surrogate model as substitute for the time-consuming simulation. The surrogate model we propose is based on the sparse grid technique, and we identify the main computational tasks associated with its evaluation and its extension. We further show how distributed data management combined with the specific use of accelerators allows us to approximate and deliver simulation results to a high-resolution visualization system in real-time. This significantly enhances the steering workflow and facilitates the interactive exploration of large datasets. © 2012 IEEE.
Bostic, Susan W.; Fulton, Robert E.
1987-01-01
Eigenvalue analysis of complex structures is a computationally intensive task which can benefit significantly from new and impending parallel computers. This study reports on a parallel computer implementation of the Lanczos method for free vibration analysis. The approach used here subdivides the major Lanczos calculation tasks into subtasks and introduces parallelism down to the subtask levels such as matrix decomposition and forward/backward substitution. The method was implemented on a commercial parallel computer and results were obtained for a long flexible space structure. While parallel computing efficiency is problem and computer dependent, the efficiency for the Lanczos method was good for a moderate number of processors for the test problem. The greatest reduction in time was realized for the decomposition of the stiffness matrix, a calculation which took 70 percent of the time in the sequential program and which took 25 percent of the time on eight processors. For a sample calculation of the twenty lowest frequencies of a 486-degree-of-freedom problem, the total sequential computing time was reduced by almost a factor of ten using 16 processors.
Monte Carlo calculations on a parallel computer using MORSE-C.G
International Nuclear Information System (INIS)
Wood, J.
1995-01-01
The general purpose particle transport Monte Carlo code, MORSE-C.G., is implemented on a parallel computing transputer-based system having MIMD architecture. Example problems are solved which are representative of the three principal types of problem that can be solved by the original serial code, namely, fixed source, eigenvalue (k-eff) and time-dependent. The results from the parallelized version of the code are compared in tables with the serial code run on a mainframe serial computer, and with an independent, deterministic transport code. The performance of the parallel computer as the number of processors is varied is shown graphically. For the parallel strategy used, the loss of efficiency as the number of processors is increased is investigated. (author)
International Nuclear Information System (INIS)
Laurent, C.; Chassery, J.M.; Peyrin, F.; Girerd, C.
1996-01-01
This paper deals with the parallel implementations of reconstruction methods in 3D tomography. 3D tomography requires voluminous data and long computation times. Parallel computing, on MIMD computers, seems to be a good approach to manage this problem. In this study, we present the different steps of the parallelization on an abstract parallel computer. Depending on the method, we use two main approaches to parallelize the algorithms: the local approach and the global approach. Experimental results on MIMD computers are presented. Two 3D images reconstructed from realistic data are shown.
Dynamic remapping of parallel computations with varying resource demands
Nicol, David M.; Saltz, Joel H.
1988-01-01
The issue of deciding when to invoke a global load remapping mechanism is studied. Such a decision policy must effectively weigh the costs of remapping against the performance benefits, and should be general enough to apply automatically to a wide range of computations. The authors propose a general mapping decision heuristic, then study its effectiveness and its anticipated behavior on two very different models of load evolution. Assuming only that the remapping cost is known, this policy dynamically minimizes system degradation (including the cost of remapping) for each computation step. This policy is quite simple, choosing to remap when the first local minimum in the degradation function is detected. Simulations show that the decision obtained provides significantly better performance than that achieved by never remapping. The authors also observe that the average intermapping frequency is quite close to the optimal fixed remapping frequency.
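One plausible reading of the "first local minimum" rule in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the per-step imbalance costs are assumed to be measurable, the degradation function is assumed to be the running average of accumulated imbalance plus the one-off remapping cost, and the function name is hypothetical.

```python
def first_local_minimum_step(imbalance_costs, remap_cost):
    """Return the 1-based step at which average degradation first bottoms out.

    imbalance_costs[i] is the time lost to load imbalance at step i + 1
    since the last remap (an assumed, measurable input).
    Returns None if the average is still falling when the data ends.
    """
    lost = 0.0
    previous_average = None
    for step, cost in enumerate(imbalance_costs, start=1):
        lost += cost
        # Average degradation per step if we were to remap right now,
        # with the remapping cost amortized over the steps taken so far.
        average = (lost + remap_cost) / step
        if previous_average is not None and average > previous_average:
            return step - 1  # the previous step was the first local minimum
        previous_average = average
    return None


# Imbalance grows linearly; remapping costs 4 time units.
remap_at = first_local_minimum_step([0, 1, 2, 3, 4, 5], remap_cost=4)
print(remap_at)
```

With no imbalance at all, the amortized average only ever falls, so the sketch never asks for a remap.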
Solving the Stokes problem on a massively parallel computer
DEFF Research Database (Denmark)
Axelsson, Owe; Barker, Vincent A.; Neytcheva, Maya
2001-01-01
We describe a numerical procedure for solving the stationary two‐dimensional Stokes problem based on piecewise linear finite element approximations for both velocity and pressure, a regularization technique for stability, and a defect‐correction technique for improving accuracy. Eliminating the velocity unknowns from the algebraic system yields a symmetric positive semidefinite system for pressure which is solved by an inner‐outer iteration. The outer iterations consist of the unpreconditioned conjugate gradient method. The inner iterations, each of which corresponds to solving an elliptic boundary value problem for each velocity component, are solved by the conjugate gradient method with a preconditioning based on the algebraic multi‐level iteration (AMLI) technique. The velocity is found from the computed pressure. The method is optimal in the sense that the computational work...
Parallel computer processing and modeling: applications for the ICU
Baxter, Grant; Pranger, L. Alex; Draghic, Nicole; Sims, Nathaniel M.; Wiesmann, William P.
2003-07-01
Current patient monitoring procedures in hospital intensive care units (ICUs) generate vast quantities of medical data, much of which is considered extemporaneous and not evaluated. Although sophisticated monitors to analyze individual types of patient data are routinely used in the hospital setting, this equipment lacks high-order signal analysis tools for detecting long-term trends and correlations between different signals within a patient data set. Without the ability to continuously analyze disjoint sets of patient data, it is difficult to detect slow-forming complications. As a result, the early onset of conditions such as pneumonia or sepsis may not be apparent until the advanced stages. We report here on the development of a distributed software architecture test bed and software medical models to analyze both asynchronous and continuous patient data in real time. Hardware and software have been developed to support a multi-node distributed computer cluster capable of amassing data from multiple patient monitors and projecting near- and long-term outcomes based upon the application of physiologic models to the incoming patient data stream. One computer acts as a central coordinating node; additional computers accommodate processing needs. A simple, non-clinical model for sepsis detection was implemented on the system for demonstration purposes. This work shows exceptional promise as a highly effective means to rapidly predict and thereby mitigate the effect of nosocomial infections.
Parallel Genetic Algorithms with Dynamic Topology using Cluster Computing
Directory of Open Access Journals (Sweden)
ADAR, N.
2016-08-01
A parallel genetic algorithm (PGA) conducts a distributed meta-heuristic search by employing genetic algorithms on more than one subpopulation simultaneously. PGAs migrate a number of individuals between subpopulations over generations. The layout that facilitates the interactions of the subpopulations is called the topology. Static migration topologies have been widely incorporated into PGAs. In this article, a PGA with a dynamic migration topology (D-PGA) is proposed. D-PGA generates a new migration topology in every epoch based on the average fitness values of the subpopulations. The D-PGA has been tested against ring and fully connected migration topologies in a Beowulf cluster. The D-PGA has outperformed the ring migration topology with comparable communication cost and has provided competitive or better results than a fully connected migration topology with significantly lower communication cost. PGA convergence behaviors have been analyzed in terms of the diversities within and between subpopulations. Conventional diversity can be considered as the diversity within a subpopulation. A new concept of permeability has been introduced to measure the diversity between subpopulations. It is shown that the success of the proposed D-PGA can be attributed to maintaining a high level of permeability while preserving diversity within subpopulations.
Fazanaro, Filipe I.; Soriano, Diogo C.; Suyama, Ricardo; Madrid, Marconi K.; Oliveira, José Raimundo de; Muñoz, Ignacio Bravo; Attux, Romis
2016-08-01
The characterization of nonlinear dynamical systems and their attractors in terms of invariant measures, basins of attractions and the structure of their vector fields usually outlines a task strongly related to the underlying computational cost. In this work, the practical aspects related to the use of parallel computing - especially the use of Graphics Processing Units (GPUs) and of the Compute Unified Device Architecture (CUDA) - are reviewed and discussed in the context of nonlinear dynamical systems characterization. In this work such characterization is performed by obtaining both local and global Lyapunov exponents for the classical forced Duffing oscillator. The local divergence measure was employed in the computation of the Lagrangian Coherent Structures (LCSs), revealing the general organization of the flow according to the obtained separatrices, while the global Lyapunov exponents were used to characterize the attractors obtained under one or more bifurcation parameters. These simulation sets also illustrate the required computation time and speedup gains provided by different parallel computing strategies, justifying the employment and the relevance of GPUs and CUDA in such an extensive numerical approach. Finally, more than simply providing an overview supported by a representative set of simulations, this work also aims to be a unified introduction to the use of the mentioned parallel computing tools in the context of nonlinear dynamical systems, providing codes and examples to be executed in MATLAB and using the CUDA environment, something that is usually fragmented in different scientific communities and restricted to specialists on parallel computing strategies.
Implementation of QR up- and downdating on a massively parallel computer
DEFF Research Database (Denmark)
Bendtsen, Claus; Hansen, Per Christian; Madsen, Kaj
1995-01-01
We describe an implementation of QR up- and downdating on a massively parallel computer (the Connection Machine CM-200) and show that the algorithm maps well onto the computer. In particular, we show how the use of corrected semi-normal equations for downdating can be efficiently implemented. We also illustrate the use of our algorithms in a new LP algorithm.
Triggering and data analysis for the VIRGO experiment on the APEmille parallel computer
Beccaria, M.; Cella, G.; Ciampa, A.; Cuoco, E.; Curci, G.; Vicerè, A.
1997-03-01
We give a brief summary of some possible strategies for real-time data analysis in the framework of the VIRGO experiment. We discuss in particular the utility and the feasibility of their implementation on parallel computers, focusing on the APEmille SIMD machine. We evaluate the computational power required in two cases: the monitoring of known pulsars and the detection of binary coalescences.
Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones
Directory of Open Access Journals (Sweden)
Norman Matloff
2016-07-01
The growth in the use of computationally intensive statistical procedures, especially with big data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPUs, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations.
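The abstract above does not spell out the conversion. One common way to obtain a statistically equivalent embarrassingly parallel estimator, assumed here purely for illustration, is chunk averaging: split the data into chunks, apply the original estimator to each chunk independently, and average the per-chunk estimates. The function name and the strided split are hypothetical choices, and the data are assumed to be independent and identically distributed.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median


def chunk_average(data, estimator, n_chunks):
    """Average an estimator over n_chunks disjoint, near-equal chunks."""
    if not 1 <= n_chunks <= len(data):
        raise ValueError("n_chunks must be between 1 and len(data)")
    # Strided split: every observation lands in exactly one chunk.
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    # Chunks need no communication with each other, so each call could run
    # on a separate worker. Threads only illustrate the structure here;
    # pure-Python estimators gain little from threads in CPython.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        estimates = list(pool.map(estimator, chunks))
    return mean(estimates)


sample = list(range(1, 101))
print(chunk_average(sample, median, n_chunks=4))
```

The chunked estimate need not equal the full-data estimate exactly; the claim of equivalence is statistical, not numerical.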
Locating abnormalities in brain blood vessels using parallel computing architecture.
Adeshina, A M; Hashim, R; Khalid, N E A; Abidin, S Z Z
2012-09-01
CT and MRI scans are widely used in medical diagnosis procedures, but they only produce 2-D images. However, the human anatomical structure, the abnormalities, tumors, tissues and organs are in 3-D. 2-D images from these devices are difficult to interpret because they only show cross-sectional views of the human structure. Consequently, doctors must rely on their expert experience to interpret the possible location, size or shape of the abnormalities, even for large datasets with an enormous number of slices. Previously, the concept of reconstructing 2-D images into 3-D was introduced. However, such a reconstruction model requires high-performance computation and may be either time-consuming or costly. Furthermore, detecting the internal features of human anatomical structure, such as the imaging of the blood vessels, is still an open topic in the computer-aided diagnosis of disorders and pathologies. This paper proposes a volume visualization framework using Compute Unified Device Architecture (CUDA), augmenting the widely proven ray casting technique, which offers superior image quality but is slow. Considering the rapid development of technology in the medical community, our framework is implemented in the Microsoft .NET environment for easy interoperability with other emerging revolutionary tools. The framework was evaluated with brain datasets from the Department of Surgery, University of North Carolina, United States, containing around 109 MRA datasets. At a reasonably low cost, our framework achieves immediate reconstruction and clear mappings of the internal features of the human brain, reliable enough for instantaneous location of possible blockages in the brain blood vessels.
Accelerating Neuroimage Registration through Parallel Computation of Similarity Metric.
Directory of Open Access Journals (Sweden)
Yun-Gang Luo
Neuroimage registration is crucial for brain morphometric analysis and treatment efficacy evaluation. However, existing advanced registration algorithms such as FLIRT and ANTs are not efficient enough for clinical use. In this paper, a GPU implementation of FLIRT with the correlation ratio (CR) as the similarity metric and a GPU-accelerated correlation coefficient (CC) calculation for the symmetric diffeomorphic registration of ANTs have been developed. The comparison with their corresponding original tools shows that our accelerated algorithms can greatly outperform the original algorithms in terms of computational efficiency. This paper demonstrates the great potential of applying these registration tools in clinical applications.
National Research Council Canada - National Science Library
Friman, Henrik
2006-01-01
... (extended from Leavitt, 1965). This text identifies aspects of network-based effectiveness that can benefit from a better understanding of leadership and management development of people, procedures, technology, and organizations...
Weighted Local Active Pixel Pattern (WLAPP) for Face Recognition in Parallel Computation Environment
Directory of Open Access Journals (Sweden)
Gundavarapu Mallikarjuna Rao
2013-10-01
The availability of multi-core technology has opened a new computational era. Researchers are keen to explore the potential of state-of-the-art machines for breaking the barrier imposed by serial computation. Face recognition is a challenging application in any computational environment. The main difficulty of traditional face recognition algorithms is their lack of scalability. In this paper, Weighted Local Active Pixel Pattern (WLAPP), a new scalable face recognition algorithm suitable for parallel environments, is proposed. Local Active Pixel Pattern (LAPP) is found to be simple and computationally inexpensive compared to Local Binary Patterns (LBP). WLAPP is developed based on the concept of LAPP. The experimentation is performed on the FG-Net Aging Database with deliberately introduced 20% distortion, and the results are encouraging. Keywords: active pixels, face recognition, Local Binary Pattern (LBP), Local Active Pixel Pattern (LAPP), pattern computing, parallel workers, template, weight computation.
Parallel computation with molecular-motor-propelled agents in nanofabricated networks.
Nicolau, Dan V; Lard, Mercy; Korten, Till; van Delft, Falco C M J M; Persson, Malin; Bengtsson, Elina; Månsson, Alf; Diez, Stefan; Linke, Heiner; Nicolau, Dan V
2016-03-08
The combinatorial nature of many important mathematical problems, including nondeterministic-polynomial-time (NP)-complete problems, places a severe limitation on the problem size that can be solved with conventional, sequentially operating electronic computers. There have been significant efforts in conceiving parallel-computation approaches in the past, for example: DNA computation, quantum computation, and microfluidics-based computation. However, these approaches have not proven, so far, to be scalable and practical from a fabrication and operational perspective. Here, we report the foundations of an alternative parallel-computation system in which a given combinatorial problem is encoded into a graphical, modular network that is embedded in a nanofabricated planar device. Exploring the network in a parallel fashion using a large number of independent, molecular-motor-propelled agents then solves the mathematical problem. This approach uses orders of magnitude less energy than conventional computers, thus addressing issues related to power consumption and heat dissipation. We provide a proof-of-concept demonstration of such a device by solving, in a parallel fashion, the small instance {2, 5, 9} of the subset sum problem, which is a benchmark NP-complete problem. Finally, we discuss the technical advances necessary to make our system scalable with presently available technology.
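The exits that agents can reach for the instance {2, 5, 9} can be enumerated in software. The sketch below assumes a layout in which each number contributes one row of junctions where an agent either continues straight (the number is left out) or shifts sideways by that number of positions (the number is included). That layout and the function name are illustrative assumptions, not details given in the abstract.

```python
def reachable_exits(numbers):
    """Exit positions reachable when each number adds one row of junctions."""
    positions = {0}  # every agent enters at position 0
    for n in numbers:
        # Each agent either keeps its position (skip n) or moves by n (take n),
        # so the set of occupied positions at most doubles per row.
        positions = positions | {p + n for p in positions}
    return sorted(positions)


exits = reachable_exits([2, 5, 9])
print(exits)
```

Each reachable exit corresponds to the sum of at least one subset. A serial program visits the rows one after another, whereas the device described above explores all paths at once with many independent agents.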
Teaching Scientific Computing: A Model-Centered Approach to Pipeline and Parallel Programming with C
Directory of Open Access Journals (Sweden)
Vladimiras Dolgopolovas
2015-01-01
The aim of this study is to present an approach to the introduction into pipeline and parallel computing, using a model of the multiphase queueing system. Pipeline computing, including software pipelines, is among the key concepts in modern computing and electronics engineering. Modern computer science and engineering education requires a comprehensive curriculum, so the introduction to pipeline and parallel computing is an essential topic to be included in the curriculum. At the same time, the topic is among the most motivating tasks due to the comprehensive multidisciplinary and technical requirements. To enhance the educational process, the paper proposes a novel model-centered framework and develops the relevant learning objects. It allows implementing an educational platform of constructivist learning process, thus enabling learners’ experimentation with the provided programming models, obtaining learners’ competences of the modern scientific research and computational thinking, and capturing the relevant technical knowledge. It also provides an integral platform that allows a simultaneous and comparative introduction to pipelining and parallel computing. The programming language C for developing programming models and the message passing interface (MPI) and OpenMP parallelization tools have been chosen for implementation.
10th International Workshop on Parallel Tools for High Performance Computing
Gracia, José; Hilbrich, Tobias; Knüpfer, Andreas; Resch, Michael; Nagel, Wolfgang
2017-01-01
This book presents the proceedings of the 10th International Parallel Tools Workshop, held October 4-5, 2016 in Stuttgart, Germany – a forum to discuss the latest advances in parallel tools. High-performance computing plays an increasingly important role for numerical simulation and modelling in academic and industrial research. At the same time, using large-scale parallel systems efficiently is becoming more difficult. A number of tools addressing parallel program development and analysis have emerged from the high-performance computing community over the last decade, and what may have started as a collection of small helper scripts has now matured into production-grade frameworks. Powerful user interfaces and an extensive body of documentation allow easy usage by non-specialists.
Directory of Open Access Journals (Sweden)
Chao Dong
2012-01-01
Benefiting from the kernel trick and the sparse property, the relevance vector machine (RVM) can obtain a sparse solution with generalization ability equivalent to that of the support vector machine. The sparse property requires much less time in prediction, giving RVM potential for classifying large-scale hyperspectral images. However, RVM is not widely used because of its slow training procedure. To solve the problem, the classification of the hyperspectral image using RVM is accelerated by the parallel computing technique in this paper. The parallelization is revealed from the aspects of the multiclass strategy, the ensemble of multiple weak classifiers, and the matrix operations. The parallel RVMs are implemented using the C language plus the parallel functions of the linear algebra packages and the message passing interface library. The proposed methods are evaluated by the AVIRIS Indian Pines data set on the Beowulf cluster and the multicore platforms. It shows that the parallel RVMs markedly accelerate the training procedure.
Programming a massively parallel, computation universal system: static behavior
Energy Technology Data Exchange (ETDEWEB)
Lapedes, A.; Farber, R.
1986-01-01
In previous work by the authors, the "optimum finding" properties of Hopfield neural nets were applied to the nets themselves to create a "neural compiler." This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called "hidden neurons"), and thus deals with nets capable of universal computation. 22 refs.
Distributed and cloud computing from parallel processing to the Internet of Things
Hwang, Kai; Fox, Geoffrey C
2012-01-01
Distributed and Cloud Computing, named a 2012 Outstanding Academic Title by the American Library Association's Choice publication, explains how to create high-performance, scalable, reliable systems, exposing the design principles, architecture, and innovative applications of parallel, distributed, and cloud computing systems. Starting with an overview of modern distributed models, the book provides comprehensive coverage of distributed and cloud computing, including: Facilitating management, debugging, migration, and disaster recovery through virtualization Clustered systems for resear
Massively parallel computation of PARASOL code on the Origin 3800 system
Energy Technology Data Exchange (ETDEWEB)
Hosokawa, Masanari [Research Organization for Information Science and Technology, Tokai, Ibaraki (Japan); Takizuka, Tomonori [Japan Atomic Energy Research Inst., Naka, Ibaraki (Japan). Naka Fusion Research Establishment
2001-10-01
The divertor particle simulation code named PARASOL simulates open-field plasmas between divertor walls self-consistently by using an electrostatic PIC method and a binary collision Monte Carlo model. The PARASOL code, parallelized with MPI-1.1 for scalar parallel computers, ran on an Intel Paragon XP/S system. An SGI Origin 3800 system was newly installed (May 2001), and the parallel programming was improved at this switchover. As a result of the high-performance new hardware and this improvement, PARASOL runs about 60 times faster with the same number of processors. (author)
The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit
Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Jark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete;
1998-01-01
Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architectural independent parallel applications is presented.
Modeling the Fracture of Ice Sheets on Parallel Computers
Energy Technology Data Exchange (ETDEWEB)
Waisman, Haim [Columbia Univ., New York, NY (United States); Tuminaro, Ray [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2013-10-10
The objective of this project was to investigate the complex fracture of ice and understand its role within larger ice sheet simulations and global climate change. This objective was achieved by developing novel physics-based models for ice, novel numerical tools to enable the modeling of the physics, and by collaboration with ice community experts. At the present time, ice fracture is not explicitly considered within ice sheet models due in part to large computational costs associated with the accurate modeling of this complex phenomenon. However, fracture not only plays an extremely important role in regional behavior but also influences ice dynamics over much larger zones in ways that are currently not well understood. To this end, our research findings through this project offer significant advancement to the field and close a large gap of knowledge in understanding and modeling the fracture of ice sheets in the polar regions. Thus, we believe that our objective has been achieved and our research accomplishments are significant. This is corroborated through a set of published papers, posters and presentations at technical conferences in the field. In particular, significant progress has been made in the mechanics of ice, fracture of ice sheets and ice shelves in polar regions, and sophisticated numerical methods that enable the solution of the physics in an efficient way.
Parallel Fully-Implicit Computation of Magnetohydrodynamics Acceleration Experiments
Wan, Tian; Candler, Graham
2010-05-01
A three-dimensional MHD solver is described in the paper. The solver simulates reacting flows with nonequilibrium between translational-rotational, vibrational and electron translational modes. The conservation equations are discretized with implicit time marching and the second-order modified Steger-Warming scheme, and the resulting linear system is solved iteratively with the Newton-Krylov-Schwarz method as implemented in the PETSc package. The results of convergence tests are plotted, which show good scalability and convergence roughly twice as fast as the DPLR method. Then five test runs are conducted simulating the experiments done at the NASA Ames MHD channel, and the calculated pressures, temperatures, electrical conductivity, back EMF, load factors and flow accelerations are shown to agree with the experimental data. Our computation shows that the electrical conductivity distribution is not uniform in the powered section of the MHD channel, and that it is important to include Joule heating in order to calculate the correct conductivity and the MHD acceleration.
SPATIOTEMPORAL DOMAIN DECOMPOSITION FOR MASSIVE PARALLEL COMPUTATION OF SPACE-TIME KERNEL DENSITY
Directory of Open Access Journals (Sweden)
A. Hohl
2015-07-01
Accelerated processing capabilities are deemed critical when conducting analysis on spatiotemporal datasets of increasing size, diversity and availability. High-performance parallel computing offers the capacity to solve computationally demanding problems in a limited timeframe, but likewise poses the challenge of preventing processing inefficiency due to workload imbalance between computing resources. Therefore, when designing new algorithms capable of implementing parallel strategies, careful spatiotemporal domain decomposition is necessary to account for heterogeneity in the data. In this study, we perform octree-based adaptive decomposition of the spatiotemporal domain for parallel computation of space-time kernel density. In order to avoid edge effects near subdomain boundaries, we establish spatiotemporal buffers to include adjacent data points that are within the spatial and temporal kernel bandwidths. Then, we quantify the computational intensity of each subdomain to balance workloads among processors. We illustrate the benefits of our methodology using a space-time epidemiological dataset of Dengue fever, an infectious vector-borne disease that poses a severe threat to communities in tropical climates. Our parallel implementation of kernel density reaches substantial speedup compared to sequential processing, and achieves high levels of workload balance among processors due to great accuracy in quantifying computational intensity. Our approach is portable to other space-time analytical tests.
Spatiotemporal Domain Decomposition for Massive Parallel Computation of Space-Time Kernel Density
Hohl, A.; Delmelle, E. M.; Tang, W.
2015-07-01
Accelerated processing capabilities are deemed critical when conducting analysis on spatiotemporal datasets of increasing size, diversity and availability. High-performance parallel computing offers the capacity to solve computationally demanding problems in a limited timeframe, but likewise poses the challenge of preventing processing inefficiency due to workload imbalance between computing resources. Therefore, when designing new algorithms capable of implementing parallel strategies, careful spatiotemporal domain decomposition is necessary to account for heterogeneity in the data. In this study, we perform octree-based adaptive decomposition of the spatiotemporal domain for parallel computation of space-time kernel density. In order to avoid edge effects near subdomain boundaries, we establish spatiotemporal buffers to include adjacent data points that are within the spatial and temporal kernel bandwidths. Then, we quantify the computational intensity of each subdomain to balance workloads among processors. We illustrate the benefits of our methodology using a space-time epidemiological dataset of Dengue fever, an infectious vector-borne disease that poses a severe threat to communities in tropical climates. Our parallel implementation of kernel density reaches substantial speedup compared to sequential processing, and achieves high levels of workload balance among processors due to great accuracy in quantifying computational intensity. Our approach is portable to other space-time analytical tests.
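The adaptive decomposition described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `octree_decompose`, the `max_points` threshold and the point-count stopping rule are assumptions chosen for illustration, and the spatiotemporal buffers and workload estimation are left out.

```python
def octree_decompose(points, bounds, max_points=4, depth=0, max_depth=8):
    """Recursively split a space-time box into octants (hypothetical helper).

    points: list of (x, y, t) tuples lying inside `bounds`
    bounds: ((xmin, xmax), (ymin, ymax), (tmin, tmax))
    Returns a list of (bounds, points) leaves; empty octants are skipped.
    """
    # Stop splitting when the box is sparse enough or the tree is deep enough.
    if len(points) <= max_points or depth >= max_depth:
        return [(bounds, points)]

    mids = [(lo + hi) / 2.0 for lo, hi in bounds]

    # Assign each point to one of the 8 octants via a 3-bit key.
    children = {}
    for p in points:
        key = tuple(int(p[d] >= mids[d]) for d in range(3))
        children.setdefault(key, []).append(p)

    leaves = []
    for key, pts in children.items():
        child_bounds = tuple(
            (mids[d], bounds[d][1]) if key[d] else (bounds[d][0], mids[d])
            for d in range(3)
        )
        leaves.extend(
            octree_decompose(pts, child_bounds, max_points, depth + 1, max_depth)
        )
    return leaves
```

Dense regions end up in many small leaves and sparse regions in a few large ones, which is what makes the leaves a reasonable unit for distributing work among processors.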
Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming
Dorband, John E.; Aburdene, Maurice F.
2002-01-01
Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.
8th International Workshop on Parallel Tools for High Performance Computing
Gracia, José; Knüpfer, Andreas; Resch, Michael; Nagel, Wolfgang
2015-01-01
Numerical simulation and modelling using High Performance Computing has evolved into an established technique in academic and industrial research. At the same time, the High Performance Computing infrastructure is becoming ever more complex. For instance, most of the current top systems around the world use thousands of nodes in which classical CPUs are combined with accelerator cards in order to enhance their compute power and energy efficiency. This complexity can only be mastered with adequate development and optimization tools. Key topics addressed by these tools include parallelization on heterogeneous systems, performance optimization for CPUs and accelerators, debugging of increasingly complex scientific applications, and optimization of energy usage in the spirit of green IT. This book represents the proceedings of the 8th International Parallel Tools Workshop, a forum for discussing the latest advancements in parallel tools, held October 1-2, 2014 in Stuttgart, Germany.
Fluid/Structure Interaction Studies of Aircraft Using High Fidelity Equations on Parallel Computers
Guruswamy, Guru; VanDalsem, William (Technical Monitor)
1994-01-01
Aeroelasticity, which involves strong coupling of fluids, structures and controls, is an important element in designing an aircraft. Computational aeroelasticity using low fidelity methods, such as the linear aerodynamic flow equations coupled with the modal structural equations, is well advanced. Though these low fidelity approaches are computationally less intensive, they are not adequate for the analysis of modern aircraft such as High Speed Civil Transport (HSCT) and Advanced Subsonic Transport (AST), which can experience complex flow/structure interactions. HSCT can experience vortex induced aeroelastic oscillations whereas AST can experience transonic buffet associated structural oscillations. Both aircraft may experience a dip in the flutter speed in the transonic regime. For accurate aeroelastic computations in these complex fluid/structure interaction situations, high fidelity equations such as the Navier-Stokes equations for fluids and finite elements for structures are needed. Computations using these high fidelity equations require large computational resources both in memory and speed. Current conventional supercomputers have reached their limitations both in memory and speed. As a result, parallel computers have evolved to overcome the limitations of conventional computers. This paper will address the transition that is taking place in computational aeroelasticity from conventional computers to parallel computers. The paper will address special techniques needed to take advantage of the architecture of new parallel computers. Results will be illustrated from computations made on iPSC/860 and IBM SP2 computers by using the ENSAERO code, which directly couples the Euler/Navier-Stokes flow equations with high resolution finite-element structural equations.
Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian
2018-01-01
We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs.
Heterogeneous Hardware Parallelism Review of the IN2P3 2016 Computing School
Lafage, Vincent
2017-11-01
Parallel and hybrid Monte Carlo computation. The Monte Carlo method is the main workhorse for computation of particle physics observables. This paper provides an overview of various HPC technologies that can be used today: multicore (OpenMP, HPX), manycore (OpenCL). The rewrite of a twenty-year-old Fortran 77 Monte Carlo code will illustrate the various programming paradigms in use beyond language implementation. The problem of parallel random number generation will be addressed. We will give a short report on the one-week school dedicated to these recent approaches, which took place at École Polytechnique in May 2016.
Herrera, I.; Herrera, G. S.
2015-12-01
Most geophysical systems are macroscopic physical systems. The behavior prediction of such systems is carried out by means of computational models whose basic models are partial differential equations (PDEs) [1]. Due to the enormous size of the discretized version of such PDEs it is necessary to apply highly parallelized supercomputers. For them, at present, the most efficient software is based on non-overlapping domain decomposition methods (DDM). However, a limiting feature of the present state-of-the-art techniques is due to the kind of discretizations used in them. Recently, I. Herrera and co-workers, using 'non-overlapping discretizations', have produced the DVS-Software, which overcomes this limitation [2]. The DVS-Software can be applied to a great variety of geophysical problems and achieves very high parallel efficiencies (90%, or so [3]). It is therefore very suitable for effectively applying the most advanced parallel supercomputers available at present. In a parallel talk, in this AGU Fall Meeting, Graciela Herrera Z. will present how this software is being applied to advance MOD-FLOW. Key Words: Parallel Software for Geophysics, High Performance Computing, HPC, Parallel Computing, Domain Decomposition Methods (DDM). REFERENCES: [1] Herrera, I. and Pinder, G. F. "Mathematical Modelling in Science and Engineering: An Axiomatic Approach". John Wiley, 243 p., 2012. [2] Herrera, I., de la Cruz, L. M. and Rosas-Medina, A. "Non Overlapping Discretization Methods for Partial Differential Equations". NUMER METH PART D E, 30: 1427-1454, 2014, DOI 10.1002/num.21852. (Open source) [3] Herrera, I. and Contreras, I. "An Innovative Tool for Effectively Applying Highly Parallelized Software To Problems of Elasticity". Geofísica Internacional, 2015 (In press).
Application of parallel computing to seismic damage process simulation of an arch dam
International Nuclear Information System (INIS)
Zhong Hong; Lin Gao; Li Jianbo
2010-01-01
Simulating the damage process of a high arch dam subjected to strong earthquake shocks is important for evaluating its performance and seismic safety, considering the catastrophic effect of dam failure. However, such numerical simulation requires substantial computational capacity. Conventional serial computing falls short of that, and parallel computing is a promising solution to this problem. The parallel finite element code PDPAD was developed for the damage prediction of arch dams, using a damage model that accounts for the heterogeneity of concrete. Written in Fortran, the code uses a master/slave programming mode, the domain decomposition method for allocation of tasks, MPI (Message Passing Interface) for communication and solvers from the AZTEC library for the solution of large-scale equations. A speedup test showed that the performance of PDPAD was quite satisfactory. The code was employed to study the damage process of an arch dam under construction on a 4-node PC cluster, with more than one million degrees of freedom considered. The obtained damage mode was quite similar to that of the shaking table test, indicating that the proposed procedure and the parallel code PDPAD have good potential for simulating the seismic damage mode of arch dams. With the rapidly growing need for massive computation emerging from engineering problems, parallel computing will find more and more applications in pertinent areas.
Evaluation of a simplified version of KENO V.a on a parallel processors computer
International Nuclear Information System (INIS)
Ugolini, D.; Petrie, L.M.; Dodds, H.L. Jr.
1987-01-01
KENO V.a is a widely used Monte Carlo criticality code developed by Oak Ridge National Laboratory for use primarily on large single processor mainframe computers. The code can be very costly to use if a large number of histories is required because the histories are performed sequentially via the single processor. With the advent of parallel processor computers, it should be possible to reduce computing costs (i.e., computer run time) by performing the histories in parallel. The purpose of this work is to implement KENO V.a on a parallel processor computer, specifically the NCUBE, and then to compare results obtained on the NCUBE (i.e., accuracy and computing time) with results obtained on a large mainframe computer (IBM 3033). The NCUBE is a message-passing machine with no shared memory. A simplified version of KENO V.a was developed for this study because the standard version was too large to compile on the NCUBE. In addition, a special 1-group cross-section library, reduced from the standard 16-group Hansen Roach Library, was also used. The sample problem used in this study was an 18-cm-diam sphere of 235U at 0.05 atom/(b·cm).
International Nuclear Information System (INIS)
Heo, Jaeseok; Kim, Kyung Doo
2015-01-01
Highlights: • We developed an interface between an engineering simulation code and statistical analysis software. • Multiple packages of the sensitivity analysis, uncertainty quantification, and parameter estimation algorithms are implemented in the framework. • Parallel computing algorithms are also implemented in the framework to solve multiple computational problems simultaneously. - Abstract: This paper introduces a statistical data analysis toolkit, PAPIRUS, designed to perform the model calibration, uncertainty propagation, Chi-square linearity test, and sensitivity analysis for both linear and nonlinear problems. The PAPIRUS was developed by implementing multiple packages of methodologies, and building an interface between an engineering simulation code and the statistical analysis algorithms. A parallel computing framework is implemented in the PAPIRUS with multiple computing resources and proper communications between the server and the clients of each processor. It was shown that even though a large amount of data is considered for the engineering calculation, the distributions of the model parameters and the calculation results can be quantified accurately with significant reductions in computational effort. A general description about the PAPIRUS with a graphical user interface is presented in Section 2. Sections 2.1–2.5 present the methodologies of data assimilation, uncertainty propagation, Chi-square linearity test, and sensitivity analysis implemented in the toolkit with some results obtained by each module of the software. Parallel computing algorithms adopted in the framework to solve multiple computational problems simultaneously are also summarized in the paper
With enhanced data availability, distributed watershed models for large areas with high spatial and temporal resolution are increasingly used to understand water budgets and examine effects of human activities and climate change/variability on water resources. Developing parallel computing software...
Speed up of MCACE, a Monte Carlo code for evaluation of shielding safety, by parallel computer, (3)
International Nuclear Information System (INIS)
Takano, Makoto; Masukawa, Fumihiro; Naito, Yoshitaka; Onodera, Emi; Imawaka, Tsuneyuki; Yoda, Yoshihisa.
1993-07-01
The parallel computing of the MCACE code has been studied on two platforms: 1) the shared-memory vector-parallel computer Monte-4 and 2) several networked workstations. On the Monte-4, a disk file has been allocated to collect all results computed by 4 CPUs in parallel, with each CPU executing a copy of the MCACE code. On the networked workstations, two parallel models have been evaluated: 1) a host-node model and 2) the model used on the Monte-4, in which no parallelization software is employed, only standard FORTRAN. The measured computing times showed that a speedup of about 3 times was achieved by using 4 CPUs of the Monte-4. Further, with 4 workstations connected by network, the parallelized computation ran faster than on our scalar mainframe computer, FACOM M-780. (author)
Fast parallel algorithms that compute transitive closure of a fuzzy relation
Kreinovich, Vladik YA.
1993-01-01
The notion of a transitive closure of a fuzzy relation is very useful for clustering in pattern recognition, for fuzzy databases, etc. The original algorithm proposed by L. Zadeh (1971) requires computation time O(n^4), where n is the number of elements in the relation. In 1974, J. C. Dunn proposed an O(n^2) algorithm. Since we must compute n(n-1)/2 different values s(a, b) (a not equal to b) that represent the fuzzy relation, and we need at least one computational step to compute each of these values, we cannot compute all of them in less than O(n^2) steps. So, Dunn's algorithm is in this sense optimal. For small n, it is ok. However, for big n (e.g., for big databases), it is still a lot, so it would be desirable to decrease the computation time (this problem was formulated by J. Bezdek). Since this decrease cannot be done on a sequential computer, the only way to do it is to use a computer with several processors working in parallel. We show that on a parallel computer, transitive closure can be computed in time O((log_2 n)^2).
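As a hedged illustration of why parallel hardware helps here, the sketch below computes a max-min transitive closure by repeated squaring. It is a standard textbook construction rather than the algorithm of the paper, and the function names are hypothetical. Each entry of a max-min composition is independent of the others, so one squaring is the step that parallel processors could share, and roughly log_2(n) squarings suffice.

```python
def max_min_compose(a, b):
    """Max-min composition of two n x n fuzzy relations given as nested lists."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def fuzzy_transitive_closure(r):
    """Iterate T <- max(T, T o T) until nothing changes.

    After k rounds T accounts for all chains of length up to 2**k,
    so the loop stops after about log2(n) rounds plus one check.
    """
    n = len(r)
    t = [row[:] for row in r]
    while True:
        sq = max_min_compose(t, t)
        nxt = [[max(t[i][j], sq[i][j]) for j in range(n)] for i in range(n)]
        if nxt == t:
            return t
        t = nxt
```

For example, if s(a, b) = 0.8 and s(b, c) = 0.4 while s(a, c) = 0, the closure raises s(a, c) to min(0.8, 0.4) = 0.4.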
Parallel computing in cluster of GPU applied to a problem of nuclear engineering
International Nuclear Information System (INIS)
Moraes, Sergio Ricardo S.; Heimlich, Adino; Resende, Pedro
2013-01-01
Cluster computing has been widely used as a low cost alternative for parallel processing in scientific applications. With the use of the Message-Passing Interface (MPI) protocol, development became even more accessible and widespread in the scientific community. A more recent trend is the use of the Graphics Processing Unit (GPU), a powerful co-processor able to perform hundreds of instructions in parallel, reaching hundreds of times the processing capacity of a CPU. However, a standard PC does not allow, in general, more than two GPUs. Hence, this work proposes the development and evaluation of a hybrid low cost parallel approach to the solution of a typical nuclear engineering problem. The idea is to use cluster parallelism technology (MPI) together with GPU programming techniques (CUDA - Compute Unified Device Architecture) to simulate neutron transport through a slab using the Monte Carlo method. Programs using MPI and CUDA technologies were developed for a cluster comprising four quad-core computers with 2 GPUs each. Experiments applying different configurations, from 1 to 8 GPUs, have been performed and the results were compared with the sequential (non-parallel) version. A speedup of about 2,000 times has been observed when comparing the 8-GPU version with the sequential version. The results presented here are discussed and analyzed with the objective of outlining gains and possible limitations of the proposed approach. (author)
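The general idea of splitting independent particle histories across workers can be sketched in plain Python. Everything below is an assumption made for illustration, not the authors' MPI/CUDA code: a one-dimensional slab, normally incident neutrons, isotropic scattering in the direction cosine, and hypothetical function and parameter names. The workers run one after another here; on a cluster each call would belong to a separate MPI rank or GPU.

```python
import math
import random


def transmitted_count(histories, thickness, sigma_t, p_absorb, seed):
    """Count neutrons that cross the slab, for one worker's share of histories."""
    rng = random.Random(seed)  # independent stream per worker (assumed scheme)
    transmitted = 0
    for _ in range(histories):
        x, mu = 0.0, 1.0  # start on the left face, moving to the right
        while True:
            # Sample the free path from an exponential distribution.
            x += mu * (-math.log(1.0 - rng.random()) / sigma_t)
            if x >= thickness:
                transmitted += 1
                break
            if x < 0.0:
                break  # leaked back out of the left face
            if rng.random() < p_absorb:
                break  # absorbed at the collision site
            mu = rng.uniform(-1.0, 1.0)  # isotropic scatter (direction cosine)
    return transmitted


def simulate(workers, histories_per_worker, thickness, sigma_t, p_absorb, base_seed=0):
    """Combine per-worker tallies into a single transmission fraction."""
    total = sum(
        transmitted_count(histories_per_worker, thickness, sigma_t, p_absorb,
                          base_seed + rank)
        for rank in range(workers)
    )
    return total / float(workers * histories_per_worker)
```

Because histories do not interact, the only communication needed is the final sum of tallies, which is why this class of problem maps well onto clusters of GPUs. For a purely absorbing slab (`p_absorb=1.0`) the estimate should approach exp(-sigma_t * thickness).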
Directory of Open Access Journals (Sweden)
Xing Cai
2005-01-01
This article addresses the performance of scientific applications that use the Python programming language. First, we investigate several techniques for improving the computational efficiency of serial Python codes. Then, we discuss the basic programming techniques in Python for parallelizing serial scientific applications. It is shown that an efficient implementation of the array-related operations is essential for achieving good parallel performance, as for the serial case. Once the array-related operations are efficiently implemented, probably using a mixed-language implementation, good serial and parallel performance become achievable. This is confirmed by a set of numerical experiments. Python is also shown to be well suited for writing high-level parallel programs.
Effecting a broadcast with an allreduce operation on a parallel computer
Almasi, Gheorghe; Archer, Charles J.; Ratterman, Joseph D.; Smith, Brian E.
2010-11-02
A parallel computer comprises a plurality of compute nodes organized into at least one operational group for collective parallel operations. Each compute node is assigned a unique rank and is coupled for data communications through a global combining network. One compute node is assigned to be a logical root. A send buffer and a receive buffer are configured. Each element of a contribution of the logical root in the send buffer is contributed. One or more zeros corresponding to a size of the element are injected. An allreduce operation with a bitwise OR using the element and the injected zeros is performed. The result of the allreduce operation is determined and stored in each receive buffer.
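The reason this works is that x OR 0 = x: if only the root contributes real data and every other rank contributes zeros, the OR-reduction hands the root's data to everyone. The following Python sketch simulates that on a single process; the function name and the list-of-buffers representation are assumptions made for illustration, not the patented implementation or any real MPI API.

```python
from functools import reduce


def broadcast_via_allreduce(root, root_data, num_ranks):
    """Simulate a broadcast built from an allreduce with bitwise OR."""
    # Send buffers: the logical root contributes its data, all others zeros.
    send_buffers = [
        list(root_data) if rank == root else [0] * len(root_data)
        for rank in range(num_ranks)
    ]
    # Allreduce: element-wise bitwise OR across the contributions of all ranks.
    result = [reduce(lambda a, b: a | b, column) for column in zip(*send_buffers)]
    # Every rank stores the same result in its own receive buffer.
    return [list(result) for _ in range(num_ranks)]
```

Calling `broadcast_via_allreduce(2, [5, 0, 255], 4)` leaves `[5, 0, 255]` in the receive buffer of each of the four simulated ranks.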
Design And Implementation Of A Parallel Computer For Expert System Applications
Butler, Philip L.; Allen, John D.; Bouldin, Donald W.
1988-03-01
A parallel computer for high-speed execution of expert system programs has been designed and implemented at the Oak Ridge National Laboratory. Programs written in the popular OPS5 language for serial machines need not be modified by the programmer, since the compiler on this special-purpose machine automatically employs the parallelism inherent in the language. Tasks are automatically distributed to parallel rule processors which can evaluate OPS5 rules in parallel. Performance improvements of a factor of 10 over serial machines have already been demonstrated. Enhancements are under way to attain a performance improvement of 100 or more over serial machines for artificial intelligence applications requiring the evaluation of thousands of rules each recognize-act cycle. The initial hardware implementation of the parallel architecture consists of a host computer that broadcasts to 64 parallel rule processors over a transmit-only bus. The communication time is kept to a minimum by using direct-memory access and a memory-mapped addressing scheme that permit each of the parallel rule processors to receive the appropriate information simultaneously. A wired-OR completion flag signals the host whenever all of the parallel rule processors have finished their recognition tasks. The host then extracts information from those rule processors whose rules have been satisfied and, based on a global criterion, selects one of these rules. The host then carries out the actions dictated by this rule and broadcasts new information to the rule processors to begin another recognize-act cycle. Statistics detailing the activities of the host and all of the rule processors are collected and displayed in real time. Thus, the performance of the various aspects of the architecture can be readily analyzed. Also, the execution of the expert system program itself can be studied to detect situations that may be altered to permit additional speedup.
Parallelization of the preconditioned IDR solver for modern multicore computer systems
Bessonov, O. A.; Fedoseyev, A. I.
2012-10-01
This paper presents the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs the iterative IDR algorithm with ILU preconditioning (user-chosen ILU preconditioning order). CNSPACK has been successfully used during the last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there has been a dramatic change in processor architectures and computer system organization in recent years. Because of this, performance criteria and methods have been revisited, and the solver and preconditioner have been parallelized using the OpenMP environment. Results of the successful implementation for efficient parallelization are presented for the most advanced computer systems (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
A review of parallel computing for large-scale remote sensing image mosaicking
Chen, Lajiao; Ma, Yan; Liu, Peng; Wei, Jingbo; Jie, Wei; He, Jijun
2015-01-01
Interest in image mosaicking has been spurred by a wide variety of research and management needs. However, for large-scale applications, remote sensing image mosaicking usually requires significant computational capabilities. Several studies have attempted to apply parallel computing to improve image mosaicking algorithms and to speed up the calculation process. The state of the art of this field has not yet been summarized, which is, however, essential for a better understanding and for further ...
Energy Technology Data Exchange (ETDEWEB)
Koniges, A.
1996-02-09
This project is a package of 11 individual CRADAs plus hardware. This innovative project established a three-year multi-party collaboration that is significantly accelerating the availability of commercial massively parallel processing computing software technology to U.S. government, academic, and industrial end-users. This report contains individual presentations from nine principal investigators along with overall program information.
Use of massively parallel computing to improve modelling accuracy within the nuclear sector
Directory of Open Access Journals (Sweden)
L M Evans
2016-06-01
This work presents recent advancements in three techniques: uncertainty quantification (UQ); cellular automata finite element (CAFE); and image-based finite element methods (IBFEM). Case studies are presented demonstrating their suitability for use in nuclear engineering, made possible by advancements in parallel computing hardware that is projected to be available to industry within the next decade at a cost of the order of $100k.
Chen, Hsinchun; Martinez, Joanne; Kirchhoff, Amy; Ng, Tobun D.; Schatz, Bruce R.
1998-01-01
Grounded on object filtering, automatic indexing, and co-occurrence analysis, an experiment was performed using a parallel supercomputer to analyze over 400,000 abstracts in an INSPEC computer engineering collection. A user evaluation revealed that system-generated thesauri were better than the human-generated INSPEC subject thesaurus in concept…
CaKernel – A Parallel Application Programming Framework for Heterogenous Computing Architectures
Directory of Open Access Journals (Sweden)
Marek Blazewicz
2011-01-01
Despite the recent advent of new heterogeneous computing architectures, there is still a lack of parallel problem solving environments that can help scientists use hybrid supercomputers easily and efficiently. Many scientific simulations that use structured grids to solve partial differential equations in fact rely on stencil computations. Stencil computations have become crucial in solving many challenging problems in various domains, e.g., engineering or physics. Although many parallel stencil computing approaches have been proposed, in most cases they solve only particular problems. As a result, scientists struggle when it comes to implementing a new stencil-based simulation, especially on high performance hybrid supercomputers. In response to this need we extend our previous work on a parallel programming framework for CUDA, CaCUDA, which now supports OpenCL. We present CaKernel, a tool that simplifies the development of parallel scientific applications on hybrid systems. CaKernel is built on the highly scalable and portable Cactus framework. In the CaKernel framework, Cactus manages the inter-process communication via MPI while CaKernel manages the code running on Graphics Processing Units (GPUs) and the interactions between them. As a non-trivial test case we have developed a 3D CFD code to demonstrate the performance and scalability of the automatically generated code.
International Nuclear Information System (INIS)
Chen Jian-Lin; Li Lei; Wang Lin-Yuan; Cai Ai-Long; Xi Xiao-Qi; Zhang Han-Ming; Li Jian-Xin; Yan Bin
2015-01-01
The projection matrix model is used to describe the physical relationship between reconstructed object and projection. Such a model has a strong influence on projection and backprojection, two vital operations in iterative computed tomographic reconstruction. The distance-driven model (DDM) is a state-of-the-art technology that simulates forward and back projections. This model has a low computational complexity and a relatively high spatial resolution; however, it includes only a few methods in a parallel operation with a matched model scheme. This study introduces a fast and parallelizable algorithm to improve the traditional DDM for computing the parallel projection and backprojection operations. Our proposed model has been implemented on a GPU (graphic processing unit) platform and has achieved satisfactory computational efficiency with no approximation. The runtime for the projection and backprojection operations with our model is approximately 4.5 s and 10.5 s per loop, respectively, with an image size of 256×256×256 and 360 projections with a size of 512×512. We compare several general algorithms that have been proposed for maximizing GPU efficiency by using the unmatched projection/backprojection models in a parallel computation. The imaging resolution is not sacrificed and remains accurate during computed tomographic reconstruction. (paper)
International Nuclear Information System (INIS)
Woodruff, S.B.
1992-01-01
The Transient Reactor Analysis Code (TRAC), which features a two-fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local, the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers should consider algorithms, such as a neural net representation, that do not exhibit load-balancing problems.
Energy Technology Data Exchange (ETDEWEB)
Barrett, Richard Frederick; Heroux, Michael Allen; Vaughan, Courtenay Thomas
2012-04-01
A broad range of scientific computation involves the use of difference stencils. In a parallel computing environment, this computation is typically implemented by decomposing the spatial domain, inducing a 'halo exchange' of process-owned boundary data. This approach adheres to the Bulk Synchronous Parallel (BSP) model. Because commonly available architectures provide strong inter-node bandwidth relative to latency costs, many codes 'bulk up' these messages by aggregating data into a message as a means of reducing the number of messages. A renewed focus on non-traditional architectures and architecture features provides new opportunities for exploring alternatives to this programming approach. In this report we describe miniGhost, a 'miniapp' designed for exploration of the capabilities of current as well as emerging and future architectures within the context of these sorts of applications. MiniGhost joins the suite of miniapps developed as part of the Mantevo project.
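A halo exchange is easiest to see in one dimension. The Python sketch below is not miniGhost code; the three-point averaging stencil, the fixed end values, the function names and the single-process simulation of "exchange" by copying neighbour values are all assumptions made for illustration. It shows that once each block has one ghost cell per neighbour, the block-wise result matches the serial stencil.

```python
def stencil_serial(u):
    """Three-point average on interior cells; the two end values stay fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def stencil_with_halo(u, num_parts):
    """Apply the same stencil block by block, each block padded with a halo."""
    n = len(u)
    blocks = [(p * n // num_parts, (p + 1) * n // num_parts)
              for p in range(num_parts)]
    out = []
    for lo, hi in blocks:
        # "Halo exchange": fetch one boundary value from each neighbouring block.
        left = [u[lo - 1]] if lo > 0 else []
        right = [u[hi]] if hi < n else []
        local = left + u[lo:hi] + right
        offset = len(left)
        for i in range(lo, hi):
            j = i - lo + offset  # position of global cell i in the local array
            if i == 0 or i == n - 1:
                out.append(u[i])  # global boundary: value is kept as is
            else:
                out.append((local[j - 1] + local[j] + local[j + 1]) / 3.0)
    return out
```

In a real message-passing code the two copies marked as the halo exchange would be a send/receive pair per neighbour, and aggregating several variables into one such message is the "bulking up" the abstract refers to.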
International Nuclear Information System (INIS)
Pang, Kar Mun; Ng, Hoon Kiat; Gan, Suyin
2012-01-01
Highlights: ► A performance benchmarking exercise is conducted for diesel combustion simulations. ► The reduced chemical mechanism shows its advantages over base and skeletal models. ► High efficiency and great reduction of CPU runtime are achieved through 4-node solver. ► Increasing ISAT memory from 0.1 to 2 GB reduces the CPU runtime by almost 35%. ► Combustion and soot processes are predicted well with minimal computational cost. - Abstract: In the present study, in-cylinder diesel combustion simulation was performed with parallel processing on an Intel Xeon Quad-Core platform to allow both fluid dynamics and chemical kinetics of the surrogate diesel fuel model to be solved simultaneously on multiple processors. Here, Cartesian Z-Coordinate was selected as the most appropriate partitioning algorithm since it computationally bisects the domain such that the dynamic load associated with fuel particle tracking was evenly distributed during parallel computations. Other variables examined included number of compute nodes, chemistry sizes and in situ adaptive tabulation (ISAT) parameters. Based on the performance benchmarking test conducted, a parallel configuration of 4 compute nodes was found to reduce the computational runtime most efficiently, achieving a parallel efficiency of up to 75.4%. The simulation results also indicated that accuracy level was insensitive to the number of partitions or the partitioning algorithms. The effect of reducing the number of species on computational runtime was observed to be more significant than reducing the number of reactions. Besides, the study showed that an increase in the ISAT maximum storage of up to 2 GB reduced the computational runtime by 50%. Also, the ISAT error tolerance of 10^-3 was chosen to strike a balance between results accuracy and computational runtime. The optimised parameters in parallel processing and ISAT, as well as the use of the in-house reduced chemistry model allowed accurate
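To put the quoted efficiency in perspective, the short Python sketch below applies the usual definitions: speedup is serial runtime divided by parallel runtime, and efficiency is speedup divided by the number of workers. It assumes the abstract's 75.4% figure is measured against a single compute node, which the abstract does not state; the function names are hypothetical.

```python
def parallel_efficiency(t_serial, t_parallel, workers):
    """Efficiency = speedup / workers, with speedup = t_serial / t_parallel."""
    return t_serial / (t_parallel * workers)


def speedup_from_efficiency(efficiency, workers):
    """Invert the definition above to recover the speedup."""
    return efficiency * workers
```

Under that assumption, an efficiency of 0.754 on 4 compute nodes corresponds to a speedup of 0.754 × 4 ≈ 3.0, that is, a runtime of roughly one third of the single-node runtime.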
Just-in-time compilation-inspired methodology for parallelization of compute intensive java code
International Nuclear Information System (INIS)
Mustafa, G.; Ghani, M.U.
2017-01-01
Compute-intensive programs generally spend a significant fraction of execution time in a small amount of repetitive code. Such repetitive code is commonly known as hotspot code. We observed that compute-intensive hotspots often possess exploitable loop-level parallelism. A JIT (Just-in-Time) compiler profiles a running program to identify its hotspots. Hotspots are then translated into native code for efficient execution. Using a similar approach, we propose a methodology to identify hotspots and exploit their parallelization potential on multicore systems. The proposed methodology selects and parallelizes each DOALL loop that is either contained in a hotspot method or calls a hotspot method. The methodology could be integrated into the front end of a JIT compiler to parallelize sequential code just before native translation. However, compilation to native code is out of the scope of this work. As a case study, we analyze eighteen JGF (Java Grande Forum) benchmarks to determine the parallelization potential of hotspots. Eight benchmarks demonstrate a speedup of up to 7.6x on an 8-core system. (author)
Just-in-Time Compilation-Inspired Methodology for Parallelization of Compute Intensive Java Code
Directory of Open Access Journals (Sweden)
GHULAM MUSTAFA
2017-01-01
Compute-intensive programs generally spend a significant fraction of execution time in a small amount of repetitive code. Such repetitive code is commonly known as hotspot code. We observed that compute-intensive hotspots often possess exploitable loop-level parallelism. A JIT (Just-in-Time) compiler profiles a running program to identify its hotspots. Hotspots are then translated into native code for efficient execution. Using a similar approach, we propose a methodology to identify hotspots and exploit their parallelization potential on multicore systems. The proposed methodology selects and parallelizes each DOALL loop that is either contained in a hotspot method or calls a hotspot method. The methodology could be integrated into the front end of a JIT compiler to parallelize sequential code just before native translation. However, compilation to native code is out of the scope of this work. As a case study, we analyze eighteen JGF (Java Grande Forum) benchmarks to determine the parallelization potential of hotspots. Eight benchmarks demonstrate a speedup of up to 7.6x on an 8-core system.
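To illustrate what a DOALL loop is, the following is a minimal Python sketch, not the authors' Java methodology. A DOALL loop is one whose iterations are mutually independent, so they may run in any order or concurrently. The names `hotspot`, `sequential_loop`, and `doall_loop` are hypothetical, and a thread pool stands in for whatever scheduling a JIT front end would generate.

```python
from concurrent.futures import ThreadPoolExecutor


def hotspot(x):
    """Hypothetical hotspot method: a pure function of its argument."""
    return x * x + 1


def sequential_loop(data):
    # Original loop: iteration i reads data[i] and writes only out[i],
    # so no iteration depends on another (the DOALL property).
    out = [0] * len(data)
    for i in range(len(data)):
        out[i] = hotspot(data[i])
    return out


def doall_loop(data, workers=8):
    # Parallel form: independent iterations are handed to a pool of
    # workers; pool.map returns results in the original index order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hotspot, data))
```

The sketch shows only the structure of the transformation. In the standard CPython build, threads do not speed up CPU-bound code because of the global interpreter lock, so no speedup figure should be inferred from it.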
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer
Jones, Mark Howard
1987-01-01
A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Architecture and VHDL behavioural validation of a parallel processor dedicated to computer vision
International Nuclear Information System (INIS)
Collette, Thierry
1992-01-01
Speeding up image processing is mainly obtained using parallel computers; SIMD processors (single instruction stream, multiple data stream) have been developed, and have proven highly efficient for low-level image processing operations. Nevertheless, their performance drops for most intermediate or high-level operations, mainly when random data reorganisations in processor memories are involved. The aim of this thesis was to extend the SIMD computer capabilities to allow it to perform more efficiently at the intermediate level of image processing. The study of some representative algorithms of this class points out the limits of this computer. Nevertheless, these limits can be removed by architectural modifications. This leads us to propose SYMPATIX, a new SIMD parallel computer. To validate its new concept, a behavioural model written in VHDL, a hardware description language, has been elaborated. With this model, the performance of the new computer has been estimated by running image processing algorithm simulations. The VHDL modelling approach allows top-down electronic design of the system, giving an easy coupling between system architectural modifications and their electronic cost. The obtained results show SYMPATIX to be an efficient computer for low and intermediate level image processing. It can be connected to a high level computer, opening up the development of new computer vision applications. This thesis also presents a top-down design method, based on VHDL, intended for electronic system architects. (author)
A learnable parallel processing architecture towards unity of memory and computing
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-08-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A learnable parallel processing architecture towards unity of memory and computing.
Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J
2015-08-14
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
International Nuclear Information System (INIS)
Bergamaschi, Luca; Pini, Giorgio; Sartoretto, Flavio
2003-01-01
The Jacobi-Davidson (JD) algorithm was recently proposed for evaluating a number of the eigenvalues of a matrix. JD goes beyond pure Krylov-space techniques; it cleverly expands its search space by solving the so-called correction equation, thus in principle providing a more powerful method. Preconditioning the Jacobi-Davidson correction equation is mandatory when large, sparse matrices are analyzed. We considered several preconditioners: classical block-Jacobi and IC(0), together with approximate inverse (AINV or FSAI) preconditioners. The rationale for using approximate inverse preconditioners is their high parallelization potential, combined with their efficiency in accelerating the iterative solution of the correction equation. Analysis was carried out on the sequential performance of preconditioned JD for the spectral decomposition of large, sparse matrices, which originate in the numerical integration of partial differential equations arising in physical and engineering problems. It was found that JD is highly sensitive to preconditioning, and it can display an irregular convergence behavior. We parallelized JD by data-splitting techniques, combining them with techniques to reduce the amount of communication data. Our own parallel, preconditioned code was executed on a dedicated parallel machine, and we present the results of our experiments. Our JD code provides an appreciable parallel degree of computation. Its performance was also compared with those of PARPACK and parallel DACG.
Janetzke, D. C.; Murthy, D. V.
1991-01-01
Aeroelastic analysis is multidisciplinary and computationally expensive. Hence, it can greatly benefit from parallel processing. As part of an effort to develop an aeroelastic analysis capability on a distributed-memory transputer network, a parallel algorithm for the computation of aerodynamic influence coefficients is implemented on a network of 32 transputers. The aerodynamic influence coefficients are calculated using a three-dimensional unsteady aerodynamic model and a panel discretization. Efficiencies up to 85 percent are demonstrated using 32 processors. The effects of subtask ordering, problem size and network topology are presented. A comparison to results on a shared-memory computer indicates that higher speedup is achieved on the distributed-memory system.
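As a worked illustration of the efficiency figure quoted above, the sketch below assumes the conventional definition of parallel efficiency, speedup divided by processor count; the abstract does not state which definition was used. Under that assumption, 85 percent efficiency on 32 processors corresponds to a speedup of about 27.2. The function names are hypothetical.

```python
def efficiency_from_speedup(speedup, processors):
    """Parallel efficiency E = S / p (assumed conventional definition)."""
    return speedup / processors


def speedup_from_efficiency(efficiency, processors):
    """Inverse relation S = E * p."""
    return efficiency * processors


# 85 percent efficiency on 32 transputers, as quoted in the abstract.
implied_speedup = speedup_from_efficiency(0.85, 32)
print(f"implied speedup: {implied_speedup:.1f}")
```

The same relation applies to other figures in this listing, for example a 7.6x speedup on 8 cores gives an efficiency of 0.95 under the same assumed definition.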
Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David
1987-01-01
The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.
Directory of Open Access Journals (Sweden)
Mikhail A. Plaksin
2017-12-01
The paper describes methodological materials for including the topic "Parallel computing" in school informatics. A set of tasks, "Swarm of Robots", is considered. The tasks were tested at the competition "TRIZformaska-2017", an inter-regional competition in informatics, system analysis and the theory of inventive problem solving (TRIZ) for schoolchildren and students. The following tasks are considered: the computer game "Swarm of Robots" (the software is freeware); the task about rescue robots; the task about a swarm of cellular automata creating a pattern; load balancing in a computer network using a multi-agent system; the show of quadrocopters.
Parallel Monte Carlo simulations on an ARC-enabled computing grid
International Nuclear Information System (INIS)
Nilsen, Jon K; Samset, Bjørn H
2011-01-01
Grid computing opens new possibilities for running heavy Monte Carlo simulations of physical systems in parallel. The presentation gives an overview of GaMPI, a system for running an MPI-based random walker simulation on grid resources. Integrating the ARC middleware and the new storage system Chelonia with the Ganga grid job submission and control system, we show that MPI jobs can be run on a world-wide computing grid with good performance and promising scaling properties. Results for relatively communication-heavy Monte Carlo simulations run on multiple heterogeneous, ARC-enabled computing clusters in several countries are presented.
Pierson, Kendall Hugh
The Finite Element Tearing and Interconnecting (FETI) algorithms are numerically scalable iterative domain decomposition methods for solving systems of equations generated from the finite element discretization of second- or fourth-order elasticity problems. These methods have been substantially improved over the last ten years and recently shown parallel scalability up to one thousand processors. The purpose of this thesis is to present and investigate a dual-primal FETI method, which addresses some of the critical issues related to the original FETI methods. These critical issues involve the accurate computation of the local rigid body modes, the cost and size of the FETI coarse problems with respect to fourth-order elasticity problems, and the overall robustness and versatility of the equation solver. These improvements due to the dual-primal FETI formulation are especially beneficial when implemented on massively parallel distributed memory computers such as the Accelerated Strategic Computing Initiative (ASCI) Red Option supercomputer. Numerical results will be shown detailing scalability with respect to the mesh size, subdomain size, and the number of elements per subdomain for both second- and fourth-order elasticity problems. Parallel scalability will be reported for various large scale realistic problems on a SGI Origin 2000 and the ASCI Red option massively parallel supercomputer. Lastly, results from linear dynamics, eigenvalue analysis and geometrically non-linear static problems will be shown highlighting the benefits of FETI methods for solving large-scale problems with multiple right hand sides.
A nested decomposition algorithm for parallel computations of very large sparse systems
Directory of Open Access Journals (Sweden)
D. D. Šiljak
1995-01-01
In this paper we present a generalization of the balanced border block diagonal (BBD) decomposition algorithm, which was developed for the parallel computation of sparse systems of linear equations. The efficiency of the new procedure is substantially higher, and it extends the applicability of the BBD decomposition to extremely large problems. Examples of the decomposition are provided for matrices as large as 250,000×250,000, and its performance is compared to other sparse decompositions. Applications to the parallel solution of sparse systems are discussed for a variety of engineering problems.
Parallel computations of molecular dynamics trajectories using the stochastic path approach
Zaloj, Veaceslav; Elber, Ron
2000-06-01
A novel protocol to parallelize molecular dynamics trajectories is discussed and tested on a cluster of PCs running the NT operating system. The new technique does not propagate the solution in small time steps, but uses instead a global optimization of a functional of the whole trajectory. The new approach is especially attractive for parallel and distributed computing and its advantages (and disadvantages) are presented. Two numerical examples are discussed: (a) A conformational transition in a solvated dipeptide, and (b) The R→T conformational transition in solvated hemoglobin.
Method, systems, and computer program products for implementing function-parallel network firewall
Fulp, Errin W [Winston-Salem, NC; Farley, Ryan J [Winston-Salem, NC
2011-10-11
Methods, systems, and computer program products for providing function-parallel firewalls are disclosed. According to one aspect, a function-parallel firewall includes a first firewall node for filtering received packets using a first portion of a rule set including a plurality of rules. The first portion includes less than all of the rules in the rule set. At least one second firewall node filters packets using a second portion of the rule set. The second portion includes at least one rule in the rule set that is not present in the first portion. The first and second portions together include all of the rules in the rule set.
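The rule-set partitioning described above can be sketched in Python as follows. This is a hedged illustration, not the patented design: the `Rule` class, the round-robin split, and the "lowest priority number wins" step that combines the nodes' results are all assumptions introduced here, and the nodes are evaluated one after another rather than concurrently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    priority: int                      # position in the full rule set (lower matches first)
    matches: Callable[[dict], bool]    # predicate over a packet's header fields
    action: str                        # "accept" or "deny"


def split_rules(rules: List[Rule], n_nodes: int) -> List[List[Rule]]:
    """Deal rules round-robin so every node holds only part of the rule set."""
    ordered = sorted(rules, key=lambda r: r.priority)
    return [ordered[i::n_nodes] for i in range(n_nodes)]


def node_filter(portion: List[Rule], packet: dict) -> Optional[Rule]:
    """One firewall node: return the first rule in its portion that matches."""
    for rule in portion:
        if rule.matches(packet):
            return rule
    return None


def function_parallel_filter(portions, packet, default="deny"):
    """Combine per-node results (assumed policy: lowest priority number wins)."""
    hits = [r for r in (node_filter(p, packet) for p in portions) if r is not None]
    if not hits:
        return default
    return min(hits, key=lambda r: r.priority).action
```

With this combination step the distributed result equals a first-match scan of the whole rule set, since each portion preserves the original rule order and the portions together cover every rule.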
Iterative schemes for parallel Sn algorithms in a shared-memory computing environment
International Nuclear Information System (INIS)
Haghighat, A.; Hunter, M.A.; Mattis, R.E.
1995-01-01
Several two-dimensional spatial domain partitioning Sn transport theory algorithms are developed on the basis of different iterative schemes. These algorithms are incorporated into TWOTRAN-II and tested on the shared-memory CRAY Y-MP C90 computer. For a series of fixed-source r-z geometry homogeneous problems, it is demonstrated that the concurrent red-black algorithms may result in large parallel efficiencies (>60%) on the C90. It is also demonstrated that for a realistic shielding problem, the use of the negative flux fixup causes high load imbalance, which results in a significant loss of parallel efficiency.
Acceleration of Radiance for Lighting Simulation by Using Parallel Computing with OpenCL
Energy Technology Data Exchange (ETDEWEB)
Zuo, Wangda; McNeil, Andrew; Wetter, Michael; Lee, Eleanor
2011-09-06
We report on the acceleration of annual daylighting simulations for fenestration systems in the Radiance ray-tracing program. The algorithm was optimized to reduce both the redundant data input/output operations and the floating-point operations. To further accelerate the simulation speed, the calculation for matrix multiplications was implemented using parallel computing on a graphics processing unit. We used OpenCL, which is a cross-platform parallel programming language. Numerical experiments show that the combination of the above measures can speed up the annual daylighting simulations 101.7 times or 28.6 times when the sky vector has 146 or 2306 elements, respectively.
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws
Cooke, Daniel; Rushton, Nelson
2013-01-01
With the introduction of new parallel architectures like the Cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for high-end computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language, that is, a programming language that is closer to a human's way of thinking than to a machine's. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequential/single-core code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify-Produce (CSP) and Normalize-Transpose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less
International Nuclear Information System (INIS)
Kirk, B.L.; Azmy, Y.Y.
1992-01-01
In this paper the one-group, steady-state neutron diffusion equation in two-dimensional Cartesian geometry is solved using the nodal integral method. The discrete variable equations comprise loosely coupled sets of equations representing the nodal balance of neutrons, as well as neutron current continuity along rows or columns of computational cells. An iterative algorithm that is more suitable for solving large problems concurrently is derived based on the decomposition of the spatial domain and is accelerated using successive overrelaxation. This algorithm is very well suited for parallel computers, especially since the spatial domain decomposition occurs naturally, so that the number of iterations required for convergence does not depend on the number of processors participating in the calculation. Implementation of the authors' algorithm on the Intel iPSC/2 hypercube and Sequent Balance 8000 parallel computer is presented, and measured speedup and efficiency for test problems are reported. The results suggest that the efficiency of the hypercube quickly deteriorates when many processors are used, while the Sequent Balance retains very high efficiency for a comparable number of participating processors. This leads to the conjecture that message-passing parallel computers are not as well suited for this algorithm as shared-memory machines
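For readers unfamiliar with successive overrelaxation (SOR), the sketch below applies it to a generic model problem, the five-point discrete Laplace equation on a square grid. It is not the nodal integral method of the abstract and it omits the domain decomposition entirely; the function name, the relaxation factor `omega=1.5`, and the stopping tolerance are assumptions chosen only for illustration.

```python
def sor_laplace(n, boundary, omega=1.5, tol=1e-10, max_iter=10_000):
    """Solve the 5-point Laplace equation on an n x n grid by SOR.

    boundary(i, j) supplies the fixed values on the grid edge;
    interior points start from zero.
    """
    u = [[boundary(i, j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
          for j in range(n)] for i in range(n)]
    for iteration in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Gauss-Seidel value: average of the four neighbours.
                average = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
                # Overrelaxation: step past the Gauss-Seidel value by omega.
                new = (1.0 - omega) * u[i][j] + omega * average
                max_change = max(max_change, abs(new - u[i][j]))
                u[i][j] = new
        if max_change < tol:
            return u, iteration
    return u, max_iter
```

Setting `omega=1.0` recovers plain Gauss-Seidel, which makes it easy to compare iteration counts on the same grid.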
Energy Technology Data Exchange (ETDEWEB)
Joubert, W. [Los Alamos National Lab., NM (United States); Carey, G.F. [Univ. of Texas, Austin, TX (United States)
1994-12-31
A great need exists for high performance numerical software libraries transportable across parallel machines. This talk concerns the PCG package, which solves systems of linear equations by iterative methods on parallel computers. The features of the package are discussed, as well as techniques used to obtain high performance as well as transportability across architectures. Representative numerical results are presented for several machines including the Connection Machine CM-5, Intel Paragon and Cray T3D parallel computers.
7th International Workshop on Parallel Tools for High Performance Computing
Gracia, José; Nagel, Wolfgang; Resch, Michael
2014-01-01
Current advances in High Performance Computing (HPC) increasingly impact efficient software development workflows. Programmers for HPC applications need to consider trends such as increased core counts, multiple levels of parallelism, reduced memory per core, and I/O system challenges in order to derive well performing and highly scalable codes. At the same time, the increasing complexity adds further sources of program defects. While novel programming paradigms and advanced system libraries provide solutions for some of these challenges, appropriate supporting tools are indispensable. Such tools aid application developers in debugging, performance analysis, or code optimization and therefore make a major contribution to the development of robust and efficient parallel software. This book introduces a selection of the tools presented and discussed at the 7th International Parallel Tools Workshop, held in Dresden, Germany, September 3-4, 2013.
International Nuclear Information System (INIS)
Samatova, Nagiza F; Branstetter, Marcia; Ganguly, Auroop R; Hettich, Robert; Khan, Shiraj; Kora, Guruprasad; Li, Jiangtian; Ma, Xiaosong; Pan, Chongle; Shoshani, Arie; Yoginath, Srikanth
2006-01-01
Ultrascale computing and high-throughput experimental technologies have enabled the production of scientific data about complex natural phenomena. With this opportunity, comes a new problem - the massive quantities of data so produced. Answers to fundamental questions about the nature of those phenomena remain largely hidden in the produced data. The goal of this work is to provide a scalable high performance statistical data analysis framework to help scientists perform interactive analyses of these raw data to extract knowledge. Towards this goal we have been developing an open source parallel statistical analysis package, called Parallel R, that lets scientists employ a wide range of statistical analysis routines on high performance shared and distributed memory architectures without having to deal with the intricacies of parallelizing these routines
Concurrent particle-in-cell plasma simulation on a multi-transputer parallel computer
International Nuclear Information System (INIS)
Khare, A.N.; Jethra, A.; Patel, Kartik
1992-01-01
This report describes the parallelization of a Particle-in-Cell (PIC) plasma simulation code on a multi-transputer parallel computer. The algorithm used in the parallelization of the PIC method is described. The decomposition schemes related to the distribution of the particles among the processors are discussed. The implementation of the algorithm on a transputer network connected as a torus is presented. The solutions of the problems related to global communication of data are presented in the form of a set of generalized communication functions. The performance of the program as a function of data size and the number of transputers shows that the implementation is scalable and represents an effective way of achieving high performance at acceptable cost. (author). 11 refs., 4 figs., 2 tabs., appendices
11th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing
Barolli, Leonard; Amato, Flora
2017-01-01
P2P, Grid, Cloud and Internet computing technologies have very quickly become established as breakthrough paradigms for solving complex problems by enabling aggregation and sharing of an increasing variety of distributed computational resources at large scale. The aim of this volume is to provide the latest research findings, innovative research results, methods and development techniques from both theoretical and practical perspectives related to P2P, Grid, Cloud and Internet computing as well as to reveal synergies among such large scale computing paradigms. This proceedings volume presents the results of the 11th International Conference on P2P, Parallel, Grid, Cloud And Internet Computing (3PGCIC-2016), held November 5-7, 2016, at Soonchunhyang University, Asan, Korea.
A scalable approach to modeling groundwater flow on massively parallel computers
International Nuclear Information System (INIS)
Ashby, S.F.; Falgout, R.D.; Tompson, A.F.B.
1995-12-01
We describe a fully scalable approach to the simulation of groundwater flow on a hierarchy of computing platforms, ranging from workstations to massively parallel computers. Specifically, we advocate the use of scalable conceptual models in which the subsurface model is defined independently of the computational grid on which the simulation takes place. We also describe a scalable multigrid algorithm for computing the groundwater flow velocities. We are thus able to leverage both the engineer's time spent developing the conceptual model and the computing resources used in the numerical simulation. We have successfully employed this approach at the LLNL site, where we have run simulations ranging in size from just a few thousand spatial zones (on workstations) to more than eight million spatial zones (on the CRAY T3D), all using the same conceptual model.
A 3D gyrokinetic particle-in-cell simulation of fusion plasma microturbulence on parallel computers
Williams, T. J.
1992-12-01
One of the grand challenge problems now supported by HPCC is the Numerical Tokamak Project. A goal of this project is the study of low-frequency micro-instabilities in tokamak plasmas, which are believed to cause energy loss via turbulent thermal transport across the magnetic field lines. An important tool in this study is gyrokinetic particle-in-cell (PIC) simulation. Gyrokinetic, as opposed to fully-kinetic, methods are particularly well suited to the task because they are optimized to study the frequency and wavelength domain of the microinstabilities. Furthermore, many researchers now employ low-noise delta(f) methods to greatly reduce statistical noise by modelling only the perturbation of the gyrokinetic distribution function from a fixed background, not the entire distribution function. In spite of the increased efficiency of these improved algorithms over conventional PIC algorithms, gyrokinetic PIC simulations of tokamak micro-turbulence are still highly demanding of computer power--even fully-vectorized codes on vector supercomputers. For this reason, we have worked for several years to redevelop these codes on massively parallel computers. We have developed 3D gyrokinetic PIC simulation codes for SIMD and MIMD parallel processors, using control-parallel, data-parallel, and domain-decomposition message-passing (DDMP) programming paradigms. This poster summarizes our earlier work on codes for the Connection Machine and BBN TC2000 and our development of a generic DDMP code for distributed-memory parallel machines. We discuss the memory-access issues which are of key importance in writing parallel PIC codes, with special emphasis on issues peculiar to gyrokinetic PIC. We outline the domain decompositions in our new DDMP code and discuss the interplay of different domain decompositions suited for the particle-pushing and field-solution components of the PIC algorithm.
Proceedings of the workshop on Compilation of (Symbolic) Languages for Parallel Computers
Energy Technology Data Exchange (ETDEWEB)
Foster, I.; Tick, E. (comp.)
1991-11-01
This report comprises the abstracts and papers for the talks presented at the Workshop on Compilation of (Symbolic) Languages for Parallel Computers, held October 31 to November 1, 1991, in San Diego. These unrefereed contributions were provided by the participants for the purpose of this workshop; many of them will be published elsewhere in peer-reviewed conferences and publications. Our goal in planning this workshop was to bring together researchers from different disciplines with common problems in compilation. In particular, we wished to encourage interaction between researchers working in compilation of symbolic languages and those working on compilation of conventional, imperative languages. The fundamental problems facing researchers interested in compilation of logic, functional, and procedural programming languages for parallel computers are essentially the same. However, differences in the basic programming paradigms have led to different communities emphasizing different species of the parallel compilation problem. For example, parallel logic and functional languages provide dataflow-like formalisms in which control dependencies are unimportant. Hence, a major focus of research in compilation has been on techniques that try to infer when sequential control flow can safely be imposed. Granularity analysis for scheduling is a related problem. The single-assignment property leads to a need for analysis of memory use in order to detect opportunities for reuse. Much of the work in each of these areas relies on the use of abstract interpretation techniques.
Energy Technology Data Exchange (ETDEWEB)
Amadio, G.; et al.
2017-11-22
An intensive R&D and programming effort is required to accomplish new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting in parallel particles in complex geometries exploiting instruction level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physics models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
International Nuclear Information System (INIS)
Azad, Ariful; Buluç, Aydın; Pothen, Alex
2016-01-01
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
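To make the notion of an augmenting-path search concrete, here is a minimal serial sketch in Python. It is not the tree-grafting algorithm of the abstract: it runs one breadth-first search per unmatched source vertex (the single-source style that the abstract contrasts itself with), and the function name and the adjacency-list input format are assumptions made for illustration.

```python
from collections import deque


def max_bipartite_matching(adj, n_right):
    """Serial single-source BFS matching (illustrative, not tree-grafting).

    adj[u] lists the right-side vertices adjacent to left vertex u.
    Returns match_left, where match_left[u] is the partner of u or -1.
    """
    match_left = [-1] * len(adj)
    match_right = [-1] * n_right

    for source in range(len(adj)):
        # parent[w] = (u, v): left vertex w was reached from left vertex u
        # through the unmatched edge u-v and the matched edge v-w.
        parent = {source: None}
        queue = deque([source])
        end = None
        while queue and end is None:
            u = queue.popleft()
            for v in adj[u]:
                w = match_right[v]
                if w == -1:
                    end = (u, v)      # free right vertex: augmenting path found
                    break
                if w not in parent:
                    parent[w] = (u, v)
                    queue.append(w)
        if end is None:
            continue                  # no augmenting path from this source
        # Flip matched and unmatched edges along the path back to the source.
        u, v = end
        while True:
            match_left[u] = v
            match_right[v] = u
            step = parent[u]
            if step is None:
                break
            u, v = step
    return match_left
```

Each search here starts from scratch, which is the redundancy a multiple-source method with tree grafting is described as reducing.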
The FORCE: A portable parallel programming language supporting computational structural mechanics
Jordan, Harry F.; Benten, Muhammad S.; Brehm, Juergen; Ramanan, Aruna
1989-01-01
This project supports the conversion of codes in Computational Structural Mechanics (CSM) to a parallel form which will efficiently exploit the computational power available from multiprocessors. The work is a part of a comprehensive, FORTRAN-based system to form a basis for a parallel version of the NICE/SPAR combination which will form the CSM Testbed. The software is macro-based and rests on the force methodology developed by the principal investigator in connection with an early scientific multiprocessor. Machine independence is an important characteristic of the system so that retargeting it to the Flex/32, or any other multiprocessor on which NICE/SPAR might be implemented, is well supported. The principal investigator has experience in producing parallel software for both full and sparse systems of linear equations using the force macros. Other researchers have used the Force in finite element programs. It has been possible to rapidly develop software which performs at maximum efficiency on a multiprocessor. The inherent machine independence of the system also means that the parallelization will not be limited to a specific multiprocessor.
Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Behera, S. P.; Brun, R.; Canal, P.; Carminati, F.; Cosmo, G.; Duhem, L.; Elvira, D.; Folger, G.; Gheata, A.; Gheata, M.; Goulas, I.; Hariri, F.; Jun, S. Y.; Konstantinov, D.; Kumawat, H.; Ivantchenko, V.; Lima, G.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.
2017-10-01
An intensive R&D and programming effort is required to accomplish new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting in parallel particles in complex geometries exploiting instruction level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physics models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.
Analysis and selection of optimal function implementations in massively parallel computer
Archer, Charles Jens [Rochester, MN]; Peters, Amanda [Rochester, MN]; Ratterman, Joseph D [Rochester, MN]
2011-05-31
An apparatus, program product and method optimize the operation of a parallel computer system by, in part, collecting performance data for a set of implementations of a function capable of being executed on the parallel computer system based upon the execution of the set of implementations under varying input parameters in a plurality of input dimensions. The collected performance data may be used to generate selection program code that is configured to call selected implementations of the function in response to a call to the function under varying input parameters. The collected performance data may be used to perform more detailed analysis to ascertain the comparative performance of the set of implementations of the function under the varying input parameters.
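The idea of collecting performance data per implementation and then dispatching on the input can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented mechanism: the names `collect_timings` and `build_selector` are hypothetical, the only input dimension considered is the length of the argument, and a closure stands in for the generated selection program code.

```python
import time


def collect_timings(implementations, sample_inputs, repeats=3):
    """Time every implementation on every sample input.

    implementations: dict name -> callable taking one argument
    sample_inputs:   dict size -> representative argument of that size
    Returns a dict (name, size) -> best observed wall-clock time.
    """
    timings = {}
    for size, arg in sample_inputs.items():
        for name, fn in implementations.items():
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()
                fn(arg)
                best = min(best, time.perf_counter() - start)
            timings[(name, size)] = best
    return timings


def build_selector(implementations, timings, sizes):
    """Return (dispatch, best_for) built from the collected timings."""
    best_for = {
        size: min(implementations, key=lambda name: timings[(name, size)])
        for size in sizes
    }

    def dispatch(data):
        # Use the profiled size closest to the actual input size.
        nearest = min(sizes, key=lambda size: abs(size - len(data)))
        return implementations[best_for[nearest]](data)

    return dispatch, best_for
```

A caller would profile once, keep `dispatch`, and call it in place of any single implementation.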
High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation
Energy Technology Data Exchange (ETDEWEB)
Peterka, Tom; Morozov, Dmitriy; Phillips, Carolyn
2014-11-14
Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization; but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared-memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.
Process-Oriented Parallel Programming with an Application to Data-Intensive Computing
Givelberg, Edward
2014-01-01
We introduce process-oriented programming as a natural extension of object-oriented programming for parallel computing. It is based on the observation that every class of an object-oriented language can be instantiated as a process, accessible via a remote pointer. The introduction of process pointers requires no syntax extension, identifies processes with programming objects, and enables processes to exchange information simply by executing remote methods. Process-oriented programming is a h...
International Nuclear Information System (INIS)
Scheer, Patrick
1998-01-01
Progress in microelectronics leads to electronic circuits which are increasingly integrated, with an operating frequency and an input/output count larger than those supported by printed circuit board and back-plane technologies. As a result, distributed systems with several boards cannot fully exploit the performance of integrated circuits. In synchronous parallel computers, the situation is worse, since the overall system performance relies on the efficiency of the electrical interconnects between the integrated circuits which include the processing elements (PE). The study of a real parallel computer named SYMPHONIE shows, for instance, that the system operating frequency is far lower than the capabilities of the microelectronics technology used for the PE implementation. Optical interconnections may remove these limitations by providing more efficient connections between the PE. In particular, free-space optical interconnections based on vertical-cavity surface-emitting lasers (VCSEL), micro-lenses and PIN photodiodes are compatible with the required features of the PE communications. Zero-bias modulation of VCSEL with CMOS-compatible digital signals is studied and experimentally demonstrated. A model of the propagation of truncated Gaussian beams through micro-lenses is developed. It is then used to optimise the geometry of the detection areas. A dedicated mechanical system is also proposed and implemented for integrating free-space optical interconnects in a standard electronic environment, representative of that of parallel computer systems. A specially designed demonstrator provides the experimental validation of the above physical concepts. (author)
Förster, Michael
2014-01-01
Numerical programs often use parallel programming techniques such as OpenMP to compute the program's output values as efficiently as possible. In addition, derivative values of these output values with respect to certain input values play a crucial role. To achieve code that computes not only the output values simultaneously but also the derivative values, this work introduces several source-to-source transformation rules. These rules are based on a technique called algorithmic differentiation. The main focus of this work lies on the important reverse mode of algorithmic differentiation. The inh
Open Quantum Dynamics Calculations with the Hierarchy Equations of Motion on Parallel Computers.
Strümpfer, Johan; Schulten, Klaus
2012-08-14
Calculating the evolution of an open quantum system, i.e., a system in contact with a thermal environment, has presented a theoretical and computational challenge for many years. With the advent of supercomputers containing large amounts of memory and many processors, the computational challenge posed by the previously intractable theoretical models can now be addressed. The hierarchy equations of motion present one such model and offer a powerful method that has remained under-utilized so far due to its considerable computational expense. By exploiting concurrent processing on parallel computers, the hierarchy equations of motion can be applied to biological-scale systems. Herein we introduce the quantum dynamics software PHI, which solves the hierarchical equations of motion. We describe the integrator employed by PHI and demonstrate PHI's scaling and efficiency running on large parallel computers by applying the software to the calculation of inter-complex excitation transfer between the light harvesting complexes 1 and 2 of purple photosynthetic bacteria, a 50 pigment system.
Parallel, distributed and GPU computing technologies in single-particle electron microscopy
International Nuclear Information System (INIS)
Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger
2009-01-01
An introduction to the current paradigm shift towards concurrency in software. Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined
Wang, Zhaocai; Ji, Zuwen; Wang, Xiaoming; Wu, Tunhua; Huang, Wei
2017-12-01
As a promising approach to solving computationally intractable problems, DNA computing is an emerging research area spanning mathematics, computer science and molecular biology. The task scheduling problem, a well-known NP-complete problem, assigns n jobs to m individuals and seeks the minimum execution time of the last individual to finish. In this paper, we use a biologically inspired computational model and describe a new parallel algorithm to solve the task scheduling problem by basic DNA molecular operations. In turn, we design flexible-length DNA strands to represent elements of the allocation matrix, apply appropriate biological experiment operations, and obtain solutions of the task scheduling problem in the proper length range with less than O(n^2) time complexity. Copyright © 2017. Published by Elsevier B.V.
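For readers unfamiliar with the problem statement, the following Python sketch spells out the objective by exhaustive search on a conventional computer. It says nothing about the DNA-based procedure itself; the function name is hypothetical, and it assumes each job takes the same time on every individual, which the abstract does not state.

```python
from itertools import product


def min_makespan(job_times, m):
    """Try every assignment of jobs to m individuals (exponential cost).

    Returns (finish_time, assignment), where assignment[j] is the index of
    the individual given job j and finish_time is the time at which the
    last individual finishes under the best assignment found.
    """
    best_time, best_assignment = None, None
    for assignment in product(range(m), repeat=len(job_times)):
        loads = [0] * m
        for job, individual in enumerate(assignment):
            loads[individual] += job_times[job]
        finish = max(loads)
        if best_time is None or finish < best_time:
            best_time, best_assignment = finish, assignment
    return best_time, best_assignment
```

The loop visits m**n assignments, which is the blow-up that motivates massively parallel search strategies for this problem.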
International Nuclear Information System (INIS)
Yamada, Susumu; Igarashi, Ryo; Machida, Masahiko; Imamura, Toshiyuki; Okumura, Masahiko; Onishi, Hiroaki
2010-01-01
We parallelize the density matrix renormalization group (DMRG) method, which is a ground-state solver for one-dimensional quantum lattice systems. The parallelization allows us to extend the applicable range of the DMRG to n-leg ladders, i.e., quasi-two-dimensional cases. Such an extension is expected to bring about several breakthroughs in, e.g., quantum physics, chemistry, and nano-engineering. However, the straightforward parallelization requires all-to-all communications between all processes, which are unsuitable for multi-core systems, the mainstream of current parallel computers. Therefore, we optimize the all-to-all communications in the following two steps. The first is the elimination of the communications between all processes by only rearranging the data distribution, with the amount of communicated data kept the same. The second is the avoidance of communication conflicts by rescheduling the calculation and the communication. We evaluate the performance of the DMRG method on multi-core supercomputers and confirm that our two-step tuning is quite effective. (author)
Nakano, S.; Higuchi, T.
2012-04-01
The particle filter (PF) is one of the ensemble-based algorithms for data assimilation. The PF obtains an approximation of a posterior PDF of a state by resampling with replacement from a prior ensemble. The procedure of the PF does not assume linearity or Gaussianity. Thus, it can be applied to general nonlinear problems. However, in order to obtain appropriate results for high-dimensional problems, the PF requires an enormous number of ensemble members. Since the PF must calculate the time integral for each particle at each time step, the large ensemble size results in prohibitive computational cost. There exist various methods for reducing the number of particles. In contrast, we employ a straightforward approach to overcome this problem; that is, we use a massively parallel computer to achieve a sufficiently large ensemble size. Since the time integral in the PF can readily be parallelized, we can notably improve the computational efficiency using a parallel computer. However, if we naively implement the PF on a distributed computing system, we encounter another difficulty; that is, many data transfers occur randomly between different nodes of the distributed computing system. Such data transfers can be reduced by dividing the ensemble into small subsets (groups). If we limit the resampling to within each of the subsets, the data transfers can be done efficiently in parallel. If the ensemble is divided into small subsets, the risk of local sample impoverishment within each of the subsets is enhanced. However, if we change the grouping at each time step, the information held by a node can be propagated to all of the nodes after a finite number of time steps and the local sample impoverishment can be avoided. In the present study, we compare the above method based on the local resampling of each group with the naive implementation of the PF based on the global resampling of the whole ensemble. The global resampling enables us to achieve a slightly better
Scalability of preconditioners as a strategy for parallel computation of compressible fluid flow
Energy Technology Data Exchange (ETDEWEB)
Hansen, G.A.
1996-05-01
Parallel implementations of a Newton-Krylov-Schwarz algorithm are used to solve a model problem representing low Mach number compressible fluid flow over a backward-facing step. The Mach number is specifically selected to result in a numerically "stiff" matrix problem, based on an implicit finite volume discretization of the compressible 2D Navier-Stokes/energy equations using primitive variables. Newton's method is used to linearize the discrete system, and a preconditioned Krylov projection technique is used to solve the resulting linear system. Domain decomposition enables the development of a global preconditioner via the parallel construction of contributions derived from subdomains. Formation of the global preconditioner is based upon additive and multiplicative Schwarz algorithms, with and without subdomain overlap. The degree of parallelism of this technique is further enhanced with the use of a matrix-free approximation for the Jacobian used in the Krylov technique (in this case, GMRES(k)). Of paramount interest to this study is the implementation and optimization of these techniques on parallel shared-memory hardware, namely the Cray C90 and SGI Challenge architectures. These architectures were chosen as representative and commonly available to researchers interested in the solution of problems of this type. The Newton-Krylov-Schwarz solution technique is increasingly being investigated for computational fluid dynamics (CFD) applications due to the advantages of full coupling of all variables and equations, rapid non-linear convergence, and moderate memory requirements. A parallel version of this method that scales effectively on the above architectures would be extremely attractive to practitioners, resulting in efficient, cost-effective, parallel solutions exhibiting the benefits of the solution technique.
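The matrix-free approximation mentioned above is commonly realized as a finite-difference directional derivative, J(u) v ≈ (F(u + εv) − F(u)) / ε, so that a Krylov solver only ever needs residual evaluations. The Python sketch below illustrates that standard formula under assumptions: the abstract does not give the exact form or the perturbation size used, and the function name and the fixed `eps` are illustrative choices.

```python
def jacobian_vector_product(F, u, v, eps=1e-7):
    """Approximate J(u) @ v without forming the Jacobian J.

    F maps a list of floats to a list of floats (the nonlinear residual).
    A first-order forward difference along direction v is used.
    """
    base = F(u)
    perturbed = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(p - b) / eps for p, b in zip(perturbed, base)]


def example_residual(u):
    """Small hypothetical residual with Jacobian [[2*u0, 1], [u1, u0]]."""
    return [u[0] ** 2 + u[1], u[0] * u[1]]
```

Because each product costs one extra residual evaluation and no matrix storage, the approach fits the moderate memory requirements noted in the abstract.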
Kumar, Sameer
2010-06-15
Disclosed is a mechanism on receiving processors in a parallel computing system for providing order to data packets received from a broadcast call and to distinguish data packets received at nodes from several incoming asynchronous broadcast messages where header space is limited. In the present invention, processors at lower leaves of a tree do not need to obtain a broadcast message by directly accessing the data in a root processor's buffer. Instead, each subsequent intermediate node's rank id information is squeezed into the software header of packet headers. In turn, the entire broadcast message is not transferred from the root processor to each processor in a communicator but instead is replicated on several intermediate nodes, which then replicate the message to nodes in lower leaves. Hence, the intermediate compute nodes become "virtual root compute nodes" for the purpose of replicating the broadcast message to lower levels of a tree.
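The replication pattern, in which intermediate nodes forward the message instead of every node reading from the root, can be illustrated with a small simulation. This Python sketch is only a toy model: the k-ary rank layout, the function names, and the absence of any packet headers or ordering logic are assumptions and do not reflect the disclosed mechanism in detail.

```python
def children(rank, size, fanout=2):
    """Ranks that a node forwards to in an assumed k-ary tree rooted at 0."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def tree_broadcast(message, size, fanout=2):
    """Simulate a broadcast in which each node relays to its children.

    Returns (received, sends): received maps rank -> message, and sends
    lists every (sender, receiver) transfer in the order performed.
    """
    received = {0: message}
    sends = []
    frontier = [0]
    while frontier:
        next_frontier = []
        for parent in frontier:
            for child in children(parent, size, fanout):
                # The child copies from its parent, not from the root.
                received[child] = received[parent]
                sends.append((parent, child))
                next_frontier.append(child)
        frontier = next_frontier
    return received, sends
```

With seven ranks and a fanout of two, the root performs only two of the six transfers; the rest are carried by intermediate nodes.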
International Nuclear Information System (INIS)
Woodruff, S.B.
1994-01-01
The Transient Reactor Analysis Code (TRAC), which features a two-fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local, the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, a fixed, uniform assignment of nodes to parallel processors will result in degraded computational efficiency due to the poor load balancing. A standard method for treating data-dependent models on vector architectures has been to use gather operations (or indirect addressing) to sort the nodes into subsets that (temporarily) share a common computational model. However, this method is not effective on distributed memory data parallel architectures, where indirect addressing involves expensive communication overhead. Another serious problem with this method involves software engineering challenges in the areas of maintainability and extensibility. For example, an implementation that was hand-tuned to achieve good computational efficiency would have to be rewritten whenever the decision tree governing the sorting was modified. Using an example based on the calculation of the wall-to-liquid and wall-to-vapor heat-transfer coefficients for three nonboiling flow regimes, we describe how the use of the Fortran 90 WHERE construct and automatic inlining of functions can be used to ameliorate this problem while improving both efficiency and software engineering. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. We discuss why developers should either wait for such solutions or consider alternative numerical algorithms, such as a neural network
Adaptive finite element simulation of flow and transport applications on parallel computers
Kirk, Benjamin Shelton
The subject of this work is the adaptive finite element simulation of problems arising in flow and transport applications on parallel computers. Of particular interest are new contributions to adaptive mesh refinement (AMR) in this parallel high-performance context, including novel work on data structures, treatment of constraints in a parallel setting, generality and extensibility via object-oriented programming, and the design/implementation of a flexible software framework. This technology and software capability then enables more robust, reliable treatment of multiscale--multiphysics problems and specific studies of fine scale interaction such as those in biological chemotaxis (Chapter 4) and high-speed shock physics for compressible flows (Chapter 5). The work begins by presenting an overview of key concepts and data structures employed in AMR simulations. Of particular interest is how these concepts are applied in the physics-independent software framework which is developed here and is the basis for all the numerical simulations performed in this work. This open-source software framework has been adopted by a number of researchers in the U.S. and abroad for use in a wide range of applications. The dynamic nature of adaptive simulations poses particular issues for efficient implementation on distributed-memory parallel architectures. Communication cost, computational load balance, and memory requirements must all be considered when developing adaptive software for this class of machines. Specific extensions to the adaptive data structures to enable implementation on parallel computers are therefore considered in detail. The libMesh framework for performing adaptive finite element simulations on parallel computers is developed to provide a concrete implementation of the above ideas. This physics-independent framework is applied to two distinct flow and transport application classes in the subsequent application studies to illustrate the flexibility of the
Near real-time digital holographic microscope based on GPU parallel computing
Zhu, Gang; Zhao, Zhixiong; Wang, Huarui; Yang, Yan
2018-01-01
A transmission near real-time digital holographic microscope with in-line and off-axis light paths is presented, in which parallel computing technology based on the compute unified device architecture (CUDA) and digital holographic microscopy are combined. Compared to other holographic microscopes, which have to implement reconstruction in multiple focal planes and are time-consuming, the reconstruction speed of the near real-time digital holographic microscope can be greatly improved with the parallel computing technology based on CUDA, so it is especially suitable for measurements of particle fields at the micrometer and nanometer scale. Simulations and experiments show that the proposed transmission digital holographic microscope can accurately measure and display the velocity of a particle field at the micrometer scale, and the average velocity error is lower than 10%. With graphics processing units (GPU), the computing time for 100 reconstruction planes (512×512 grids) is lower than 120 ms, while it is 4.9 s using the traditional reconstruction method on a CPU. The reconstruction speed has been raised by a factor of 40. In other words, it can handle holograms at 8.3 frames per second, and near real-time measurement and display of the particle velocity field are realized. Real-time three-dimensional reconstruction of the particle velocity field is expected to be achieved by further optimization of software and hardware. Keywords: digital holographic microscope,
Energy Technology Data Exchange (ETDEWEB)
Pinchedez, K
1999-06-01
Parallel computing meets the ever-increasing requirements for neutronic computer code speed and accuracy. In this work, two different approaches have been considered. We first parallelized the sequential algorithm used by the neutronics code CRONOS developed at the French Atomic Energy Commission. The algorithm computes the dominant eigenvalue associated with PN simplified transport equations by a mixed finite element method. Several parallel algorithms have been developed on distributed memory machines. The performances of the parallel algorithms have been studied experimentally by implementation on a T3D Cray and theoretically by complexity models. A comparison of various parallel algorithms has confirmed the chosen implementations. We next applied a domain sub-division technique to the two-group diffusion Eigen problem. In the modal synthesis-based method, the global spectrum is determined from the partial spectra associated with sub-domains. Then the Eigen problem is expanded on a family composed, on the one hand, from eigenfunctions associated with the sub-domains and, on the other hand, from functions corresponding to the contribution from the interface between the sub-domains. For a 2-D homogeneous core, this modal method has been validated and its accuracy has been measured. (author)
Computational cost estimates for parallel shared memory isogeometric multi-frontal solvers
Woźniak, Maciej
2014-06-01
In this paper we present computational cost estimates for parallel shared memory isogeometric multi-frontal solvers. The estimates show that the ideal isogeometric shared memory parallel direct solver scales as O(p^2 log(N/p)) for one dimensional problems, O(N p^2) for two dimensional problems, and O(N^(4/3) p^2) for three dimensional problems, where N is the number of degrees of freedom, and p is the polynomial order of approximation. The computational costs of the shared memory parallel isogeometric direct solver are compared with those corresponding to the sequential isogeometric direct solver, the latter being equal to O(N p^2) for the one dimensional case, O(N^1.5 p^3) for the two dimensional case, and O(N^2 p^3) for the three dimensional case. The shared memory version significantly reduces how the cost scales with both N and p. Theoretical estimates are compared with numerical experiments performed with linear, quadratic, cubic, quartic, and quintic B-splines, in one and two spatial dimensions. © 2014 Elsevier Ltd. All rights reserved.
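To get a feel for the estimates, the Python sketch below evaluates the leading-order terms, reading the abstract's exponents as p^2 log(N/p), N p^2 and N^(4/3) p^2 for the parallel solver and N p^2, N^1.5 p^3 and N^2 p^3 for the sequential one. The constants hidden by the O-notation are set to one and the logarithm is taken as natural; both are assumptions, so only ratios and trends are meaningful.

```python
import math


def sequential_cost(N, p, dim):
    """Leading-order sequential estimate (constant factor assumed to be 1)."""
    n_exp, p_exp = {1: (1.0, 2), 2: (1.5, 3), 3: (2.0, 3)}[dim]
    return N ** n_exp * p ** p_exp


def parallel_cost(N, p, dim):
    """Leading-order shared-memory parallel estimate (constant assumed 1)."""
    if dim == 1:
        return p ** 2 * math.log(N / p)
    if dim == 2:
        return N * p ** 2
    if dim == 3:
        return N ** (4.0 / 3.0) * p ** 2
    raise ValueError("dim must be 1, 2 or 3")


def ideal_speedup(N, p, dim):
    """Ratio of the two estimates; an asymptotic trend, not a prediction."""
    return sequential_cost(N, p, dim) / parallel_cost(N, p, dim)
```

In two dimensions the ratio reduces to sqrt(N) * p, so under these estimates the gain grows with both the problem size and the polynomial order.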
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Parallel calculations on shared memory, NUMA-based computers using MATLAB
Krotkiewski, Marcin; Dabrowski, Marcin
2014-05-01
Achieving satisfactory computational performance in numerical simulations on modern computer architectures can be a complex task. Multi-core design makes it necessary to parallelize the code. Efficient parallelization on NUMA (Non-Uniform Memory Access) shared memory architectures necessitates explicit placement of the data in the memory close to the CPU that uses it. In addition, using more than 8 CPUs (~100 cores) requires a cluster solution of interconnected nodes, which involves (expensive) communication between the processors. It takes significant effort to overcome these challenges even when programming in low-level languages, which give the programmer full control over data placement and work distribution. Instead, many modelers use high-level tools such as MATLAB, which severely limit the optimization/tuning options available. Nonetheless, the advantage of programming simplicity and a large available code base can tip the scale in favor of MATLAB. We investigate whether MATLAB can be used for efficient, parallel computations on modern shared memory architectures. A common approach to performance optimization of MATLAB programs is to identify a bottleneck and migrate the corresponding code block to a MEX file implemented in, e.g., C. Instead, we aim at achieving scalable parallel performance of MATLAB's core functionality. Some of MATLAB's internal functions (e.g., bsxfun, sort, BLAS3, operations on vectors) are multi-threaded. Achieving high parallel efficiency of those may potentially improve the performance of a significant portion of MATLAB's code base. Since we do not have MATLAB's source code, our performance tuning relies on the tools provided by the operating system alone. Most importantly, we use custom memory allocation routines, thread to CPU binding, and memory page migration. The performance tests are carried out on multi-socket shared memory systems (2- and 4-way Intel-based computers), as well as a Distributed Shared Memory machine with 96 CPU
An implementation of a tree code on a SIMD, parallel computer
Olson, Kevin M.; Dorband, John E.
1994-01-01
We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally interacting disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio of SIMD machines like the Maspar MP-1 makes them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
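The tree construction described in this abstract (sort, split in half along x, then y, then z, and treat each node as a monopole) can be sketched serially as follows. This is only an illustration under stated assumptions: the abstract gives no code, the names `build_tree` and `tree_depth` are hypothetical, and the sketch runs on one processor rather than on a SIMD machine with parallel sorts.

```python
def build_tree(particles, depth=0):
    """Build a balanced binary tree from a list of (mass, (x, y, z)) tuples.

    Assumption for illustration: the split axis cycles x, y, z with depth,
    and each node stores only monopole data (total mass, center of mass).
    """
    if len(particles) == 1:
        mass, pos = particles[0]
        return {"mass": mass, "com": pos, "children": None}
    axis = depth % 3
    # Sort along the current axis, then divide the sorted list in half.
    ordered = sorted(particles, key=lambda p: p[1][axis])
    half = len(ordered) // 2
    left = build_tree(ordered[:half], depth + 1)
    right = build_tree(ordered[half:], depth + 1)
    mass = left["mass"] + right["mass"]
    com = tuple(
        (left["mass"] * l + right["mass"] * r) / mass
        for l, r in zip(left["com"], right["com"])
    )
    return {"mass": mass, "com": com, "children": (left, right)}


def tree_depth(node):
    """Number of levels below this node (0 for a single particle)."""
    if node["children"] is None:
        return 0
    return 1 + max(tree_depth(child) for child in node["children"])


# Eight unit-mass particles on the corners of the unit cube.
corners = [
    (1.0, (float(x), float(y), float(z)))
    for x in (0, 1) for y in (0, 1) for z in (0, 1)
]
root = build_tree(corners)
```

With a particle count that is a power of two, every leaf ends up paired with exactly one other leaf, which matches the "completely balanced" property the abstract mentions.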
A parallel-vector algorithm for rapid structural analysis on high-performance computers
Storaasli, Olaf O.; Nguyen, Duc T.; Agarwal, Tarun K.
1990-01-01
A fast, accurate Choleski method for the solution of symmetric systems of linear equations is presented. This direct method is based on a variable-band storage scheme and takes advantage of column heights to reduce the number of operations in the Choleski factorization. The method employs parallel computation in the outermost DO-loop and vector computation via the loop unrolling technique in the innermost DO-loop. The method avoids computations with zeros outside the column heights and, as an option, zeros inside the band. The close relationship between Choleski and Gauss elimination methods is examined. The minor changes required to convert the Choleski code to a Gauss code to solve non-positive-definite symmetric systems of equations are identified. The results for two large-scale structural analyses performed on supercomputers demonstrate the accuracy and speed of the method.
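The idea of skipping zeros outside the column heights can be illustrated with a small profile (envelope) Cholesky factorization. This is a serial sketch under stated assumptions, not the authors' parallel-vector code: it keeps the matrix in dense lists for readability, uses the row profile of the lower triangle (equivalent by symmetry to the column heights of the upper triangle), and the name `profile_cholesky` is hypothetical.

```python
import math


def profile_cholesky(a):
    """Factor a symmetric positive-definite matrix as A = L * L^T.

    first[i] is the column of the first nonzero entry in row i of the lower
    triangle. All arithmetic is restricted to columns >= first[i], so zeros
    outside the profile are never touched.
    """
    n = len(a)
    first = [next(j for j in range(i + 1) if a[i][j] != 0.0) for i in range(n)]
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(first[i], i + 1):
            # Only columns inside both row profiles contribute to the sum.
            start = max(first[i], first[j])
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(start, j))
            if i == j:
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low, first


# Small tridiagonal test matrix (symmetric positive definite).
A = [
    [4.0, -1.0, 0.0, 0.0],
    [-1.0, 4.0, -1.0, 0.0],
    [0.0, -1.0, 4.0, -1.0],
    [0.0, 0.0, -1.0, 4.0],
]
L, first = profile_cholesky(A)
```

Fill-in during the factorization stays inside the profile, which is why the entries to the left of `first[i]` can be skipped safely.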
Proxy-equation paradigm: A strategy for massively parallel asynchronous computations
Mittal, Ankita; Girimaji, Sharath
2017-09-01
Massively parallel simulations of transport equation systems call for a paradigm change in algorithm development to achieve efficient scalability. Traditional approaches require time synchronization of processing elements (PEs), which severely restricts scalability. Relaxing the synchronization requirement introduces error and slows down convergence. In this paper, we propose and develop a novel "proxy equation" concept for a general transport equation that (i) tolerates asynchrony with minimal added error, (ii) preserves convergence order and thus (iii) is expected to scale efficiently on massively parallel machines. The central idea is to modify a priori the transport equation at the PE boundaries to offset asynchrony errors. Proof-of-concept computations are performed using a one-dimensional advection (convection) diffusion equation. The results demonstrate the promise and advantages of the present strategy.
An FPGA-Based Quantum Computing Emulation Framework Based on Serial-Parallel Architecture
Directory of Open Access Journals (Sweden)
Y. H. Lee
2016-01-01
Hardware emulation of quantum systems can mimic more efficiently the parallel behaviour of quantum computations, thus allowing higher processing speed-up than software simulations. In this paper, an efficient hardware emulation method that employs a serial-parallel hardware architecture targeted for field programmable gate array (FPGA) is proposed. Quantum Fourier transform and Grover’s search are chosen as case studies in this work since they are the core of many useful quantum algorithms. Experimental work shows that, with the proposed emulation architecture, a linear reduction in resource utilization is attained against the pipeline implementations proposed in prior works. The proposed work contributes to the formulation of a proof-of-concept baseline FPGA emulation framework with optimization on datapath designs that can be extended to emulate practical large-scale quantum circuits.
High performance parallel computing of flows in complex geometries: II. Applications
International Nuclear Information System (INIS)
Gourdain, N; Gicquel, L; Staffelbach, G; Vermorel, O; Duchaine, F; Boussuge, J-F; Poinsot, T
2009-01-01
Present regulations on pollutant emissions and noise, together with economic constraints, require new approaches and designs in the fields of energy supply and transportation. It is now well established that the next breakthrough will come from a better understanding of unsteady flow effects and from considering the entire system and not only isolated components. However, these aspects are still not well taken into account by the numerical approaches, or well understood, whatever the design stage considered. The main challenge stems from the computational requirements that such complex systems impose if they are to be simulated on supercomputers. This paper shows how new challenges can be addressed by using parallel computing platforms for distinct elements of a more complex system, as encountered in aeronautical applications. Based on numerical simulations performed with modern aerodynamic and reactive flow solvers, this work underlines the value of high-performance computing for solving flow in complex industrial configurations such as aircraft, combustion chambers and turbomachines. Performance indicators related to parallel computing efficiency are presented, showing that establishing fair criteria is a difficult task for complex industrial applications. Examples of numerical simulations performed in industrial systems are also described, with particular attention to the computational time and the potential design improvements obtained with high-fidelity and multi-physics computing methods. These simulations use either unsteady Reynolds-averaged Navier-Stokes methods or large eddy simulation and deal with turbulent unsteady flows, such as coupled flow phenomena (thermo-acoustic instabilities, buffet, etc). Some examples of the difficulties with grid generation and data analysis when dealing with these complex industrial applications are also presented.
Computational cost of isogeometric multi-frontal solvers on parallel distributed memory machines
Woźniak, Maciej
2015-02-01
This paper derives theoretical estimates of the computational cost for an isogeometric multi-frontal direct solver executed on parallel distributed memory machines. We show theoretically that for the C^{p-1} global continuity of the isogeometric solution, both the computational cost and the communication cost of a direct solver are of order O(log(N) p^2) for the one-dimensional (1D) case, O(N p^2) for the two-dimensional (2D) case, and O(N^{4/3} p^2) for the three-dimensional (3D) case, where N is the number of degrees of freedom and p is the polynomial order of the B-spline basis functions. The theoretical estimates are verified by numerical experiments performed with three parallel multi-frontal direct solvers: MUMPS, PaStiX and SuperLU, available through the PETIGA toolkit built on top of PETSc. Numerical results confirm these theoretical estimates both in terms of p and N. For a given problem size, the strong efficiency rapidly decreases as the number of processors increases, becoming about 20% for 256 processors for a 3D example with 128^3 unknowns and linear B-splines with C^0 global continuity, and 15% for a 3D example with 64^3 unknowns and quartic B-splines with C^3 global continuity. At the same time, one cannot arbitrarily increase the problem size, since the memory required by higher order continuity spaces is large, quickly consuming all the available memory resources even in the parallel distributed memory version. Numerical results also suggest that the use of distributed parallel machines is highly beneficial when solving higher order continuity spaces, although the number of processors that one can efficiently employ is somewhat limited.
Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing
Amooie, M. A.; Moortgat, J.
2017-12-01
We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with a fast quad-core 1.2GHz ARMv8 64-bit processor, 1GB of RAM, and a 32GB microSD card for local storage. The cluster therefore has a total of 128GB of RAM distributed over the individual nodes, a flash capacity of 4TB, and 512 processor cores, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between nodes. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively parallelized scalable code. We present benchmarking results for the computational performance across various numbers of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and feasible learning platform for challenging engineering and scientific problems.
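The aggregate figures quoted in this abstract follow directly from the per-node specification. The short check below reproduces them; the variable names are illustrative only, and the 4TB figure assumes binary units (1TB = 1024GB).

```python
# Per-node figures taken from the abstract.
nodes = 128
cores_per_node = 4        # quad-core processor
ram_gb_per_node = 1
flash_gb_per_node = 32

# Cluster-wide totals.
total_cores = nodes * cores_per_node
total_ram_gb = nodes * ram_gb_per_node
# Assumption: binary units, 1 TB = 1024 GB.
total_flash_tb = nodes * flash_gb_per_node / 1024
```

The "512 processors" in the abstract therefore corresponds to 512 cores across 128 quad-core boards.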
Paralelno umrežavanje računara / Parallel networking of the computers
Directory of Open Access Journals (Sweden)
Milojko Jevtović
2007-04-01
In this paper, an original concept for the parallel networking of computers and of local area networks (LANs), that is, their connection and simultaneous communication over several different transport telecommunications networks, is presented. A parallel networking solution is described that enables reliable transfer of multimedia traffic and real-time data transmission between computers or LANs simultaneously over N (N = 1, 2, 3, 4, ...) different, mutually independent wide area networks (WANs). Parallel networking is based on the use of a universal modem, whose design is also briefly presented.
Honkonen, I.
2015-03-01
I present a method for developing extensible and modular computational models without sacrificing serial or parallel performance or source code readability. By using a generic simulation cell method I show that it is possible to combine several distinct computational models to run in the same computational grid without requiring modification of existing code. This is an advantage for the development and testing of, e.g., geoscientific software as each submodel can be developed and tested independently and subsequently used without modification in a more complex coupled program. An implementation of the generic simulation cell method presented here, generic simulation cell class (gensimcell), also includes support for parallel programming by allowing model developers to select which simulation variables of, e.g., a domain-decomposed model to transfer between processes via a Message Passing Interface (MPI) library. This allows the communication strategy of a program to be formalized by explicitly stating which variables must be transferred between processes for the correct functionality of each submodel and the entire program. The generic simulation cell class requires a C++ compiler that supports a version of the language standardized in 2011 (C++11). The code is available at https://github.com/nasailja/gensimcell for everyone to use, study, modify and redistribute; those who do are kindly requested to acknowledge and cite this work.
Real-time processing of radar return on a parallel computer
Aalfs, David D.
1992-01-01
NASA is working with the FAA to demonstrate the feasibility of pulse Doppler radar as a candidate airborne sensor to detect low altitude windshears. The need to provide the pilot with timely information about possible hazards has motivated a demand for real-time processing of a radar return. Investigated here is parallel processing as a means of accommodating the high data rates required. A PC based parallel computer, called the transputer, is used to investigate issues in real time concurrent processing of radar signals. A transputer network is made up of an array of single instruction stream processors that can be networked in a variety of ways. They are easily reconfigured and software development is largely independent of the particular network topology. The performance of the transputer is evaluated in light of the computational requirements. A number of algorithms have been implemented on the transputers in OCCAM, a language specially designed for parallel processing. These include signal processing algorithms such as the Fast Fourier Transform (FFT), pulse-pair, and autoregressive modelling, as well as routing software to support concurrency. The most computationally intensive task is estimating the spectrum. Two approaches have been taken on this problem, the first and most conventional of which is to use the FFT. By using table look-ups for the basis function and other optimizing techniques, an algorithm has been developed that is sufficient for real time. The other approach is to model the signal as an autoregressive process and estimate the spectrum based on the model coefficients. This technique is attractive because it does not suffer from the spectral leakage problem inherent in the FFT. Benchmark tests indicate that autoregressive modeling is feasible in real time.
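The pulse-pair technique named in this abstract estimates the mean Doppler frequency from the phase of the lag-one autocorrelation of the complex radar samples. The sketch below uses the standard textbook form of that estimator and is written in Python rather than OCCAM; the abstract does not give the exact formulation used on the transputers, so the function name, the sample layout, and the synthetic test signal are all assumptions.

```python
import cmath
import math


def pulse_pair_frequency(samples, prt):
    """Estimate the mean Doppler frequency in Hz.

    samples: complex (I/Q) radar returns from one range gate, one per pulse.
    prt: pulse repetition time in seconds.
    """
    # Lag-one autocorrelation, summed over all consecutive pulse pairs.
    r1 = sum(
        samples[n].conjugate() * samples[n + 1]
        for n in range(len(samples) - 1)
    )
    # The phase advance per pulse maps to frequency through the PRT.
    return cmath.phase(r1) / (2.0 * math.pi * prt)


# Synthetic check: a pure 100 Hz tone sampled at a 1 ms pulse repetition time.
prt = 1e-3
tone = [cmath.exp(2j * math.pi * 100.0 * n * prt) for n in range(64)]
estimate = pulse_pair_frequency(tone, prt)
```

Because only one complex multiply-accumulate per pulse is needed, this estimator is far cheaper than a full spectrum estimate, which is consistent with the abstract singling out spectrum estimation as the most expensive task.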
Using an Open-Source Grid Framework and Virtualization for Embarrassingly Parallel Computations
Freeman, S.; Qu, Y.; Boller, R. A.; Yang, C.; Wojcik, G. S.; Bambacus, M.; Cahalan, R. F.
2010-12-01
In this talk, we present the overall architecture and ideas driving the Climate@Home project. Building on the success of ClimatePrediction.net and SETI@Home, Climate@Home enlists citizen volunteers and NASA’s cloud services to donate their spare compute cycles to run the GISS ModelE climate model in a massively parallel fashion. NASA’s cloud services and project data management support will come from NASA’s Infrastructure as a Service (IaaS), powered by Nebula(TM). By using these donated cycles, scientists are able to run the model with significantly more configurations to help better determine the model’s sensitivity to its input parameters. Our architecture differs from the widely used BOINC platform (as used by ClimatePrediction.net and SETI@Home) in its use of open source virtualization technology, VirtualBox. By providing a standardized and highly controlled environment, we can ensure bit-wise reproducibility of computational results across heterogeneous platforms, helping to maintain the scientific integrity of the model experiments. While the initial experiments focus on climate, the system is designed to be flexible enough to apply to any “embarrassingly parallel,” computationally intensive tasks via the Java Parallel Processing Framework (JPPF) open source grid framework. The system provides a simple, generic configuration interface for scientists to create experiments and submit them to the grid. It also provides a pluggable architecture, allowing different projects to provide their own visualization capabilities. Finally, metrics and management features provide users with the ability to control and track contributions to their selected projects.
Cho, In Ho
For the last few decades, we have obtained tremendous insight into the underlying microscopic mechanisms of degrading quasi-brittle materials from persistent and near-saintly efforts in laboratories, and at the same time we have seen unprecedented evolution in computational technology such as massively parallel computers. Thus, the time is ripe to embark on a novel approach to settle unanswered questions, especially for the earthquake engineering community, by harmoniously combining the microphysics mechanisms with advanced parallel computing technology. To begin with, it should be stressed that we placed a great deal of emphasis on preserving the clear meaning and physical counterparts of all the microscopic material models proposed herein, since it is directly tied to the belief that the more physical mechanisms we incorporate, the better the prediction we can obtain. We began by reviewing representative microscopic analysis methodologies, selecting the "fixed-type" multidirectional smeared crack model as the base framework for nonlinear quasi-brittle materials, since it is widely believed to best retain the physical nature of actual cracks. Microscopic stress functions are proposed by integrating well-received existing models to update normal stresses on the crack surfaces (three orthogonal surfaces are allowed to initiate herein) under cyclic loading. Unlike the normal stress update, special attention had to be paid to the shear stress update on the crack surfaces, due primarily to the well-known pathological nature of the fixed-type smeared crack model: spurious large stress transfer over the open crack under nonproportional loading. In hopes of exploiting a physical mechanism to resolve this deleterious nature of the fixed crack model, a tribology-inspired three-dimensional (3D) interlocking mechanism has been proposed. Following the main trend of tribology (i.e., the science and engineering of interacting surfaces), we introduced the base fabric of solid
Energy Technology Data Exchange (ETDEWEB)
Osei-Kuffuor, Daniel [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Fattebert, Jean-Luc [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2014-01-01
We present the first truly scalable first-principles molecular dynamics algorithm with O(N) complexity and controllable accuracy, capable of simulating systems with finite band gaps of sizes that were previously impossible with this degree of accuracy. By avoiding global communications, we provide a practical computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic wave functions are confined, and a cutoff beyond which the components of the overlap matrix can be omitted when computing selected elements of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to 101 952 atoms on 23 328 processors, with a wall-clock time of the order of 1 min per molecular dynamics time step and numerical error on the forces of less than 7x10^{-4} Ha/Bohr.
Techniques and environments for big data analysis parallel, cloud, and grid computing
Dehuri, Satchidananda; Kim, Euiwhan; Wang, Gi-Name
2016-01-01
This volume is aimed at a wide range of readers and researchers in the area of Big Data, presenting recent advances in Big Data Analysis as well as the techniques and tools used to analyze it. The book includes 10 distinct chapters providing a concise introduction to Big Data Analysis and to recent techniques and environments for Big Data Analysis. It gives insight into how the expensive fitness evaluation of evolutionary learning can play a vital role in big data analysis by adopting Parallel, Grid, and Cloud computing environments.
Fractional time stepping for unsteady engineering calculations on parallel computer systems
Molev, Sergey; Podaruev, Vladimir; Troshin, Alexey
2017-11-01
A tool for accelerating explicit schemes is described. Its essence is a reduction in the number of arithmetic operations. The cells of the mesh are distributed into groups called levels, each with its own time step, and the levels are coordinated with one another. The method may be useful for aerodynamics problems with widely separated time scales. Causes of deterioration in the modelling of unsteady processes are identified, and remedies are proposed. An example demonstrates the conditions under which the problems arise and their successful elimination. The limit of the achievable acceleration is indicated. Measures that favor effective parallel computing with the method are discussed.
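The grouping of cells into levels with their own time steps can be sketched as below. The abstract does not say how the level time steps are related, so this sketch assumes the common choice that level k advances with 2^k times the smallest stable step; the function names and the sample cell data are hypothetical.

```python
def assign_levels(cell_dts, max_level):
    """Assign each cell to a time-step level.

    cell_dts: locally stable time step of each cell.
    Assumption: level k advances with dt_min * 2**k, and a cell gets the
    largest level whose step does not exceed its own stable step.
    """
    dt_min = min(cell_dts)
    levels = []
    for dt in cell_dts:
        level = 0
        while level < max_level and dt_min * 2 ** (level + 1) <= dt:
            level += 1
        levels.append(level)
    return dt_min, levels


def update_counts(levels):
    """Cell updates needed to advance by one step of the coarsest level.

    Returns (with levels, with a single global step dt_min).
    """
    top = max(levels)
    multi = sum(2 ** (top - k) for k in levels)
    uniform = len(levels) * 2 ** top
    return multi, uniform


cell_dts = [1.0, 1.0, 2.5, 4.0, 9.0]
dt_min, levels = assign_levels(cell_dts, max_level=3)
multi, uniform = update_counts(levels)
```

In this toy case the level-based scheme needs 23 cell updates where a single global time step needs 40, which illustrates where the reduction in arithmetic operations comes from; the achievable gain is bounded by how many cells sit on the finest level.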
Kantor, A. V.; Perevertkin, S. M.; Shcherbakova, T. S.
1974-01-01
The method of parallel adaptive discretization of data is considered the most promising, and it allows effective compression algorithms to be used for high information-capacity radio telemetry systems. Associative computer devices (ACDs), i.e., parallel computers based on associative memory units (AMUs), are recommended for the realization of this method. A detailed discussion of the problems of applying AMUs is followed by a description of a particular ACD and its recommended use.
Unified parallel C and the computing needs of Sandia National Laboratories.
Energy Technology Data Exchange (ETDEWEB)
Brown, Jonathan Leighton; Wen, Zhaofang
2004-09-01
As Sandia looks toward petaflops computing and other advanced architectures, it is necessary to provide a programming environment that can exploit this additional computing power while supporting reasonable development time for applications. Thus, they evaluate the Partitioned Global Address Space (PGAS) programming model as implemented in Unified Parallel C (UPC) for its applicability. They report on their experiences in implementing sorting and minimum spanning tree algorithms on a test system, a Cray T3E, with UPC support. They describe several macros that could serve as language extensions and several building-block operations that could serve as a foundation for a PGAS programming library. They analyze the limitations of the UPC implementation available on the test system, and suggest improvements necessary before UPC can be used in a production environment.
Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing
Directory of Open Access Journals (Sweden)
Cordes Ben
2009-01-01
High-performance reconfigurable computing (HPRC) is a novel approach to provide large-scale computing power to modern scientific applications. Using both general-purpose processors and FPGAs allows application designers to exploit fine-grained and coarse-grained parallelism, achieving high degrees of speedup. One scientific application that benefits from this technique is backprojection, an image formation algorithm that can be used as part of a synthetic aperture radar (SAR) processing system. We present an implementation of backprojection for SAR on an HPRC system. Using simulated data taken at a variety of ranges, our implementation runs over 200 times faster than a similar software program, with an overall application speedup better than 50x. The backprojection application is easily parallelizable, achieving near-linear speedup when run on multiple nodes of a clustered HPRC system. The results presented can be applied to other systems and other algorithms with similar characteristics.
A result-driven minimum blocking method for PageRank parallel computing
Tao, Wan; Liu, Tao; Yu, Wei; Huang, Gan
2017-01-01
Matrix blocking is a common method for improving the computational efficiency of PageRank, but the blocking rules are hard to determine and the subsequent calculation is complicated. To tackle these problems, we propose a minimum blocking method driven by the needs of the result to accomplish a parallel implementation of the PageRank algorithm. The minimum blocking stores only the elements that are necessary for the result matrix. In return, the subsequent calculation becomes simple and the cost of I/O transmission is reduced. We run experiments on several matrices of different sizes and degrees of sparsity. The results show that the proposed method has better computational efficiency than traditional blocking methods.
Final Report, DE-FG01-06ER25718 Domain Decomposition and Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Widlund, Olof B. [New York Univ. (NYU), NY (United States). Courant Inst.
2015-06-09
The goal of this project is to develop and improve domain decomposition algorithms for a variety of partial differential equations such as those of linear elasticity and electromagnetics. These iterative methods are designed for massively parallel computing systems and allow the fast solution of the very large systems of algebraic equations that arise in large-scale and complicated simulations. A special emphasis is placed on problems arising from Maxwell's equations. The approximate solvers, the preconditioners, are combined with the conjugate gradient method and must always include a solver of a coarse model in order to have a performance which is independent of the number of processors used in the computer simulation. A recent development allows for an adaptive construction of this coarse component of the preconditioner.
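How a preconditioner is combined with the conjugate gradient method can be sketched as follows. To keep the example short, a simple diagonal (Jacobi) preconditioner stands in for the domain decomposition preconditioners of the project; it has no coarse solver and is not what the report develops. The function names and the small test system are assumptions for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(matvec, b, precond, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for a symmetric positive-definite system.

    matvec(v) returns A*v; precond(r) returns an approximation to A^{-1}*r.
    Returns the solution and the number of iterations performed.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A*x for the zero initial guess
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            return x, iteration
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = precond(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]         # equals A times [1, 2, 3]


def matvec(v):
    return [dot(row, v) for row in A]


def jacobi(r):
    # Stand-in preconditioner: divide by the diagonal of A.
    return [ri / A[i][i] for i, ri in enumerate(r)]


solution, iterations = pcg(matvec, b, jacobi)
```

Only the `precond` argument would change when a stronger preconditioner is used; the surrounding iteration stays the same.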
The Optimization of a Shaped-Charge Design Using Parallel Computers
Energy Technology Data Exchange (ETDEWEB)
GARDNER,DAVID R.; VAUGHAN,COURTENAY T.
1999-11-01
Current supercomputers use large parallel arrays of tightly coupled processors to achieve levels of performance far surpassing conventional vector supercomputers. Shock-wave physics codes have been developed for these new supercomputers at Sandia National Laboratories and elsewhere. These parallel codes run fast enough on many simulations to consider using them to study the effects of varying design parameters on the performance of models of conventional munitions and other complex systems. Such studies may be directed by optimization software to improve the performance of the modeled system. Using a shaped-charge jet design as an archetypal test case and the CTH parallel shock-wave physics code controlled by the Dakota optimization software, we explored the use of automatic optimization tools to optimize the design for conventional munitions. We used a scheme in which a lower resolution computational mesh was used to identify candidate optimal solutions and then these were verified using a higher resolution mesh. We identified three optimal solutions for the model and a region of the design domain where the jet tip speed is nearly optimal, indicating the possibility of a robust design. Based on this study we identified some of the difficulties in using high-fidelity models with optimization software to develop improved designs. These include developing robust algorithms for the objective function and constraints and mitigating the effects of numerical noise in them. We conclude that optimization software running high-fidelity models of physical systems using parallel shock-wave physics codes to find improved designs can be a valuable tool for designers. While the current state of algorithm and software development does not permit routine, "black box" optimization of designs, the effort involved in using the existing tools may well be worth the improvement achieved in designs.
Computer architecture evaluation for structural dynamics computations: Project summary
Standley, Hilda M.
1989-01-01
The intent of the proposed effort is the examination of the impact of the elements of parallel architectures on the performance realized in a parallel computation. To this end, three major projects are developed: a language for the expression of high level parallelism, a statistical technique for the synthesis of multicomputer interconnection networks based upon performance prediction, and a queueing model for the analysis of shared memory hierarchies.
Application of the parallel processing computer to a nuclear disaster prevention support system
International Nuclear Information System (INIS)
Shigehiro, Nukatsuka; Osami, Watanabe
2003-01-01
In a nuclear emergency, it is important to identify the type and the cause of the accident. Beyond this, it is also important to provide adequate information to the emergency response organization to support decision making by predicting and evaluating the development of the event and the influence of the release of radioactivity on the environment. Recently, a new type of nuclear disaster prevention support system called MEASURES (Multiple Radiological Emergency Assistance System for Urgent Response) was developed, which provides not only the current state of the nuclear power plant and the influence of the radioactivity on the environment, but also a prediction of the future development of the accident. In order to provide accurate results from these analyses quickly, MEASURES utilizes various techniques, such as a multiple nesting method that narrows down the calculation area gradually, and a parallel processing computer for three-dimensional analyses such as air current distribution analysis. In this paper, the outline and the features of MEASURES are presented, with a focus on the use of the parallel processing computer for the three-dimensional air current distribution analysis. (authors)
Trajectory Tracking of a Planer Parallel Manipulator by Using Computed Force Control Method
Bayram, Atilla
2017-03-01
Despite their small workspace, parallel manipulators have some advantages over their serial counterparts in terms of higher speed, acceleration, rigidity, accuracy, manufacturing cost and payload. Accordingly, this type of manipulator can be used in many applications such as high-speed machine tools, tuning machines for feeding, sensitive cutting, assembly and packaging. This paper presents a special type of planar parallel manipulator with three degrees of freedom. It is constructed as a variable geometry truss, generally known as a planar Stewart platform. The reachable and orientation workspaces are obtained for this manipulator. The inverse kinematic analysis is solved for trajectory tracking, taking redundancy and joint limit avoidance into account. Then, the dynamics model of the manipulator is established by using the Virtual Work method. Simulations are performed to follow the given planar trajectories by using the dynamic equations of the variable geometry truss manipulator and the computed force control method. In the computed force control method, the feedback gain matrices for PD control are tuned with fixed matrices by trial and error and with variable ones by means of optimization with a genetic algorithm.
Processing optimization with parallel computing for the J-PET scanner
Directory of Open Access Journals (Sweden)
Krzemień Wojciech
2015-12-01
The Jagiellonian Positron Emission Tomograph (J-PET) collaboration is developing a prototype time of flight (TOF)-positron emission tomograph (PET) detector based on long polymer scintillators. This novel approach exploits the excellent time properties of the plastic scintillators, which permit very precise time measurements. The very fast field programmable gate array (FPGA)-based front-end electronics and the data acquisition system, as well as low- and high-level reconstruction algorithms, were specially developed to be used with the J-PET scanner. The TOF-PET data processing and reconstruction are time- and resource-demanding operations, especially in the case of a large acceptance detector that works in triggerless data acquisition mode. In this article, we discuss the parallel computing methods applied to optimize the data processing for the J-PET detector. We begin with general concepts of parallel computing and then we discuss several applications of those techniques in the J-PET data processing.
Directory of Open Access Journals (Sweden)
Erik G. Boman
2012-01-01
Partitioning and load balancing are important problems in scientific computing that can be modeled as combinatorial problems using graphs or hypergraphs. The Zoltan toolkit was developed primarily for partitioning and load balancing to support dynamic parallel applications, but has expanded to support other problems in combinatorial scientific computing, including matrix ordering and graph coloring. Zoltan is based on abstract user interfaces and uses callback functions. To simplify the use and integration of Zoltan with other matrix-based frameworks, such as the ones in Trilinos, we developed Isorropia as a Trilinos package, which supports most of Zoltan's features via a matrix-based interface. In addition to providing an easy-to-use matrix-based interface to Zoltan, Isorropia also serves as a platform for additional matrix algorithms. In this paper, we give an overview of the Zoltan and Isorropia toolkits, their design, capabilities and use. We also show how Zoltan and Isorropia enable large-scale, parallel scientific simulations, and describe current and future development in the next-generation package Zoltan2.
Efficient Method for Parallel Process and Matching of Large Data set in Grid Computing Environment
Directory of Open Access Journals (Sweden)
E. Sankar
2014-09-01
Data management is one of the challenging issues in grid computing environments, because grid computing systems and their applications deal with huge data sets held on heterogeneous grid resources that belong to different organizations in various locations with many access policies. To realize the potential of these vast distributed resources, effective scheduling algorithms are important. Task scheduling is the mapping of tasks to a selected group of resources, which may be distributed across different administrative domains. Here, the parallel processing of the distributed system is driven by grid scheduling algorithms. A genetic algorithm, which is one type of scheduling algorithm, is used to schedule tasks onto the various resources that work in parallel in the distributed system. Basically, a Grid scheduler receives applications from Grid users, selects sufficient resources for these applications according to information acquired from the Grid Information Service module, and finally generates application-to-resource mappings based on certain objective functions and predicted resource performance. Information about the status of available resources is very important for a Grid scheduler to schedule properly, particularly when the heterogeneous and self-motivated nature of the Grid is taken into account. The function of the Grid Information Service is to provide such information to Grid schedulers.
A study on optimal task decomposition of networked parallel computing using PVM
International Nuclear Information System (INIS)
Seong, Kwan Jae; Kim, Han Gyoo
1998-01-01
A numerical study is performed to investigate the effect of task decomposition on networked parallel processes using Parallel Virtual Machine (PVM). In our study, a PVM program distributed over a network of workstations is used to solve a finite difference version of the one-dimensional heat equation, for which the natural choice of PVM programming structure is the master-slave paradigm. The aim is to find an optimal configuration that gives the least computing time, including communication overhead among machines. Given a set of PVM tasks comprising one master and five slave programs, it is found that there exists a pseudo-optimal number of machines, which does not necessarily coincide with the number of tasks, that yields the best performance when the network is lightly used. Increasing the number of machines beyond this optimum does not improve computing performance, since the increase in communication overhead among the excess machines offsets the decrease in CPU time obtained by distributing the PVM tasks among them. However, when the network traffic is heavy, the results exhibit a more random characteristic that is explained by the random nature of data transfer time.
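The trade-off described in this abstract, where CPU time falls as tasks are spread over more machines while communication overhead rises, can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the functions `run_time` and `best_machine_count`, the linear communication term, and all timing values are hypothetical and are not taken from the study.

```python
import math


def run_time(machines, tasks, t_task, t_comm):
    """Toy wall-clock model (hypothetical, not from the study).

    Slave tasks are dealt out evenly, so the busiest machine runs
    ceil(tasks / machines) tasks one after another. Communication cost is
    assumed to grow linearly with the number of extra machines.
    """
    rounds = math.ceil(tasks / machines)
    return rounds * t_task + t_comm * (machines - 1)


def best_machine_count(tasks, t_task, t_comm):
    """Machine count (1..tasks) with the smallest modelled run time."""
    return min(range(1, tasks + 1),
               key=lambda p: run_time(p, tasks, t_task, t_comm))


if __name__ == "__main__":
    # Five slave tasks, as in the abstract; the timings are made-up units.
    for comm in (1.0, 6.0):
        best = best_machine_count(5, 10.0, comm)
        print("comm cost", comm, "-> best machine count", best)
```

With cheap communication the model prefers one machine per task, while with costlier communication the minimum moves to fewer machines than tasks, which mirrors the qualitative finding reported above.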
Large-Scale, Parallel, Multi-Sensor Atmospheric Data Fusion Using Cloud Computing
Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E. J.
2013-12-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the 'A-Train' platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (MERRA), stratify the comparisons using a classification of the 'cloud scenes' from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Figure 1 shows the architecture of the full computational system, with SciReduce at the core. Multi-year datasets are automatically 'sharded' by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will
Li, Chuan; Petukh, Marharyta; Li, Lin; Alexov, Emil
2013-08-15
Due to the enormous importance of electrostatics in molecular biology, calculating the electrostatic potential and corresponding energies has become a standard computational approach for the study of biomolecules and nano-objects immersed in water and salt phase or other media. However, the electrostatics of large macromolecules and macromolecular complexes, including nano-objects, may not be obtainable via explicit methods, and even the standard continuum electrostatics methods may not be applicable due to high computational time and memory requirements. Here, we report further development of the parallelization scheme reported in our previous work (Li, et al., J. Comput. Chem. 2012, 33, 1960) to include parallelization of the molecular surface and energy calculation components of the algorithm. The parallelization scheme utilizes different approaches such as space domain parallelization, algorithmic parallelization, multithreading, and task scheduling, depending on the quantity being calculated. This allows for efficient use of the computing resources of the corresponding computer cluster. The parallelization scheme is implemented in the popular software DelPhi and results in a severalfold speedup. As a demonstration of the efficiency and capability of this methodology, the electrostatic potential and electric field distributions are calculated for the bovine mitochondrial supercomplex, illustrating their complex topology, which cannot be obtained by modeling the supercomplex components alone. Copyright © 2013 Wiley Periodicals, Inc.
A COMPUTER CLUSTER SYSTEM FOR PSEUDO-PARALLEL EXECUTION OF GEANT4 SERIAL APPLICATION
Directory of Open Access Journals (Sweden)
Memmo Federici
2013-12-01
Simulation of the interactions between particles and matter in studies for developing X-ray detectors generally requires very long calculation times (up to several days or weeks). These times are often a serious limitation for the success of the simulations and for the accuracy of the simulated models. One of the tools used by the scientific community to perform these simulations is Geant4 (Geometry And Tracking) [2, 3]. Building on the experience gained in designing the AVES cluster computing system, Federici et al. [1], the IAPS (Istituto di Astrofisica e Planetologia Spaziali) INAF laboratories developed a cluster computer system dedicated to Geant4. The Cluster is easy to use and easily expandable, and thanks to the design criteria adopted it achieves an excellent compromise between performance and cost. The management software developed for the Cluster splits a single simulation instance across the available cores, allowing software written for serial computation to reach a computing speed similar to that of natively parallel software. The simulations carried out on the Cluster showed a reduction in execution time by a factor of 20 to 60 compared with the times obtained on a single mid-range PC.
Harcke, L. J.; Zebker, H. A.
2006-12-01
We report on experiences in processing repeat-orbit interferometry data sets on a mid-scale multiprocessor mainframe computer. Newer applications of interferometric and polarimetric data processing, such as permanent scatterer deformation monitoring, require the generation of many tens of repeat-pass interferometry data pairs, perhaps 30 to 50, to provide sufficient input to the deformation model. Moving existing radar processing techniques toward massively parallel computation provides a path to coping with such large data sets, which can consist of 30 to 50 gigabytes (GB) of raw data. In June 2006, the Stanford School of Earth Sciences dedicated a new computation center for general research use. Two large machines compose the center: a single-node, symmetric multiprocessor (SMP) machine with 48 processor cores and a single 192 GB memory, and a 64-node distributed cluster containing 128 processor cores with at least 2 GB of memory per node. Distributed processing of the matched filter for synthetic aperture radar image formation requires a high communication-to-computation ratio. Experiments performed over a decade ago on distributed memory supercomputers, and repeated a half-decade ago on commodity workstation clusters, both demonstrated saturation of inter-node communication links. For this reason, we chose to parallelize the interferometric processor on the shared memory computer using the OpenMP programming standard. We find, not unexpectedly, that the input/output stage of processing standard 100-by-100 kilometer ERS-1 scenes quickly dominates the total computation time, and that only modest reductions in processing time are achieved once 8 to 16 processor cores are brought to bear on a single data set. The input and output data sit in single, serially accessed disk files, creating a bottleneck for overall throughput. This points to a scheme for efficient partitioning of mid-size (24 to 48 core) machines for reducing large Earth science data sets, where 3 to
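The saturation reported in this abstract, where serially accessed input/output limits the gain from extra cores, is the behaviour that Amdahl's law predicts. The sketch below is an illustration only: the function name `amdahl_speedup` and the 20% serial fraction are assumptions chosen for the example, not measurements from the study.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: speedup when a fixed fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Hypothetical case: 20% of the run time is serial disk I/O.
    for cores in (1, 8, 16, 48):
        print(cores, "cores ->", round(amdahl_speedup(0.2, cores), 2))
```

Under this assumed serial fraction the speedup can never exceed 5, so going from 16 to 48 cores adds little. That is consistent with the idea of partitioning a mid-size machine among several data sets rather than dedicating all cores to one.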
Computational Fluid Dynamic Pressure Drop Estimation of Flow between Parallel Plates
International Nuclear Information System (INIS)
Son, Hyung Min; Yang, Soo Hyung; Park, Jong Hark
2014-01-01
Many pool type reactors have forced downward flows inside the core during normal operation; there is a chance of flow inversion when transients occur. During this phase, the flow undergoes a transition between the turbulent and laminar regimes, where drastic changes take place in momentum and heat transfer, and a decrease in safety margin is usually observed. Additionally, for high Prandtl number fluids such as water, the effect of the velocity profile inside the channel on the temperature distribution is more pronounced than for low Prandtl number fluids. This makes the checking of its pressure drop estimation accuracy less important, assuming the code verification is complete. With the advent of powerful computer hardware, engineering applications of computational fluid dynamics (CFD) methods have become quite common. Especially for fully turbulent, single-phase convective heat transfer, the predictive capability of commercial codes has matured enough that many well-known companies adopt them to accelerate product development cycles and increase profitability. In contrast, transition models for CFD codes are still under development, and most of the models show limited generality and prediction accuracy. Unlike system codes, CFD codes estimate the pressure drop from the velocity profile, which is obtained by solving the momentum conservation equations, and the resulting friction factor can be a representative parameter for a constant cross section channel flow. In addition, the flow inside a rectangular channel with a high span-to-gap ratio can be approximated by flow between parallel plates. The computational fluid dynamics simulation of the flow between parallel plates showed reasonable prediction capability for the laminar and turbulent regimes.
Computational Fluid Dynamic Pressure Drop Estimation of Flow between Parallel Plates
Energy Technology Data Exchange (ETDEWEB)
Son, Hyung Min; Yang, Soo Hyung; Park, Jong Hark [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)
2014-10-15
Many pool type reactors have forced downward flows inside the core during normal operation; there is a chance of flow inversion when transients occur. During this phase, the flow undergoes a transition between the turbulent and laminar regimes, where drastic changes take place in momentum and heat transfer, and a decrease in safety margin is usually observed. Additionally, for high Prandtl number fluids such as water, the effect of the velocity profile inside the channel on the temperature distribution is more pronounced than for low Prandtl number fluids. This makes the checking of its pressure drop estimation accuracy less important, assuming the code verification is complete. With the advent of powerful computer hardware, engineering applications of computational fluid dynamics (CFD) methods have become quite common. Especially for fully turbulent, single-phase convective heat transfer, the predictive capability of commercial codes has matured enough that many well-known companies adopt them to accelerate product development cycles and increase profitability. In contrast, transition models for CFD codes are still under development, and most of the models show limited generality and prediction accuracy. Unlike system codes, CFD codes estimate the pressure drop from the velocity profile, which is obtained by solving the momentum conservation equations, and the resulting friction factor can be a representative parameter for a constant cross section channel flow. In addition, the flow inside a rectangular channel with a high span-to-gap ratio can be approximated by flow between parallel plates. The computational fluid dynamics simulation of the flow between parallel plates showed reasonable prediction capability for the laminar and turbulent regimes.
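For the laminar regime, the friction factor mentioned in this abstract has a closed form that a CFD result can be checked against. The sketch below uses the textbook result for fully developed laminar flow between infinite parallel plates: the Darcy friction factor is 96/Re, with the Reynolds number based on a hydraulic diameter of twice the gap. The function names and the fluid and channel values are hypothetical illustrations, not data from the paper, and the sketch says nothing about the transitional or turbulent regimes.

```python
def laminar_pressure_drop(mu, v_mean, length, gap):
    """Exact pressure drop for fully developed laminar flow between plates."""
    return 12.0 * mu * v_mean * length / gap ** 2


def laminar_friction_factor(reynolds_dh):
    """Darcy friction factor for parallel plates, Re based on D_h = 2 * gap."""
    return 96.0 / reynolds_dh


def darcy_pressure_drop(f, length, d_h, rho, v_mean):
    """Darcy-Weisbach relation: dp = f * (L / D_h) * rho * V^2 / 2."""
    return f * (length / d_h) * 0.5 * rho * v_mean ** 2


if __name__ == "__main__":
    # Hypothetical water-like fluid in a 2 mm gap, 0.5 m long channel.
    rho, mu, v, gap, length = 1000.0, 1.0e-3, 0.05, 0.002, 0.5
    d_h = 2.0 * gap
    re = rho * v * d_h / mu
    dp = darcy_pressure_drop(laminar_friction_factor(re), length, d_h, rho, v)
    print("Re =", re, "dp =", dp, "Pa")
```

Both routes give the same pressure drop for these inputs, which is the kind of consistency check a laminar CFD run between parallel plates could be compared with.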
Henriques, David; González, Patricia; Doallo, Ramón; Saez-Rodriguez, Julio; Banga, Julio R.
2017-01-01
Background: We consider a general class of global optimization problems dealing with nonlinear dynamic models. Although this class is relevant to many areas of science and engineering, here we are interested in applying this framework to the reverse engineering problem in computational systems biology, which yields very large mixed-integer dynamic optimization (MIDO) problems. In particular, we consider the framework of logic-based ordinary differential equations (ODEs). Methods: We present saCeSS2, a parallel method for the solution of this class of problems. This method is based on a parallel cooperative scatter search metaheuristic, with new mechanisms of self-adaptation and specific extensions to handle large mixed-integer problems. We have paid special attention to the avoidance of convergence stagnation using adaptive cooperation strategies tailored to this class of problems. Results: We illustrate its performance with a set of three very challenging case studies from the domain of dynamic modelling of cell signaling. The simplest case study considers a synthetic signaling pathway and has 84 continuous and 34 binary decision variables. A second case study considers the dynamic modeling of signaling in liver cancer using high-throughput data, and has 135 continuous and 109 binary decision variables. The third case study is an extremely difficult problem related to breast cancer, involving 690 continuous and 138 binary decision variables. We report computational results obtained in different infrastructures, including a local cluster, a large supercomputer and a public cloud platform. Interestingly, the results show how the cooperation of individual parallel searches modifies the systemic properties of the sequential algorithm, achieving superlinear speedups compared to an individual search (e.g. speedups of 15 with 10 cores), and significantly improving (by more than 60%) the performance with respect to a non-cooperative parallel scheme. The scalability of the
Penas, David R; Henriques, David; González, Patricia; Doallo, Ramón; Saez-Rodriguez, Julio; Banga, Julio R
2017-01-01
We consider a general class of global optimization problems dealing with nonlinear dynamic models. Although this class is relevant to many areas of science and engineering, here we are interested in applying this framework to the reverse engineering problem in computational systems biology, which yields very large mixed-integer dynamic optimization (MIDO) problems. In particular, we consider the framework of logic-based ordinary differential equations (ODEs). We present saCeSS2, a parallel method for the solution of this class of problems. This method is based on a parallel cooperative scatter search metaheuristic, with new mechanisms of self-adaptation and specific extensions to handle large mixed-integer problems. We have paid special attention to the avoidance of convergence stagnation using adaptive cooperation strategies tailored to this class of problems. We illustrate its performance with a set of three very challenging case studies from the domain of dynamic modelling of cell signaling. The simplest case study considers a synthetic signaling pathway and has 84 continuous and 34 binary decision variables. A second case study considers the dynamic modeling of signaling in liver cancer using high-throughput data, and has 135 continuous and 109 binary decision variables. The third case study is an extremely difficult problem related to breast cancer, involving 690 continuous and 138 binary decision variables. We report computational results obtained in different infrastructures, including a local cluster, a large supercomputer and a public cloud platform. Interestingly, the results show how the cooperation of individual parallel searches modifies the systemic properties of the sequential algorithm, achieving superlinear speedups compared to an individual search (e.g. speedups of 15 with 10 cores), and significantly improving (by more than 60%) the performance with respect to a non-cooperative parallel scheme. The scalability of the method is also good (tests
Katz, D.; Cwik, T.; Sterling, T.
1998-01-01
This paper uses the parallel calculation of the radiation integral for examination of performance and compiler issues on a Beowulf-class computer. This type of computer, built from mass-market, commodity, off-the-shelf components, has limited communications performance and therefore also has a limited regime of codes for which it is suitable.
Ouyang, Lizhi
A systematic improvement and extension of the orthogonalized linear combinations of atomic orbitals method was carried out using a combined computational and theoretical approach. For high performance parallel computing, a Beowulf class personal computer cluster was constructed. It also served as a parallel program development platform that helped us to port the programs of the method to the national supercomputer facilities. The program received a language upgrade from Fortran 77 to Fortran 90 and a dynamic memory allocation feature. A preliminary parallel High Performance Fortran version of the program has been developed as well. To be of more benefit, though, scalability improvements are needed. In order to circumvent the difficulties of the analytical force calculation in the method, we developed a geometry optimization scheme using the finite difference approximation based on the total energy calculation. The implementation of this scheme was facilitated by the powerful general utility lattice program, which offers many desired features such as multiple optimization schemes and usage of space group symmetry. So far, many ceramic oxides have been tested with the geometry optimization program. Their optimized geometries were in excellent agreement with the experimental data. For nine ceramic oxide crystals, the optimized cell parameters differ from the experimental ones by less than 0.5%. Moreover, the geometry optimization was recently used to predict a new phase of TiNx. The method has also been used to investigate a complex Vitamin B12 derivative, the OHCbl crystals. In order to overcome the prohibitive disk I/O demand, an on-demand version of the method was developed. Based on the electronic structure calculation of the OHCbl crystal, a partial density of states analysis and a bond order analysis were carried out. The calculated bonding of the corrin ring of the OHCbl model was coincident with the big open-ring pi bond. One interesting find of the calculation was
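The idea of optimizing a geometry from total energies alone, with forces approximated by finite differences, can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_energy` is a made-up quadratic stand-in for a total-energy calculation, and plain steepest descent is swapped in for the optimization schemes of the general utility lattice program, whose actual algorithms are not described in the abstract.

```python
def fd_gradient(energy, x, h=1e-4):
    """Central-difference gradient built only from energy evaluations."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((energy(xp) - energy(xm)) / (2.0 * h))
    return grad


def optimize(energy, x0, step=0.1, tol=1e-6, max_iter=10000):
    """Steepest descent on the finite-difference gradient (illustrative)."""
    x = list(x0)
    for _ in range(max_iter):
        g = fd_gradient(energy, x)
        if max(abs(gi) for gi in g) < tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def toy_energy(params):
    """Hypothetical energy surface with its minimum at a = 4.0, c = 6.5."""
    a, c = params
    return (a - 4.0) ** 2 + 2.0 * (c - 6.5) ** 2


if __name__ == "__main__":
    print(optimize(toy_energy, [3.0, 7.0]))
```

Each gradient costs two energy evaluations per parameter, so the expense grows with the number of cell and atomic parameters. The evaluations are independent of one another, though, which makes this kind of scheme a natural fit for a cluster.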
High-performance parallel computing in the classroom using the public goods game as an example
Perc, Matjaž
2017-07-01
The use of computers in statistical physics is common because the sheer number of equations that describe the behaviour of an entire system particle by particle often makes it impossible to solve them exactly. Monte Carlo methods form a particularly important class of numerical methods for solving problems in statistical physics. Although these methods are simple in principle, their proper use requires a good command of statistical mechanics, as well as considerable computational resources. The aim of this paper is to demonstrate how the usage of widely accessible graphics cards on personal computers can elevate the computing power in Monte Carlo simulations by orders of magnitude, thus allowing live classroom demonstration of phenomena that would otherwise be out of reach. As an example, we use the public goods game on a square lattice where two strategies compete for common resources in a social dilemma situation. We show that the second-order phase transition to an absorbing phase in the system belongs to the directed percolation universality class, and we compare the time needed to arrive at this result by means of the main processor and by means of a suitable graphics card. Parallel computing on graphics processing units has been developed actively during the last decade, to the point where today the learning curve for entry is anything but steep for those familiar with programming. The subject is thus ripe for inclusion in graduate and advanced undergraduate curricula, and we hope that this paper will facilitate this process in the realm of physics education. To that end, we provide a documented source code for an easy reproduction of presented results and for further development of Monte Carlo simulations of similar systems.
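A serial reference version of the kind of simulation described here can be sketched in a few lines. This is not the authors' code. The group structure (each site plus its four nearest neighbours on a periodic lattice), the Fermi imitation rule, and the parameter names `r` and `k` are assumptions based on common practice for the spatial public goods game.

```python
import math
import random


def neighbours(x, y, size):
    """Four nearest neighbours on a periodic square lattice."""
    return [((x + 1) % size, y), ((x - 1) % size, y),
            (x, (y + 1) % size), (x, (y - 1) % size)]


def group_cooperators(grid, x, y):
    """Number of cooperators (value 1) in the group centred on (x, y)."""
    size = len(grid)
    return grid[y][x] + sum(grid[ny][nx] for nx, ny in neighbours(x, y, size))


def payoff(grid, x, y, r):
    """Total payoff of player (x, y) from the five groups it belongs to."""
    size = len(grid)
    total = 0.0
    for cx, cy in [(x, y)] + neighbours(x, y, size):
        # Each cooperator pays 1 into the pot; the pot is multiplied by r
        # and shared equally among the five group members.
        total += r * group_cooperators(grid, cx, cy) / 5.0 - grid[y][x]
    return total


def monte_carlo_step(grid, r, k, rng):
    """One full step: size*size attempted imitations (assumed Fermi rule)."""
    size = len(grid)
    for _ in range(size * size):
        x, y = rng.randrange(size), rng.randrange(size)
        nx, ny = rng.choice(neighbours(x, y, size))
        if grid[y][x] == grid[ny][nx]:
            continue
        diff = payoff(grid, x, y, r) - payoff(grid, nx, ny, r)
        if rng.random() < 1.0 / (1.0 + math.exp(diff / k)):
            grid[y][x] = grid[ny][nx]


if __name__ == "__main__":
    rng = random.Random(1)
    grid = [[rng.randint(0, 1) for _ in range(20)] for _ in range(20)]
    for _ in range(50):
        monte_carlo_step(grid, r=3.8, k=0.5, rng=rng)
    print(sum(map(sum, grid)) / 400.0)  # fraction of cooperators
```

The payoff evaluations dominate the run time and touch only a small neighbourhood of each site, which is why this kind of lattice model is a plausible candidate for a graphics card.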
Juang, Hann-Ming Henry; Tao, Wei-Kuo; Zeng, Xi-Ping; Shie, Chung-Lin; Simpson, Joanne; Lang, Steve
2004-01-01
The capability for massively parallel programming (MPP) using a message passing interface (MPI) has been implemented in a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The design for the MPP with MPI keeps the code structure similar for the whole domain and for the portions after decomposition. Hence the model follows the same integration for single and multiple tasks (CPUs). It also requires minimal changes to the original code, so it is easily modified and/or managed by model developers and users who have little knowledge of MPP. The entire model domain can be sliced into a one- or two-dimensional decomposition with a halo regime, which is overlaid on the partial domains. The halo regime requires that no data be fetched across tasks during a computational stage, but it must be updated before the next computational stage through data exchange via MPI. For reproducibility, transposing data among tasks is required for the spectral transform (Fast Fourier Transform, FFT), which is used in the anelastic version of the model to solve the pressure equation. The performance of the MPI-implemented codes (i.e., the compressible and anelastic versions) was tested on three different computing platforms. The major results are: 1) both versions have speedups of about 99% up to 256 tasks but not for 512 tasks; 2) the anelastic version has better speedup and efficiency because it requires more computations than the compressible version; 3) equal or approximately equal numbers of slices in the x- and y-directions provide the fastest integration due to fewer data exchanges; and 4) one-dimensional slices in the x-direction result in the slowest integration due to the need for more memory relocation for computation.
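The halo scheme described in this abstract can be illustrated without MPI by holding the subdomains as plain Python lists. In this sketch the same `compute` routine is applied to the whole domain and to each decomposed piece, and halos are refreshed only between computational stages. The three-point diffusion stencil, the function names, and the list-based "exchange" are illustrative assumptions, not the GCE model's code.

```python
def compute(field, alpha):
    """One stencil stage; reads only local cells, end cells stay unchanged."""
    new = field[:]
    for i in range(1, len(field) - 1):
        new[i] = field[i] + alpha * (field[i - 1] - 2 * field[i] + field[i + 1])
    return new


def decompose(u, n_tasks):
    """Split the interior of u into pieces, each padded with one halo cell."""
    base, extra = divmod(len(u) - 2, n_tasks)
    parts, start = [], 1
    for t in range(n_tasks):
        size = base + (1 if t < extra else 0)
        parts.append(u[start - 1:start + size + 1])
        start += size
    return parts


def exchange_halos(parts):
    """Copy edge values between neighbouring pieces (stand-in for MPI)."""
    for left, right in zip(parts, parts[1:]):
        left[-1] = right[1]    # right neighbour's first interior cell
        right[0] = left[-2]    # left neighbour's last interior cell


def gather(parts):
    """Reassemble the global field from the interiors of all pieces."""
    out = [parts[0][0]]
    for p in parts:
        out.extend(p[1:-1])
    out.append(parts[-1][-1])
    return out


def run_decomposed(u, alpha, n_tasks, steps):
    """Needs at least one interior cell per task."""
    parts = decompose(u, n_tasks)
    for _ in range(steps):
        exchange_halos(parts)                 # update halos before the stage
        parts = [compute(p, alpha) for p in parts]
    return gather(parts)


def run_serial(u, alpha, steps):
    for _ in range(steps):
        u = compute(u, alpha)
    return u
```

Because every piece runs the same routine as the undecomposed domain and never reads a neighbour's data during a stage, the decomposed run reproduces the serial result, which is the property the abstract highlights.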
Quantum and classical parallelism in parity algorithms for ensemble quantum computers
International Nuclear Information System (INIS)
Stadelhofer, Ralf; Suter, Dieter; Banzhaf, Wolfgang
2005-01-01
The determination of the parity of a string of N binary digits is a well-known problem in classical as well as quantum information processing, which can be formulated as an oracle problem. It has been established that quantum algorithms require at least N/2 oracle calls. We present an algorithm that reaches this lower bound and is also optimal in terms of additional gate operations required. We discuss its application to pure and mixed states. Since it can be applied directly to thermal states, it does not suffer from signal loss associated with pseudo-pure-state preparation. For ensemble quantum computers, the number of oracle calls can be further reduced by a factor $2^k$, with $k \in \{1, 2, \ldots, \log_2(N/2)\}$, provided the signal-to-noise ratio is sufficiently high. This additional speed-up is linked to (classical) parallelism of the ensemble quantum computer. Experimental realizations are demonstrated on a liquid-state NMR quantum computer.
Parallelizing Epistasis Detection in GWAS on FPGA and GPU-Accelerated Computing Systems.
González-Domínguez, Jorge; Wienbrandt, Lars; Kässens, Jan Christian; Ellinghaus, David; Schimmler, Manfred; Schmidt, Bertil
2015-01-01
High-throughput genotyping technologies (such as SNP-arrays) allow the rapid collection of up to a few million genetic markers of an individual. Detecting epistasis (based on 2-SNP interactions) in Genome-Wide Association Studies is an important but time consuming operation since statistical computations have to be performed for each pair of measured markers. Computational methods to detect epistasis therefore suffer from prohibitively long runtimes; e.g., processing a moderately-sized dataset consisting of about 500,000 SNPs and 5,000 samples requires several days using state-of-the-art tools on a standard 3 GHz CPU. In this paper, we demonstrate how this task can be accelerated using a combination of fine-grained and coarse-grained parallelism on two different computing systems. The first architecture is based on reconfigurable hardware (FPGAs) while the second architecture uses multiple GPUs connected to the same host. We show that both systems can achieve speedups of around four orders-of-magnitude compared to the sequential implementation. This significantly reduces the runtimes for detecting epistasis to only a few minutes for moderately-sized datasets and to a few hours for large-scale datasets.
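The scale of the pairwise search in this abstract follows from simple counting: n markers give n(n-1)/2 unordered pairs, and each pair is evaluated over all samples. The short sketch below works this out for the dataset size quoted above. The function name `snp_pairs` is an illustrative choice, and no runtime figures are derived from it.

```python
def snp_pairs(n_snps):
    """Number of unordered SNP pairs to test for 2-SNP interactions."""
    return n_snps * (n_snps - 1) // 2


if __name__ == "__main__":
    pairs = snp_pairs(500_000)
    print(pairs)           # 124999750000 pairs
    print(pairs * 5_000)   # pair-by-sample genotype combinations
```

Since each pair can be tested independently of the others, the workload splits naturally across devices (coarse-grained) and across threads or logic units within a device (fine-grained).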
Two-phase flow steam generator simulations on parallel computers using domain decomposition method
International Nuclear Information System (INIS)
Belliard, M.
2003-01-01
Within the framework of the Domain Decomposition Method (DDM), we present industrial steady state two-phase flow simulations of PWR Steam Generators (SG) using iteration-by-sub-domain methods: standard and Adaptive Dirichlet/Neumann methods (ADN). The averaged mixture balance equations are solved by a Fractional-Step algorithm, jointly with the Crank-Nicolson scheme and the Finite Element Method. The algorithm works with overlapping or non-overlapping sub-domains and with conforming or nonconforming meshing. Computations are run on PC networks or on massively parallel mainframe computers. A CEA code-linker and the PVM package are used (master-slave context). SG mock-up simulations, involving up to 32 sub-domains, highlight the efficiency (speed-up, scalability) and the robustness of the chosen approach. With the DDM, the computational problem size is easily increased to about 1,000,000 cells and the CPU time is significantly reduced. The difficulties related to industrial use are also discussed. (author)
International Nuclear Information System (INIS)
Lee, Soon-Hwan; Chino, Masamichi
2000-01-01
Coupling atmosphere and ocean models poses physical and computational difficulties for short-term forecasting of weather and ocean currents. In this research, a system combining a high-resolution meso-scale atmospheric model with an ocean model has been constructed using a new message-passing library, called Stampi (Seamless Thinking Aid Message Passing Interface), to predict particle dispersion in a nuclear emergency. Stampi, which is based on the MPI (Message Passing Interface) 2 specification, allows parallel calculations of the combined system to be carried out without applying parallelization skills to the model code. It also realizes dynamic process creation on different machines and communication between the spawned processes within the scope of MPI semantics. The models included in this combined system are PHYSIC as the atmosphere model and POM (Princeton Ocean Model) as the ocean model. We applied the system to predict the sea surface current in the Sea of Japan in the winter season. Simulation results indicate that the wind stress near the sea surface tends to be the predominant factor determining surface ocean currents and the dispersion of radioactive contamination in the ocean. The surface ocean current corresponds well with the wind direction, which is induced by high mountains in North Korea. The satellite data of NSCAT (NASA-SCATterometer), which is an image of sea surface current, also agrees well with the results of this system. (author)
Jiang, Y.; Xing, H. L.
2016-12-01
Micro-seismic events induced by water injection, mining activity or oil/gas extraction are quite informative; their interpretation can be applied to the reconstruction of underground stress and the monitoring of hydraulic fracturing progress in oil/gas reservoirs. The source characteristics and locations are crucial parameters required for these purposes, and they can be obtained through the waveform matching inversion (WMI) method. It is therefore imperative to develop a WMI algorithm with high accuracy and convergence speed. Heuristic algorithms, as a category of nonlinear methods, possess a very high convergence speed and a good capacity to overcome local minima, and have been successfully applied in many areas (e.g. image processing, artificial intelligence). However, their effectiveness for micro-seismic WMI is still poorly investigated; very little literature exists that addresses this subject. In this research an advanced heuristic algorithm, the gravitational search algorithm (GSA), is proposed to estimate the focal mechanism (angles of strike, dip and rake) and source locations in three dimensions. Unlike traditional inversion methods, the heuristic algorithm inversion does not require the approximation of the Green's function. The method interacts directly with a CPU-parallelized finite difference forward modelling engine and updates the model parameters under GSA criteria. The effectiveness of this method is tested with synthetic data from a multi-layered elastic model; the results indicate that GSA can be successfully applied to WMI and has its unique advantages. Keywords: Micro-seismicity, Waveform matching inversion, gravitational search algorithm, parallel computation
Directory of Open Access Journals (Sweden)
Zhaocai Wang
2015-10-01
The unbalanced assignment problem (UAP) is to optimally resolve the problem of assigning n jobs to m individuals (m < n), such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operation management and applied mathematics, having numerous real life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We reasonably design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and get the solutions of the UAP in the proper length range and O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation.
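To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the DNA algorithm of the paper. It assumes one common reading of the UAP: every job is assigned to exactly one individual, every individual receives at least one job, and total cost is minimized. The function name `solve_uap` and the cost matrix are hypothetical, chosen only for illustration.

```python
from itertools import product

def solve_uap(cost):
    """Exhaustively solve a small unbalanced assignment problem.

    cost[i][j] is the cost of individual i doing job j (m individuals,
    n jobs, m < n). Assumption: each job goes to exactly one individual
    and each individual gets at least one job.
    """
    m, n = len(cost), len(cost[0])
    best_cost, best_assign = None, None
    # assign[j] is the individual who performs job j; m**n candidates.
    for assign in product(range(m), repeat=n):
        if len(set(assign)) < m:
            continue  # some individual would be left without a job
        total = sum(cost[assign[j]][j] for j in range(n))
        if best_cost is None or total < best_cost:
            best_cost, best_assign = total, assign
    return best_cost, best_assign

# Hypothetical instance: 2 individuals, 3 jobs.
cost = [[4, 2, 8],
        [3, 7, 1]]
best_cost, best_assign = solve_uap(cost)
print(best_cost, best_assign)
```

The exhaustive search grows as m**n, which is why the abstract stresses massive parallelism as the route to tractability.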
Parallel Processing and Bio-inspired Computing for Biomedical Image Registration
Directory of Open Access Journals (Sweden)
Silviu Ioan Bejinariu
2014-07-01
Image Registration (IR) is an optimization problem computing optimal parameters of a geometric transform used to overlay one or more source images to a given model by maximizing a similarity measure. In this paper the use of bio-inspired optimization algorithms in image registration is analyzed. Results obtained by means of three different algorithms are compared: Bacterial Foraging Optimization Algorithm (BFOA), Genetic Algorithm (GA) and Clonal Selection Algorithm (CSA). Depending on the image type, the registration may be area based, which is slow but more precise, or feature based, which is faster. In this paper a feature based approach based on the Scale Invariant Feature Transform (SIFT) is proposed. Finally, results obtained using sequential and parallel implementations on multi-core systems for area based and feature based image registration are compared.
International Nuclear Information System (INIS)
Wong Unhong; Wong Honcheng; Tang Zesheng
2010-01-01
Smoothed particle hydrodynamics (SPH), which is a class of meshfree particle methods (MPMs), has a wide range of applications from micro-scale to macro-scale as well as from discrete systems to continuum systems. Graphics hardware, originally designed for computer graphics, now provides unprecedented computational power for scientific computation. Particle systems need a huge amount of computation in physical simulation. In this paper, an efficient parallel implementation of an SPH method on graphics hardware using the Compute Unified Device Architecture is developed for fluid simulation. Compared with the corresponding CPU implementation, our experimental results show that the new approach allows significant speedups of fluid simulation by handling a huge amount of computation in parallel on graphics hardware.
International Nuclear Information System (INIS)
Azmy, Y.Y.; Kirk, B.L.
1990-01-01
Modern parallel computer architectures offer an enormous potential for reducing CPU and wall-clock execution times of large-scale computations commonly performed in various applications in science and engineering. Recently, several authors have reported their efforts in developing and implementing parallel algorithms for solving the neutron diffusion equation on a variety of shared- and distributed-memory parallel computers. Testing of these algorithms for a variety of two- and three-dimensional meshes showed significant speedup of the computation. Even for very large problems (i.e., three-dimensional fine meshes) executed concurrently on a few nodes in serial (nonvector) mode, however, the measured computational efficiency is very low (40 to 86%). In this paper, the authors present a highly efficient (∼85 to 99.9%) algorithm for solving the two-dimensional nodal diffusion equations on the Sequent Balance 8000 parallel computer. Also presented is a model for the performance, represented by the efficiency, as a function of problem size and the number of participating processors. The model is validated through several tests and then extrapolated to larger problems and more processors to predict the performance of the algorithm in more computationally demanding situations
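The efficiency percentages quoted in this abstract follow the conventional definitions of speedup and parallel efficiency. The short Python sketch below restates those definitions; the timings are invented illustrative values, not measurements from the paper.

```python
def speedup(t_serial, t_parallel):
    """Speedup: serial wall-clock time divided by parallel wall-clock time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / n_procs

# Hypothetical timings in seconds (assumed for illustration only).
t1 = 120.0   # one processor
t8 = 16.0    # eight processors
s = speedup(t1, t8)
e = efficiency(t1, t8, 8)
print(f"speedup = {s}, efficiency = {e:.2%}")
```

An efficiency of 100% would mean the run time falls exactly in proportion to the processor count.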
3D streamers simulation in a pin to plane configuration using massively parallel computing
Plewa, J.-M.; Eichwald, O.; Ducasse, O.; Dessante, P.; Jacobs, C.; Renon, N.; Yousfi, M.
2018-03-01
This paper concerns the 3D simulation of corona discharge using high performance computing (HPC) managed with the message passing interface (MPI) library. In the field of finite volume methods applied on non-adaptive mesh grids and in the case of a specific 3D dynamic benchmark test devoted to streamer studies, the great efficiency of the iterative R&B SOR and BiCGSTAB methods versus the direct MUMPS method was clearly demonstrated in solving the Poisson equation using HPC resources. The optimization of the parallelization and the resulting scalability was undertaken as a function of the HPC architecture for a number of mesh cells ranging from 8 to 512 million and a number of cores ranging from 20 to 1600. The R&B SOR method remains at least about four times faster than the BiCGSTAB method and requires significantly less memory for all tested situations. The R&B SOR method was then implemented in a 3D MPI parallelized code that solves the classical first order model of an atmospheric pressure corona discharge in air. The 3D code capabilities were tested by following the development of one, two and four coplanar streamers generated by initial plasma spots for 6 ns. The preliminary results obtained allowed us to follow in detail the formation of the tree structure of a corona discharge and the effects of the mutual interactions between the streamers in terms of streamer velocity, trajectory and diameter. The computing time for 64 million mesh cells distributed over 1000 cores using the MPI procedures is about 30 min ns⁻¹, regardless of the number of streamers.
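The red-black (R&B) SOR idea mentioned above can be illustrated with a small serial sketch. The Python below is a simplified 2D, pure-Python version under stated assumptions (uniform square grid, zero Dirichlet boundary, five-point stencil, fixed relaxation factor); it is not the paper's 3D MPI code, and the function name and parameters are hypothetical.

```python
def red_black_sor(f, h, omega=1.5, tol=1e-8, max_iter=10_000):
    """Solve -laplace(u) = f on a square grid with u = 0 on the boundary.

    f is an n x n list of lists, h the grid spacing. Points are coloured
    like a chessboard; all points of one colour depend only on points of
    the other colour, so each half-sweep could be updated in parallel.
    """
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for it in range(1, max_iter + 1):
        max_change = 0.0
        for colour in (0, 1):            # 0 = "red" points, 1 = "black" points
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 != colour:
                        continue
                    # Gauss-Seidel value from the five-point stencil
                    gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1]
                                 + h * h * f[i][j])
                    change = omega * (gs - u[i][j])   # over-relaxed update
                    u[i][j] += change
                    max_change = max(max_change, abs(change))
        if max_change < tol:
            return u, it
    return u, max_iter

# Illustrative problem: unit source on the unit square, 17 x 17 grid.
n = 17
h = 1.0 / (n - 1)
f = [[1.0] * n for _ in range(n)]
u, iterations = red_black_sor(f, h)
print(iterations, u[n // 2][n // 2])
```

Because each colour is independent within a half-sweep, a domain-decomposed code only needs to exchange boundary values between half-sweeps, which is one reason the scheme suits MPI parallelization.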
pWeb: A High-Performance, Parallel-Computing Framework for Web-Browser-Based Medical Simulation.
Halic, Tansel; Ahn, Woojin; De, Suvranu
2014-01-01
This work presents pWeb, a new language and compiler for parallelization of client-side compute intensive web applications such as surgical simulations. The recently introduced HTML5 standard has enabled creating unprecedented applications on the web. Low performance of the web browser, however, remains the bottleneck of computationally intensive applications including visualization of complex scenes, real time physical simulations and image processing compared to native ones. The new proposed language is built upon web workers for multithreaded programming in HTML5. The language provides fundamental functionalities of parallel programming languages as well as the fork/join parallel model which is not supported by web workers. The language compiler automatically generates an equivalent parallel script that complies with the HTML5 standard. A case study on realistic rendering for surgical simulations demonstrates enhanced performance with a compact set of instructions.
Energy Technology Data Exchange (ETDEWEB)
Archer, Charles J.; Faraj, Daniel A.; Inglett, Todd A.; Ratterman, Joseph D.
2018-01-30
Methods, apparatus, and products are disclosed for providing full point-to-point communications among compute nodes of an operational group in a global combining network of a parallel computer, each compute node connected to each adjacent compute node in the global combining network through a link, that include: receiving a network packet in a compute node, the network packet specifying a destination compute node; selecting, in dependence upon the destination compute node, at least one of the links for the compute node along which to forward the network packet toward the destination compute node; and forwarding the network packet along the selected link to the adjacent compute node connected to the compute node through the selected link.
Marek, A; Blum, V; Johanni, R; Havu, V; Lang, B; Auckenthaler, T; Heinecke, A; Bungartz, H-J; Lederer, H
2014-05-28
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N³) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding backtransformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMP implementation available as well. Scalability beyond 10,000 CPU cores for problem
International Nuclear Information System (INIS)
Kole, J S; Beekman, F J
2005-01-01
Statistical reconstruction methods offer possibilities of improving image quality as compared to analytical methods, but current reconstruction times prohibit routine clinical applications. To reduce reconstruction times we have parallelized a statistical reconstruction algorithm for cone-beam x-ray CT, the ordered subset convex algorithm (OSC), and evaluated it on a shared memory computer. Two different parallelization strategies were developed: one that employs parallelism by computing the work for all projections within a subset in parallel, and one that divides the total volume into parts and processes the work for each sub-volume in parallel. Both methods are used to reconstruct a three-dimensional mathematical phantom on two different grid densities. The reconstructed images are binary identical to the result of the serial (non-parallelized) algorithm. The speed-up factor equals approximately 30 when using 32 to 40 processors, and scales almost linearly with the number of CPUs for both methods. The huge reduction in computation time allows us to apply statistical reconstruction to clinically relevant studies for the first time.
Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Torres, Aaron
2015-02-03
Techniques are provided for storing files in a parallel computing system using sub-files with semantically meaningful boundaries. A method is provided for storing at least one file generated by a distributed application in a parallel computing system. The file comprises one or more of a complete file and a plurality of sub-files. The method comprises the steps of obtaining a user specification of semantic information related to the file; providing the semantic information as a data structure description to a data formatting library write function; and storing the semantic information related to the file with one or more of the sub-files in one or more storage nodes of the parallel computing system. The semantic information provides a description of data in the file. The sub-files can be replicated based on semantically meaningful boundaries.
High fidelity thermal-hydraulic analysis using CFD and massively parallel computers
International Nuclear Information System (INIS)
Weber, D.P.; Wei, T.Y.C.; Brewster, R.A.; Rock, Daniel T.; Rizwan-uddin
2000-01-01
Thermal-hydraulic analyses play an important role in design and reload analysis of nuclear power plants. These analyses have historically relied on early generation computational fluid dynamics capabilities, originally developed in the 1960s and 1970s. Over the last twenty years, however, dramatic improvements in both computational fluid dynamics codes in the commercial sector and in computing power have taken place. These developments offer the possibility of performing large scale, high fidelity, core thermal hydraulics analysis. Such analyses will allow a determination of the conservatism employed in traditional design approaches and possibly justify the operation of nuclear power systems at higher powers without compromising safety margins. The objective of this work is to demonstrate such a large scale analysis approach using a state of the art CFD code, STAR-CD, and the computing power of massively parallel computers, provided by IBM. A high fidelity representation of a current generation PWR was analyzed with the STAR-CD CFD code and the results were compared to traditional analyses based on the VIPRE code. Current design methodology typically involves a simplified representation of the assemblies, where a single average pin is used in each assembly to determine the hot assembly from a whole core analysis. After determining this assembly, increased refinement is used in the hot assembly, and possibly some of its neighbors, to refine the analysis for purposes of calculating DNBR. This latter calculation is performed with sub-channel codes such as VIPRE. The modeling simplifications that are used involve the approximate treatment of surrounding assemblies and coarse representation of the hot assembly, where the subchannel is the lowest level of discretization. In the high fidelity analysis performed in this study, both restrictions have been removed. Within the hot assembly, several hundred thousand to several million computational zones have been used, to
Energy Technology Data Exchange (ETDEWEB)
Candel, A.; Kabel, A.; Lee, L.; Li, Z.; Limborg, C.; Ng, C.; Prudencio, E.; Schussman, G.; Uplenchwar, R.; Ko, K.; /SLAC
2009-06-19
Over the past years, SLAC's Advanced Computations Department (ACD), under SciDAC sponsorship, has developed a suite of 3D (2D) parallel higher-order finite element (FE) codes, T3P (T2P) and Pic3P (Pic2P), aimed at accurate, large-scale simulation of wakefields and particle-field interactions in radio-frequency (RF) cavities of complex shape. The codes are built on the FE infrastructure that supports SLAC's frequency domain codes, Omega3P and S3P, to utilize conformal tetrahedral (triangular) meshes, higher-order basis functions and quadratic geometry approximation. For time integration, they adopt an unconditionally stable implicit scheme. Pic3P (Pic2P) extends T3P (T2P) to treat charged-particle dynamics self-consistently using the PIC (particle-in-cell) approach, the first such implementation on a conformal, unstructured grid using Whitney basis functions. Examples from applications to the International Linear Collider (ILC), Positron Electron Project-II (PEP-II), Linac Coherent Light Source (LCLS) and other accelerators will be presented to compare the accuracy and computational efficiency of these codes versus their counterparts using structured grids.
Directory of Open Access Journals (Sweden)
Mark Michael Budnik
2011-04-01
In this paper, we present how our College of Engineering is developing a growing portfolio of engineering computer games as a parallel learning opportunity for undergraduate engineering and primary (grade K-5) students. Around the world, many schools provide secondary students (grade 6-12) with opportunities to pursue pre-engineering classes. However, by the time students reach this age, many of them have already determined their educational goals and preferred careers. Our College of Engineering is developing resources to provide primary students, still in their educational formative years, with opportunities to learn more about engineering. One of these resources is a library of engineering games targeted to the primary student population. The games are designed by sophomore students in our College of Engineering. During their Introduction to Computational Techniques course, the students use the LabVIEW environment to develop the games. This software provides a wealth of design resources for the novice programmer; using it to develop the games strengthens the undergraduates
Energy Technology Data Exchange (ETDEWEB)
Elbert, Stephen T.; Kalsi, Karanjit; Vlachopoulou, Maria; Rice, Mark J.; Glaesemann, Kurt R.; Zhou, Ning
2012-07-26
Financial Transmission Rights (FTRs) help power market participants reduce price risks associated with transmission congestion. FTRs are issued based on a process of solving a constrained optimization problem with the objective to maximize the FTR social welfare under power flow security constraints. Security constraints for different FTR categories (monthly, seasonal or annual) are usually coupled and the number of constraints increases exponentially with the number of categories. Commercial software for FTR calculation can only provide limited categories of FTRs due to the inherent computational challenges mentioned above. In this paper, a novel non-linear dynamical system (NDS) approach is proposed to solve the optimization problem. The new formulation and the performance of the NDS solver are benchmarked against widely used linear programming (LP) solvers like CPLEX™ and tested on large-scale systems using data from the Western Electricity Coordinating Council (WECC). The NDS is demonstrated to outperform the widely used CPLEX algorithms while exhibiting superior scalability. Furthermore, the NDS based solver can be easily parallelized, which results in significant computational improvement.
International Nuclear Information System (INIS)
Takemiya, Hiroshi; Yamagishi, Nobuhiro
2000-02-01
We report on an RPC (Remote Procedure Call)-based communication library, Starpc, for a parallel computer cluster. Starpc supports communication between Java Applets and C programs as well as between C programs. Starpc has the following three features. (1) It enables communication between Java Applets and C programs on an arbitrary computer without security violation, although Java Applets are normally restricted for security reasons to communicating only with programs on one specific computer (the Web server). (2) Diverse network communication protocols are available in Starpc, because it uses the Nexus communication library developed at Argonne National Laboratory. (3) It works on many kinds of computers including eight parallel computers and four WS servers. In this report, the usage of Starpc and the development of applications using Starpc are described. (author)
Al Jarro, Ahmed
2011-08-01
A hybrid MPI/OpenMP scheme for efficiently parallelizing the explicit marching-on-in-time (MOT)-based solution of the time-domain volume (Volterra) integral equation (TD-VIE) is presented. The proposed scheme equally distributes tested field values and operations pertinent to the computation of tested fields among the nodes using the MPI standard; while the source field values are stored in all nodes. Within each node, OpenMP standard is used to further accelerate the computation of the tested fields. Numerical results demonstrate that the proposed parallelization scheme scales well for problems involving three million or more spatial discretization elements. © 2011 IEEE.
Resolution of the neutron transport equation by massively parallel computer in the Cronos code
International Nuclear Information System (INIS)
Zardini, D.M.
1996-01-01
The feasibility of solving neutron transport problems in parallel with the SN module of the CRONOS code is studied here. In this report we give the first data on parallel resolution of the transport equation by decomposition of the angular variable. Problems concerning parallel resolution by decomposition of the spatial variable and memory stage limits are also explained here. (author)
Fast parallel DNA-based algorithms for molecular computation: the set-partition problem.
Chang, Weng-Long
2007-12-01
This paper demonstrates that basic biological operations can be used to solve the set-partition problem. In order to achieve this, we propose three DNA-based algorithms, a signed parallel adder, a signed parallel subtractor and a signed parallel comparator, that formally verify our designed molecular solutions for solving the set-partition problem.
Energy Technology Data Exchange (ETDEWEB)
Hermenegildo, M.V.
1986-01-01
The term Logic Programming refers to a variety of computer languages and execution models based on the traditional concept of Symbolic Logic. The expressive power of these languages offers promise to be of great assistance in facing the programming challenges of present and future symbolic processing applications in artificial intelligence, knowledge-based systems, and many other areas of computing. This dissertation presents an efficient parallel execution model for logic programs. The model is described from the source language level down to an Abstract Machine level, suitable for direct implementation on existing parallel systems or for the design of special purpose parallel architectures. Few assumptions are made at the source language level and, therefore, the techniques developed and the general Abstract Machine design are applicable to a variety of logic (and also functional) languages. These techniques offer efficient solutions to several areas of parallel Logic Programming implementation previously considered problematic or a source of considerable overhead, such as the detection and handling of variable binding conflicts in AND-parallelism, the specification of control and management of the execution tree, the treatment of distributed backtracking, and goal scheduling and memory management issues, etc. A parallel Abstract Machine design is offered, specifying data areas, operation, and a suitable instruction set.
Directory of Open Access Journals (Sweden)
JONG WOON KIM
2014-04-01
In this paper, we introduce a modified scattering kernel approach that avoids the unnecessarily repeated calculations involved in the scattering source calculation, and we use it with parallel computing to effectively reduce the computation time. Its computational efficiency was tested for three-dimensional fully coupled photon-electron transport problems using our computer program, which solves the multi-group discrete ordinates transport equation by using the discontinuous finite element method with unstructured tetrahedral meshes for complicated geometrical problems. The numerical tests show that the modified scattering kernel yields speed-ups of 17 to 42 times in the elapsed time per iteration, not only in single-CPU calculations but also in parallel computing with several CPUs.
Crockett, Thomas W.
1995-01-01
This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
International Nuclear Information System (INIS)
Satake, Shinsuke; Okamoto, Masao; Nakajima, Noriyoshi; Takamaru, Hisanori
2005-11-01
A neoclassical transport simulation code (FORTEC-3D) applicable to three-dimensional configurations has been developed using High Performance Fortran (HPF). Adoption of computing techniques for parallelization and a hybrid simulation model to the δf Monte-Carlo method transport simulation, including non-local transport effects in three-dimensional configurations, makes it possible to simulate the dynamism of global, non-local transport phenomena with a self-consistent radial electric field within a reasonable computation time. In this paper, development of the transport code using HPF is reported. Optimization techniques in order to achieve both high vectorization and parallelization efficiency, adoption of a parallel random number generator, and also benchmark results, are shown. (author)
Discovery of resources using MADM approaches for parallel and distributed computing
Directory of Open Access Journals (Sweden)
Mandeep Kaur
2017-06-01
Grid, a form of parallel and distributed computing, allows the sharing of data and computational resources among its users from various geographical locations. The grid resources are diverse in terms of their underlying attributes. The majority of the state-of-the-art resource discovery techniques rely on the static resource attributes during resource selection. However, the matching resources based on the static resource attributes may not be the most appropriate resources for the execution of user applications because they may have heavy job loads, less storage space or less working memory (RAM). Hence, there is a need to consider the current state of the resources in order to find the most suitable resources. In this paper, we have proposed a two-phased multi-attribute decision making (MADM) approach for discovery of grid resources by using P2P formalism. The proposed approach considers multiple resource attributes for decision making of resource selection and provides the best suitable resource(s) to grid users. The first phase describes a mechanism to discover all matching resources and applies the SAW method to shortlist the top ranked resources, which are communicated to the requesting super-peer. The second phase of our proposed methodology applies an integrated MADM approach (AHP enriched PROMETHEE-II) on the list of selected resources received from different super-peers. The pairwise comparison of the resources with respect to their attributes is made and the rank of each resource is determined. The top ranked resource is then communicated to the grid user by the grid scheduler. Our proposed methodology enables the grid scheduler to allocate the most suitable resource to the user application and also reduces the search complexity by filtering out the less suitable resources during resource discovery.
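The SAW step of the first phase can be illustrated with a minimal Python sketch of simple additive weighting in its textbook form: normalize each attribute, then rank by weighted sum. The resource names, attribute values and weights below are hypothetical and are not taken from the paper; the paper's exact normalization may differ.

```python
def saw_rank(resources, weights, benefit):
    """Rank resources by simple additive weighting (SAW).

    resources: dict mapping name -> tuple of attribute values
    weights:   one weight per attribute (assumed to sum to 1)
    benefit:   True if larger is better for that attribute, False for cost
    """
    names = list(resources)
    n_attr = len(weights)
    columns = [[resources[r][k] for r in names] for k in range(n_attr)]
    scores = {}
    for r in names:
        total = 0.0
        for k in range(n_attr):
            x = resources[r][k]
            # Linear normalization: x/max for benefits, min/x for costs.
            norm = x / max(columns[k]) if benefit[k] else min(columns[k]) / x
            total += weights[k] * norm
        scores[r] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical resources: (free RAM in GB, free storage in GB, queued jobs).
resources = {"A": (8, 500, 4), "B": (16, 200, 2), "C": (4, 1000, 1)}
weights = (0.5, 0.2, 0.3)
benefit = (True, True, False)   # job load is a cost attribute
ranking = saw_rank(resources, weights, benefit)
print(ranking)
```

A shortlist is then simply the first few entries of `ranking`, which in the described scheme would be passed on for the second-phase comparison.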
A parallel calibration utility for WRF-Hydro on high performance computers
Wang, J.; Wang, C.; Kotamarthi, V. R.
2017-12-01
Successful modeling of complex hydrological processes comprises establishing an integrated hydrological model that simulates the hydrological processes in each water regime, calibrating and validating the model performance against observation data, and estimating the uncertainties from different sources, especially those associated with parameters. Such a model system requires large computing resources and often has to be run on High Performance Computers (HPC). The recently developed WRF-Hydro modeling system provides a significant advancement in the capability to simulate regional water cycles more completely. The WRF-Hydro model has a large range of parameters, such as those in the input table files (GENPARM.TBL, SOILPARM.TBL and CHANPARM.TBL) and several distributed scaling factors such as OVROUGHRTFAC. These parameters affect the behavior and outputs of the model and thus may need to be calibrated against observations in order to obtain good modeling performance. Having a parameter calibration tool specifically for automated calibration and uncertainty estimation of the WRF-Hydro model can provide significant convenience for the modeling community. In this study, we developed a customized tool using the parallel version of the model-independent parameter estimation and uncertainty analysis tool, PEST, to enable it to run on HPC with the PBS and SLURM workload managers and job schedulers. We also developed a series of PEST input file templates specifically for WRF-Hydro model calibration and uncertainty analysis. Here we will present a case study of a flood that occurred in April 2013 over the Midwest. The sensitivity and uncertainties are analyzed using the customized PEST tool we developed.
Lyster, Peter M.; Guo, J.; Clune, T.; Larson, J. W.; Atlas, Robert (Technical Monitor)
2001-01-01
The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations are assimilated with the output of a dynamical model to generate best-estimates of the states of the system. It is thus a mapping problem, whereby scattered observations are converted into regular accurate maps of wind, temperature, moisture and other variables. The DAO is developing and using 4DDA algorithms that provide these datasets, or analyses, in support of Earth System Science research. Two large-scale algorithms are discussed. The first approach, the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 20 gigaflop/s. The second approach, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS. For atmospheric assimilation, the gridded dynamical fields typically have more than 10⁶ variables; therefore the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem can easily scale to petaflop/s proportions. We discuss the computational complexity of GEOS DAS and our implementation of the Kalman filter. We also discuss and quantify some of the technical issues and limitations in developing efficient, in terms of wall clock time, and scalable parallel implementations of the algorithms.
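The "teraword" figure follows directly from the state dimension, as the short Python calculation below shows. The 8-byte word size is an assumption made here for illustration (double precision); the abstract does not state a word size.

```python
# State dimension quoted in the abstract: on the order of 10**6 variables.
n_state = 10**6

# A full n x n error covariance matrix has n**2 entries (words).
full_entries = n_state ** 2

# Exploiting symmetry, only the upper triangle needs to be stored.
sym_entries = n_state * (n_state + 1) // 2

# Assumed word size: 8 bytes (double precision); not stated in the source.
bytes_full = full_entries * 8

print(f"full matrix: {full_entries:.1e} words")       # 1.0e+12, i.e. one teraword
print(f"symmetric storage: {sym_entries:.1e} words")
print(f"full matrix at 8 bytes/word: {bytes_full / 1e12:.0f} TB")
```

Even with symmetric storage the matrix remains about half a teraword, which is why the storage and propagation of the covariance dominates the cost of the Kalman filter approach.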
Niculescu, Mihai; Hristov, Peter
In this thesis we try to show the impact of new technologies on scientific work in the broad field of heavy-ion physics and, as a case study, we present an implementation of the event plane method on a highly parallel technology: the graphics processor. At the end of the thesis, the analysis results are compared with the elliptic flow published by ALICE. In Chapter 1 we present the computing needs of the heavy-ion physics experiment ALICE and show the current state of software and technologies. The new technologies that have been available for some time (Chapter 2) offer new performance capabilities and have generated a trend of preparing for the new wave of technologies and software, which most indicators show will dominate the future. This has not been disregarded by the scientific community; accordingly, Section 2.2 shows the rising interest in the new technologies within the High Energy Physics community. A real case study was needed to better understand how the new technologies can be applied in HEP and aniso...
Communication complexity of distributed computing and a parallel algorithm for polynomial roots
International Nuclear Information System (INIS)
Tiwari, P.
1986-01-01
The first part of this thesis begins with a discussion of the minimum communication requirements in some distributed networks. The main result is a general technique for determining lower bounds on the communication complexity of problems on various distributed computer networks. This general technique is derived by simulating the general network by a linear array and then using a lower bound on the communication complexity of the problem on the linear array. Applications of this technique yield nontrivial optimal or near-optimal lower bounds on the communication complexity of distinctness, ranking, uniqueness, merging, and triangle detection on a ring, a mesh, and a complete binary tree of processors. A technique similar to the one used in proving the above results yields interesting graph-theoretic results concerning the decomposition of a graph into complete bipartite subgraphs. The second part of the thesis is devoted to the design of a fast parallel algorithm for determining all roots of a polynomial. Given a polynomial rho(z) of degree n with m-bit integer coefficients and an integer μ, the author considers the problem of determining all its roots with error less than 2^(-μ). It is shown that this problem is in the class NC if rho(z) has all real roots.
Parallel sort with a ranged, partitioned key-value store in a high performance computing environment
Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron; Poole, Stephen W.
2016-01-26
Improved sorting techniques are provided that perform a parallel sort using a ranged, partitioned key-value store in a high performance computing (HPC) environment. A plurality of input data files comprising unsorted key-value data in a partitioned key-value store are sorted. The partitioned key-value store comprises a range server for each of a plurality of ranges. Each input data file has an associated reader thread. Each reader thread reads the unsorted key-value data in the corresponding input data file and performs a local sort of the unsorted key-value data to generate sorted key-value data. A plurality of sorted, ranged subsets of each of the sorted key-value data are generated based on the plurality of ranges. Each sorted, ranged subset corresponds to a given one of the ranges and is provided to one of the range servers corresponding to the range of the sorted, ranged subset. Each range server sorts the received sorted, ranged subsets and provides a sorted range. A plurality of the sorted ranges are concatenated to obtain a globally sorted result.
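The data flow described in this abstract (local sort per reader, split into ranged subsets, merge at each range server, concatenate) can be illustrated with a minimal single-process sketch. The function name `ranged_parallel_sort`, the `boundaries` argument, and the sequential loops standing in for reader threads and range servers are all assumptions made for illustration, not the patented implementation.

```python
from bisect import bisect_right
from heapq import merge


def ranged_parallel_sort(input_files, boundaries):
    """Sort key-value pairs spread over several unsorted inputs.

    input_files: list of lists of (key, value) pairs, one list per reader.
    boundaries:  sorted split keys (hypothetical layout); range i holds
                 keys k with boundaries[i-1] <= k < boundaries[i].
    """
    n_ranges = len(boundaries) + 1
    # inbox[r] collects the sorted subsets sent to range server r.
    inbox = [[] for _ in range(n_ranges)]

    for data in input_files:  # stands in for one reader thread per file
        local = sorted(data, key=lambda kv: kv[0])  # local sort
        subsets = [[] for _ in range(n_ranges)]
        for kv in local:  # split into sorted, ranged subsets
            subsets[bisect_right(boundaries, kv[0])].append(kv)
        for r, subset in enumerate(subsets):
            inbox[r].append(subset)

    result = []
    for r in range(n_ranges):  # stands in for one range server per range
        # Every received subset is already sorted, so a k-way merge suffices.
        result.extend(merge(*inbox[r], key=lambda kv: kv[0]))
    return result  # concatenation of the sorted ranges


files = [[(5, "e"), (1, "a"), (9, "i")], [(3, "c"), (7, "g"), (2, "b")]]
print(ranged_parallel_sort(files, boundaries=[4, 8]))
```

Because each range covers a disjoint key interval, concatenating the per-range results in range order gives a globally sorted sequence without any further comparison across ranges.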
Performance modeling and analysis of parallel Gaussian elimination on multi-core computers
Directory of Open Access Journals (Sweden)
Fadi N. Sibai
2014-01-01
Gaussian elimination is used in many applications and in particular in the solution of systems of linear equations. This paper presents mathematical performance models and analysis of four parallel Gaussian Elimination methods (namely the Original method and the new Meet in the Middle (MiM) algorithms, and their variants with SIMD vectorization) on multi-core systems. Analytical performance models of the four methods are formulated and presented, followed by evaluations of these models with modern multi-core systems’ operation latencies. Our results reveal that the four methods generally exhibit good performance scaling with increasing matrix size and number of cores. SIMD vectorization only makes a large difference in performance for a low number of cores. For a large matrix size (n ⩾ 16 K), the performance difference between the MiM and Original methods falls from 16× with four cores to 4× with 16 K cores. The efficiencies of all four methods are low with 1 K cores or more, stressing a major problem of multi-core systems where the network-on-chip and memory latencies are too high in relation to basic arithmetic operations. Thus Gaussian Elimination can greatly benefit from the resources of multi-core systems, but higher performance gains can be achieved if multi-core systems can be designed with lower memory operation, synchronization, and interconnect communication latencies, requirements of utmost importance and challenge in the exascale computing age.
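For orientation, a minimal serial sketch of textbook Gaussian elimination with partial pivoting is given below. It is not the paper's Original or MiM method; the function name `gaussian_solve` is an assumption. The point of interest for parallelization is that, for a fixed pivot column, the row updates in the inner loop are independent of one another.

```python
def gaussian_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # These row updates are mutually independent (parallelizable).
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


print(gaussian_solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3]))
```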
BPF-type region-of-interest reconstruction for parallel translational computed tomography.
Wu, Weiwen; Yu, Hengyong; Wang, Shaoyu; Liu, Fenglin
2017-01-01
The objective of this study is to present and test a new ultra-low-cost linear-scan-based tomography architecture. Similar to linear tomosynthesis, the source and detector are translated in opposite directions and the data acquisition system targets a region of interest (ROI) to acquire data for image reconstruction. This kind of tomographic architecture was named parallel translational computed tomography (PTCT). In previous studies, filtered backprojection (FBP)-type algorithms were developed to reconstruct images from PTCT. However, the ROI images reconstructed from truncated projections suffer from severe truncation artefacts. In order to overcome this limitation, in this study we propose two backprojection filtering (BPF)-type algorithms, named MP-BPF and MZ-BPF, to reconstruct ROI images from truncated PTCT data. A weight function is constructed to deal with data redundancy for multi-linear translation modes. Extensive numerical simulations are performed to evaluate the proposed MP-BPF and MZ-BPF algorithms for PTCT in fan-beam geometry. Qualitative and quantitative results demonstrate that the proposed BPF-type algorithms can not only reconstruct ROI images from truncated projections more accurately but also generate high-quality images for the entire image support in some circumstances.
A scalable fully implicit framework for reservoir simulation on parallel computers
Yang, Haijian
2017-11-10
The modeling of multiphase fluid flow in porous media is of interest in the field of reservoir simulation. The promising numerical methods in the literature are mostly based on the explicit or semi-implicit approach, both of which have certain stability restrictions on the time step size. In this work, we introduce and study a scalable fully implicit solver for the simulation of two-phase flow in a porous medium with capillarity, gravity and compressibility, which is free from the limitations of the conventional methods. In the fully implicit framework, a mixed finite element method is applied to discretize the model equations for the spatial terms, and the implicit Backward Euler scheme with adaptive time stepping is used for the temporal integration. The resultant nonlinear system arising at each time step is solved in a monolithic way by using a Newton–Krylov type method. The corresponding linear system from the Newton iteration is large, sparse, nonsymmetric and ill-conditioned, consequently posing a significant challenge to the fully implicit solver. To address this issue, the family of additive Schwarz preconditioners is used to accelerate the convergence of the linear solver and thereby improve the robustness of the outer Newton method. Several test cases in one, two and three dimensions are used to validate the correctness of the scheme and examine the performance of the newly developed algorithm on parallel computers.
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels, ranging from the formulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tends to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, the object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present the basics of the Expression Template (ET) idiom, which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show how to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show how to apply these methods i...
Interactive simulation of LEB commissioning procedure on a hypercube parallel computer
International Nuclear Information System (INIS)
Bourianoff, G.; Botlo, M.; Cole, B.; Hunt, S.; Malitsky, N.; Romero, A.
1993-01-01
It is desirable that an interactive simulation of accelerator operation be developed in order to write and test commissioning correction, supervisory control, closed loop control, optimization and automation code prior to machine construction. The simulator should produce realistic diagnostic information, analyze and display the information at a workstation, accept operator input, and react appropriately. Such a system has been developed by the Accelerator System Control Simulator Collaboration to model the Low Energy Booster (LEB). The system is implemented on a 64-node Intel iPSC/860 parallel computer which operates at approximately 600 Mflops. The simulator can track 512 particles on 32 nodes at 1 turn per second using an element-by-element symplectic integrator based on the TEAPOT algorithm. An operator interface has been implemented on a SUN Sparc 2 workstation operating as a client to a VME-based 68040 processor board running the VxWorks real-time operating system. Data display and operator input utilize the operator interface routines in the EPICS control system. Data transfer between the SPARC card and the hypercube is currently accomplished with an interprocess connection. Simulation of the interactive closed orbit smoothing process will be shown.
Wang, Yunzhi; Qiu, Yuchen; Thai, Theresa; Moore, Kathleen; Liu, Hong; Zheng, Bin
2017-06-01
Accurate assessment of adipose tissue volume inside a human body plays an important role in predicting disease or cancer risk, diagnosis and prognosis. In order to overcome the limitation of using only one subjectively selected CT image slice to estimate the size of fat areas, this study aims to develop and test a computer-aided detection (CAD) scheme based on a deep learning technique to automatically segment subcutaneous fat areas (SFA) and visceral fat areas (VFA) depicted on volumetric CT images. A retrospectively collected CT image dataset was divided into two independent training and testing groups. The proposed CAD framework consisted of two steps with two convolution neural networks (CNNs), namely Selection-CNN and Segmentation-CNN. The first CNN was trained using 2,240 CT slices to select abdominal CT slices depicting SFA and VFA. The second CNN was trained with 84,000 pixel patches and applied to the selected CT slices to identify fat-related pixels and assign them into SFA and VFA classes. Compared to the manual CT slice selection and fat pixel segmentation results, the accuracy of CT slice selection using the Selection-CNN was 95.8%, while the accuracy of fat pixel segmentation using the Segmentation-CNN was 96.8%. This study demonstrated the feasibility of applying a new deep learning based CAD scheme to automatically recognize the abdominal section of the human body from CT scans and segment SFA and VFA from volumetric CT data with high accuracy or agreement with the manual segmentation results. Copyright © 2017 Elsevier B.V. All rights reserved.
Blake, Douglas Clifton
A new methodology is presented for conducting numerical simulations of electromagnetic scattering and wave-propagation phenomena on massively parallel computing platforms. A process is constructed which is rooted in the Finite-Volume Time-Domain (FVTD) technique to create a simulation capability that is both versatile and practical. In terms of versatility, the method is platform independent, is easily modifiable, and is capable of solving a large number of problems with no alterations. In terms of practicality, the method is sophisticated enough to solve problems of engineering significance and is not limited to mere academic exercises. In order to achieve this capability, techniques are integrated from several scientific disciplines including computational fluid dynamics, computational electromagnetics, and parallel computing. The end result is the first FVTD solver capable of utilizing the highly flexible overset-gridding process in a distributed-memory computing environment. In the process of creating this capability, work is accomplished to conduct the first study designed to quantify the effects of domain-decomposition dimensionality on the parallel performance of hyperbolic partial differential equations solvers; to develop a new method of partitioning a computational domain comprised of overset grids; and to provide the first detailed assessment of the applicability of overset grids to the field of computational electromagnetics. Using these new methods and capabilities, results from a large number of wave propagation and scattering simulations are presented. The overset-grid FVTD algorithm is demonstrated to produce results of comparable accuracy to single-grid simulations while simultaneously shortening the grid-generation process and increasing the flexibility and utility of the FVTD technique. Furthermore, the new domain-decomposition approaches developed for overset grids are shown to be capable of producing partitions that are better load balanced and
Advanced parallel computing for the coupled PCR-GLOBWB-MODFLOW model
Verkaik, Jarno; Schmitz, Oliver; Sutanudjaja, Edwin
2017-04-01
PCR-GLOBWB (https://github.com/UU-Hydro/PCR-GLOBWB_model) is a large-scale hydrological model intended for global to regional studies and developed at the Department of Physical Geography, Utrecht University (Netherlands). The latest version of the model can simulate terrestrial hydrological and water resource fluxes and storages with a typical spatial resolution of 5 arc-minutes (less than 10 km) at the global extent. One of the recent features in the model development is the inclusion of a global 2-layer MODFLOW model simulating groundwater lateral flow. This advanced feature enables us to simulate and assess the groundwater head dynamics at the global extent, including in regions with declining groundwater head problems. Unfortunately, the current coupled PCR-GLOBWB-MODFLOW requires long run times, mainly attributed to the current inefficient parallel computing and coupling algorithm. In this work, we aim to improve it by setting up a favorable river-basin partitioning scheme that reduces I/O communication and optimizes load balance between PCR-GLOBWB and MODFLOW. We also aim to replace the MODFLOW-2000 in the current coupled model with MODFLOW-USG. This will allow us to use the new Parallel Krylov Solver (PKS) that can run with Message Passing Interface (MPI) and can be easily combined with Open Multi-Processing (OpenMP). The latest scaling test carried out on the Cartesius Dutch National supercomputer shows that the usage of MODFLOW-USG and the new PKS solver can result in significant MODFLOW calculation speedups (up to 45). The encouraging result of this work opens a possibility for running the model with a more detailed setup and at higher resolution. As MODFLOW-USG supports both structured and unstructured grids, this includes an opportunity to have a next generation of PCR-GLOBWB-MODFLOW model that has flexibility in grid design for its groundwater flow simulation (e.g. grid design can be used to focus along rivers and around wells, to discretize individual
International Nuclear Information System (INIS)
Li, X.L.
1993-01-01
Computation of three-dimensional (3-D) Rayleigh-Taylor instability in compressible fluids is performed on a MIMD computer. A second-order TVD scheme is applied with a fully parallelized algorithm to the 3-D Euler equations. The computational program is implemented for a 3-D study of bubble evolution in the Rayleigh-Taylor instability with varying bubble aspect ratio and for large-scale simulation of a 3-D random fluid interface. The numerical solution is compared with the experimental results by Taylor.
2015-01-01
This edited book presents scientific results of the 15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2014), held on June 30 – July 2, 2014 in Las Vegas, Nevada, USA. The aim of this conference was to bring together scientists, engineers, computer users, and students to share their experiences and exchange new ideas, research results about all aspects (theory, applications and tools) of computer and information science, and to discuss the practical challenges encountered along the way and the solutions adopted to solve them. The conference organizers selected the 13 outstanding papers from those papers accepted for presentation at the conference.
Studies in Computational Intelligence : Volume 492
2013-01-01
This edited book presents scientific results of the 14th ACIS/IEEE International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2013), held in Honolulu, Hawaii, USA on July 1-3, 2013. The aim of this conference was to bring together scientists, engineers, computer users, and students to share their experiences and exchange new ideas, research results about all aspects (theory, applications and tools) of computer and information science, and to discuss the practical challenges encountered along the way and the solutions adopted to solve them. The conference organizers selected the 17 outstanding papers from those papers accepted for presentation at the conference.
Energy Technology Data Exchange (ETDEWEB)
Kostin, Mikhail [Michigan State Univ., East Lansing, MI (United States); Mokhov, Nikolai [Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States); Niita, Koji [Research Organization for Information Science and Technology, Ibaraki-ken (Japan)
2013-09-25
A parallel computing framework has been developed for use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran 77, Fortran 90 or C. The module is largely independent of the radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA and MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations from a saved checkpoint file. The checkpoint facility can be used in single-process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and on networks of workstations, where interference from other users is possible.
Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P
Energy Technology Data Exchange (ETDEWEB)
Candel, A; Kabel, A.; Lee, L.; Li, Z.; Ng, C.; Schussman, G.; Ko, K.; /SLAC; Syratchev, I.; /CERN
2009-06-19
In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.
International Nuclear Information System (INIS)
1994-08-01
This is the first annual report of the MPP pilot project 93MPR05. In this pilot project four research groups with different, complementary backgrounds collaborate with the aim to develop new algorithms and codes to simulate the magnetohydrodynamics of thermonuclear and astrophysical plasmas on massively parallel machines. The expected speed-up is required to simulate the dynamics of the hot plasmas of interest which are characterized by very large magnetic Reynolds numbers and, hence, require high spatial and temporal resolutions (for details see section 1). The four research groups that collaborated to produce the results reported here are: The MHD group of Prof. Dr. J.P. Goedbloed at the FOM-Institute for Plasma Physics 'Rijnhuizen' in Nieuwegein, the group of Prof. Dr. H. van der Vorst at the Mathematics Institute of Utrecht University, the group of Prof. Dr. A.G. Hearn at the Astronomical Institute of Utrecht University, and the group of Dr. Ir. H.J.J. te Riele at the CWI in Amsterdam. The full project team met frequently during this first project year to discuss progress reports, current problems, etc. (see section 2). The main results of the first project year are: - Proof of the scalability of typical linear and nonlinear MHD codes - development and testing of a parallel version of the Arnoldi algorithm - development and testing of alternative methods for solving large non-Hermitian eigenvalue problems - porting of the 3D nonlinear semi-implicit time evolution code HERA to an MPP system. The steps that were scheduled to reach these intended results are given in section 3. (orig./WL)
Three-dimensional electromagnetic modeling and inversion on massively parallel computers
Energy Technology Data Exchange (ETDEWEB)
Newman, G.A.; Alumbaugh, D.L. [Sandia National Labs., Albuquerque, NM (United States). Geophysics Dept.
1996-03-01
This report has demonstrated techniques that can be used to construct solutions to the 3-D electromagnetic inverse problem using full wave equation modeling. To this point great progress has been made in developing an inverse solution using the method of conjugate gradients, which employs a 3-D finite difference solver to construct model sensitivities and predicted data. The forward modeling code has been developed to incorporate absorbing boundary conditions for high frequency solutions (radar), as well as complex electrical properties, including electrical conductivity, dielectric permittivity and magnetic permeability. In addition, both forward and inverse codes have been ported to a massively parallel computer architecture, which allows for more realistic solutions than can be achieved with serial machines. While the inversion code has been demonstrated on field data collected at the Richmond field site, techniques for appraising the quality of the reconstructions still need to be developed. Here it is suggested that rather than employing direct matrix inversion to construct the model covariance matrix, which would be impossible because of the size of the problem, one can linearize about the 3-D model achieved in the inverse and use Monte-Carlo simulations to construct it. Using these appraisal and construction tools, it is now necessary to demonstrate 3-D inversion for a variety of EM data sets that span the frequency range from induction sounding to radar: below 100 kHz to 100 MHz. Appraised 3-D images of the Earth's electrical properties can provide researchers opportunities to infer the flow paths, flow rates and perhaps the chemistry of fluids in geologic media. It also offers a means to study the frequency-dependent behavior of the properties in situ. This is of significant relevance to the Department of Energy, and paramount to the characterization and monitoring of environmental waste sites and to oil and gas exploration.
Network-Based Management Procedures.
Buckner, Allen L.
Network-based management procedures serve as valuable aids in organizational management, achievement of objectives, problem solving, and decisionmaking. Network techniques especially applicable to educational management systems are the program evaluation and review technique (PERT) and the critical path method (CPM). Other network charting…
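As an illustration of the critical path method (CPM) named in this abstract, the sketch below computes the longest chain of dependent tasks in a small project network. The task names, durations, and the `critical_path` helper are hypothetical, and the network is assumed to be acyclic.

```python
def critical_path(tasks):
    """tasks maps name -> (duration, [predecessor names]).

    Returns (project_length, list_of_tasks_on_the_critical_path).
    Assumes the dependency network has no cycles.
    """
    finish = {}     # earliest finish time of each task
    best_pred = {}  # predecessor that determines each task's start

    def earliest_finish(name):
        if name not in finish:
            duration, preds = tasks[name]
            start, via = 0, None
            for p in preds:
                f = earliest_finish(p)
                if f > start:
                    start, via = f, p
            finish[name] = start + duration
            best_pred[name] = via
        return finish[name]

    # The project ends when the latest-finishing task ends.
    end = max(tasks, key=earliest_finish)
    path, node = [], end
    while node is not None:  # walk back along the binding predecessors
        path.append(node)
        node = best_pred[node]
    return finish[end], path[::-1]


project = {
    "A": (3, []),
    "B": (2, ["A"]),
    "C": (4, ["A"]),
    "D": (1, ["B", "C"]),
}
print(critical_path(project))
```

Tasks off the critical path (here `B`) have slack: they can be delayed by a bounded amount without delaying the project as a whole.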
Li, Kenli; Zou, Shuting; Xv, Jin
2008-01-01
Elliptic curve cryptographic algorithms convert input data to unrecognizable encryption and the unrecognizable data back again into its original decrypted form. The security of this form of encryption hinges on the enormous difficulty that is required to solve the elliptic curve discrete logarithm problem (ECDLP), especially over GF(2^n), n ∈ Z+. This paper describes an effective method to find solutions to the ECDLP by means of a molecular computer. We propose that this research accomplishment would represent a breakthrough for applied biological computation and this paper demonstrates that in principle this is possible. Three DNA-based algorithms: a parallel adder, a parallel multiplier, and a parallel inverse over GF(2^n) are described. The biological operation time of all of these algorithms is polynomial with respect to n. Considering this analysis, cryptography using a public key might be less secure. In this respect, a principal contribution of this paper is to provide enhanced evidence of the potential of molecular computing to tackle such ambitious computations. PMID:18431451
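To make the three field operations concrete, here is a conventional software sketch of addition, multiplication and inversion in GF(2^n) using a polynomial basis. It is not the DNA-based procedure of the paper. The function names are assumptions, and the example field (n = 8 with reduction polynomial 0x11B, the one used by AES) is chosen only because its arithmetic is easy to check.

```python
def gf_add(a, b):
    """Addition in GF(2^n): coefficient-wise XOR, no carries."""
    return a ^ b


def gf_mul(a, b, n, poly):
    """Multiply a and b in GF(2^n); poly includes the x^n term."""
    result = 0
    for _ in range(n):
        if b & 1:          # add the current shifted copy of a
            result ^= a
        b >>= 1
        a <<= 1
        if a >> n:         # degree reached n: reduce modulo poly
            a ^= poly
    return result


def gf_inv(a, n, poly):
    """Inverse via a^(2^n - 2), using square-and-multiply."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^n)")
    result, e = 1, (1 << n) - 2
    while e:
        if e & 1:
            result = gf_mul(result, a, n, poly)
        a = gf_mul(a, a, n, poly)
        e >>= 1
    return result


N, POLY = 8, 0x11B  # illustrative field only
print(hex(gf_mul(0x57, 0x83, N, POLY)))
print(gf_mul(0x53, gf_inv(0x53, N, POLY), N, POLY))
```

Each operation takes a number of bit-level steps polynomial in n, which is the same order of growth the abstract claims for the biological operation time.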
Directory of Open Access Journals (Sweden)
Kenli Li
2008-01-01
Elliptic curve cryptographic algorithms convert input data to unrecognizable encryption and the unrecognizable data back again into its original decrypted form. The security of this form of encryption hinges on the enormous difficulty that is required to solve the elliptic curve discrete logarithm problem (ECDLP), especially over GF(2^n), n ∈ Z+. This paper describes an effective method to find solutions to the ECDLP by means of a molecular computer. We propose that this research accomplishment would represent a breakthrough for applied biological computation and this paper demonstrates that in principle this is possible. Three DNA-based algorithms: a parallel adder, a parallel multiplier, and a parallel inverse over GF(2^n) are described. The biological operation time of all of these algorithms is polynomial with respect to n. Considering this analysis, cryptography using a public key might be less secure. In this respect, a principal contribution of this paper is to provide enhanced evidence of the potential of molecular computing to tackle such ambitious computations.
Directory of Open Access Journals (Sweden)
Kamatani Naoyuki
2009-10-01
Background: Since more than a million single-nucleotide polymorphisms (SNPs) are analyzed in any given genome-wide association study (GWAS), performing multiple comparisons can be problematic. To cope with multiple-comparison problems in GWAS, haplotype-based algorithms were developed to correct for multiple comparisons at multiple SNP loci in linkage disequilibrium. A permutation test can also control problems inherent in multiple testing; however, both the calculation of exact probability and the execution of permutation tests are time-consuming. Faster methods for calculating exact probabilities and executing permutation tests are required. Methods: We developed a set of computer programs for the parallel computation of accurate P-values in haplotype-based GWAS. Our program, ParaHaplo, is intended for workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on JPT and CHB of HapMap. Results: ParaHaplo can detect smaller differences between 2 populations than SNP-based GWAS. We also found that parallel-computing techniques made ParaHaplo 100-fold faster than a non-parallel version of the program. Conclusion: ParaHaplo is a useful tool in conducting haplotype-based GWAS. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computing, such as that used in ParaHaplo, will become increasingly important. The executable binaries and program sources of ParaHaplo are available at the following address: http://sourceforge.jp/projects/parallelgwas/?_sl=1
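The cost of a permutation test, and why it parallelizes well, can be seen in a generic two-sample sketch. This is not ParaHaplo's haplotype-based algorithm: the statistic (difference of group means), the function name `permutation_p_value`, and the data are assumptions for illustration. Each permutation is independent of the others, so the loop could be split across MPI ranks and the hit counts summed.

```python
import random


def permutation_p_value(cases, controls, n_perm=10000, seed=0):
    """Estimate a two-sided permutation P-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(cases) / len(cases) - sum(controls) / len(controls))
    pooled = list(cases) + list(controls)
    k = len(cases)
    hits = 0
    for _ in range(n_perm):  # independent iterations: easy to distribute
        rng.shuffle(pooled)
        stat = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical allele indicators for two perfectly separated groups.
print(permutation_p_value([1] * 20, [0] * 20))
```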
International Nuclear Information System (INIS)
Dubois, J.
2011-01-01
In science, simulation is a key process for research or validation. Modern computer technology allows faster numerical experiments, which are cheaper than real models. In the field of neutron simulation, the calculation of eigenvalues is one of the key challenges. The complexity of these problems is such that a lot of computing power may be necessary. The work of this thesis is, first, the evaluation of new computing hardware such as graphics cards or massively multi-core chips, and their application to eigenvalue problems for neutron simulation. Then, in order to address the massive parallelism of national supercomputers, we also study the use of asynchronous hybrid methods for solving eigenvalue problems with this very high level of parallelism. We then test this work on several national supercomputers such as the Titane hybrid machine of the Computing Center, Research and Technology (CCRT), the Curie machine of the Very Large Computing Centre (TGCC), currently being installed, and the Hopper machine at the Lawrence Berkeley National Laboratory (LBNL). We also run our experiments on local workstations to illustrate the value of this research in everyday use with local computing resources. (author)
Computational study of Couette flow between parallel plates for steady and unsteady cases
International Nuclear Information System (INIS)
Rihan, Y.
2008-01-01
Couette flow between parallel plates is a classical problem that has important applications in various industrial processes. In this investigation an analytical solution was obtained to predict the steady and unsteady Couette flow between parallel plates. One of the plates was stationary and the other plate moved with constant velocity. The governing partial differential equations were solved numerically using the Crank-Nicolson implicit method to represent the flow behavior of the fluid.
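As a minimal sketch of the numerical approach named above, the following code applies the Crank-Nicolson scheme to the one-dimensional diffusion equation u_t = nu * u_yy, assuming this is the governing equation for unsteady Couette flow with the lower plate at rest and the upper plate started impulsively at velocity U. The function names, grid size, and parameter values are illustrative assumptions, not taken from the paper. The tridiagonal system at each step is solved with the Thomas algorithm.

```python
def solve_tridiagonal(a, b, c, d):
    """Thomas algorithm: a = sub-diagonal, b = diagonal, c = super-diagonal."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):  # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def couette_crank_nicolson(U=1.0, h=1.0, nu=1.0, ny=21, dt=0.01, steps=400):
    """Velocity profile between the plates after `steps` time steps."""
    dy = h / (ny - 1)
    r = nu * dt / dy**2
    u = [0.0] * ny   # fluid initially at rest
    u[-1] = U        # upper plate moves with constant velocity U
    m = ny - 2       # number of interior unknowns
    a = [-r / 2] * m
    b = [1 + r] * m
    c = [-r / 2] * m
    for _ in range(steps):
        # Explicit half of the scheme, evaluated at the old time level.
        d = [
            r / 2 * u[j - 1] + (1 - r) * u[j] + r / 2 * u[j + 1]
            for j in range(1, ny - 1)
        ]
        # Known boundary values at the new time level move to the right side.
        d[0] += r / 2 * u[0]
        d[-1] += r / 2 * u[-1]
        u[1:-1] = solve_tridiagonal(a, b, c, d)
    return u
```

For long times the computed profile should approach the steady linear solution u = U * y / h, which gives a simple check on the implementation.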
Directory of Open Access Journals (Sweden)
Grzeszczuk A.
2015-01-01
Compute Unified Device Architecture (CUDA) is a parallel computing platform developed by Nvidia to increase graphics speed by performing calculations in parallel. The success of this solution has opened General-Purpose Graphics Processing Unit (GPGPU) technology to applications unrelated to graphics. A GPGPU system can be applied as an effective tool for reducing the huge volume of data in pulse shape analysis measurements, either by on-line recalculation or by a very fast compression system. Our poster contribution presents the simplified structure of the CUDA system and its programming model, based on the example of an Nvidia GeForce GTX 580 card, both in a stand-alone version and as a ROOT application.
Directory of Open Access Journals (Sweden)
Matthew J. Greene
2016-03-01
Visual motion information is computed by parallel On and Off pathways in the retina, which lead to On and Off types of starburst amacrine cells (SACs). The approximate mirror symmetry between this pair of cell types suggests that On and Off pathways might compute motion using analogous mechanisms. To test this idea, we reconstructed On SACs and On bipolar cells (BCs) from serial electron microscopic images of a mouse retina. We defined a new On BC type in the course of classifying On BCs. Through quantitative contact analysis, we found evidence that sustained and transient On BC types are wired to On SAC dendrites at different distances from the SAC soma, mirroring our previous wiring diagram for the Off BC-SAC circuit. Our finding is consistent with the hypothesis that On and Off pathways contain parallel correlation-type motion detectors.
Shapiro, Linda G.; Tanimoto, Steven L.; Ahrens, James P.
1996-01-01
The goal of this task was to create a design and prototype implementation of a database environment that is particularly suited for handling the image, vision and scientific data associated with NASA's EOC Amazon project. The focus was on a data model and query facilities that are designed to execute efficiently on parallel computers. A key feature of the environment is an interface that allows a scientist to specify high-level directives about how query execution should occur.
Golomidov, Y. V.; Li, S. K.; Popov, S. A.; Smolov, V. B.
1986-01-01
After a classification and analysis of electronic and optoelectronic switching devices, the design principles and structure of a matrix optical switch are described. The switching and pair-exclusion operations in this type of switch are examined, and a method for the optical switching of communication channels is elaborated. Finally, attention is given to the structural organization of a parallel computer system with a matrix optical switch.
International Nuclear Information System (INIS)
Zee, S.K.
1987-01-01
A numeric algorithm and an associated computer code were developed for the rapid solution of the finite-difference representation of the few-group neutron-diffusion equations on parallel computers. Applications of the numeric algorithm on both SIMD (vector pipeline) and MIMD/SIMD (multi-CPU/vector pipeline) architectures were explored. The algorithm was successfully implemented in the two-group, 3-D neutron diffusion computer code named DIFPAR3D (DIFfusion PARallel 3-Dimension). Numerical-solution techniques used in the code include the Chebyshev polynomial acceleration technique in conjunction with the power method of outer iteration. For inner iterations, the code incorporates a parallel form of red-black (cyclic) line SOR, with automated determination of the group-dependent relaxation factors and iteration numbers required to achieve a specified inner-iteration error tolerance. The code employs a macroscopic depletion model with trace capability for the transients of selected fission products and for critical boron. In addition, moderator and fuel temperature feedback models are incorporated into the DIFPAR3D code for realistic simulation of power reactor cores. The physics models used were proven acceptable in separate benchmarking studies.
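The red-black ordering mentioned above can be illustrated with a simplified sketch. Note the substitutions: this is point SOR rather than the line SOR used in DIFPAR3D, and it solves Laplace's equation on a 2-D square grid rather than the few-group neutron-diffusion equations. The function name, grid size, and relaxation factor are assumptions for illustration. The key property is that every "red" point depends only on "black" neighbors and vice versa, so all points of one color can be updated simultaneously on parallel or vector hardware.

```python
def red_black_sor(n=17, omega=1.5, tol=1e-8, max_iter=10000):
    """Solve Laplace's equation on an n x n grid with u = 1 on one edge."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n  # fixed boundary value on the first row; other edges are 0
    for it in range(1, max_iter + 1):
        change = 0.0
        for color in (0, 1):  # 0 = red sweep, 1 = black sweep
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 != color:
                        continue
                    # Gauss-Seidel value from the four opposite-color neighbors.
                    gs = 0.25 * (
                        u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                    )
                    # Over-relaxation: step past the Gauss-Seidel value.
                    new = (1 - omega) * u[i][j] + omega * gs
                    change = max(change, abs(new - u[i][j]))
                    u[i][j] = new
        if change < tol:
            return u, it
    return u, max_iter
```

Within a single color sweep the inner loops carry no data dependence, so they could be distributed across processors or vectorized. By symmetry, the value at the center of the grid should converge to 0.25 for this boundary condition.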
Energy Technology Data Exchange (ETDEWEB)
Woodward, P. R.
2003-03-26
This report summarizes the results of the project entitled "Piecewise-Parabolic Methods for Parallel Computation with Applications to Unstable Fluid Flow in 2 and 3 Dimensions." This project covers a span of many years, beginning in early 1987. Over that considerable period it has provided the core funding for my research activities in scientific computation at the University of Minnesota. It has supported numerical algorithm development, application of those algorithms to fundamental fluid dynamics problems in order to demonstrate their effectiveness, and the development of scientific visualization software and systems to extract scientific understanding from those applications.
2016-01-01
This edited book presents scientific results of the 16th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2015), which was held on June 1 – 3, 2015 in Takamatsu, Japan. The aim of this conference was to bring together researchers and scientists, businessmen and entrepreneurs, teachers, engineers, computer users, and students to discuss the numerous fields of computer science and to share their experiences and exchange new ideas and information in a meaningful way. It also aimed to present research results on all aspects (theory, applications and tools) of computer and information science, and to discuss the practical challenges encountered along the way and the solutions adopted to solve them.
SNPD 2016
2016-01-01
This edited book presents scientific results of the 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2016), which was held on May 30 - June 1, 2016 in Shanghai, China. The aim of this conference was to bring together researchers and scientists, businessmen and entrepreneurs, teachers, engineers, computer users, and students to discuss the numerous fields of computer science and to share their experiences and exchange new ideas and information in a meaningful way. It also aimed to present research results on all aspects (theory, applications and tools) of computer and information science, and to discuss the practical challenges encountered along the way and the solutions adopted to solve them.