Akl, Selim G
1985-01-01
Parallel Sorting Algorithms explains how to use parallel algorithms to sort a sequence of items on a variety of parallel computers. The book reviews the sorting problem, the parallel models of computation, parallel algorithms, and the lower bounds on the parallel sorting problems. The text also presents twenty different algorithms, such as linear arrays, mesh-connected computers, cube-connected computers. Another example where algorithm can be applied is on the shared-memory SIMD (single instruction stream multiple data stream) computers in which the whole sequence to be sorted can fit in the
Totally parallel multilevel algorithms
Frederickson, Paul O.
1988-01-01
Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which are referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.
Algorithms for parallel computers
International Nuclear Information System (INIS)
Churchhouse, R.F.
1985-01-01
Until relatively recently almost all the algorithms for use on computers had been designed on the (usually unstated) assumption that they were to be run on single processor, serial machines. With the introduction of vector processors, array processors and interconnected systems of mainframes, minis and micros, however, various forms of parallelism have become available. The advantage of parallelism is that it offers increased overall processing speed but it also raises some fundamental questions, including: (i) which, if any, of the existing 'serial' algorithms can be adapted for use in the parallel mode. (ii) How close to optimal can such adapted algorithms be and, where relevant, what are the convergence criteria. (iii) How can we design new algorithms specifically for parallel systems. (iv) For multi-processor systems how can we handle the software aspects of the interprocessor communications. Aspects of these questions illustrated by examples are considered in these lectures. (orig.)
Parallel Algorithms and Patterns
Energy Technology Data Exchange (ETDEWEB)
Robey, Robert W. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2016-06-16
This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.
A survey of parallel multigrid algorithms
Chan, Tony F.; Tuminaro, Ray S.
1987-01-01
A typical multigrid algorithm applied to well-behaved linear-elliptic partial-differential equations (PDEs) is described. Criteria for designing and evaluating parallel algorithms are presented. Before evaluating the performance of some parallel multigrid algorithms, consideration is given to some theoretical complexity results for solving PDEs in parallel and for executing the multigrid algorithm. The effect of mapping and load imbalance on the partial efficiency of the algorithm is studied.
Casanova, Henri; Robert, Yves
2008-01-01
""…The authors of the present book, who have extensive credentials in both research and instruction in the area of parallelism, present a sound, principled treatment of parallel algorithms. … This book is very well written and extremely well designed from an instructional point of view. … The authors have created an instructive and fascinating text. The book will serve researchers as well as instructors who need a solid, readable text for a course on parallelism in computing. Indeed, for anyone who wants an understandable text from which to acquire a current, rigorous, and broad vi
Parallel algorithms for continuum dynamics
International Nuclear Information System (INIS)
Hicks, D.L.; Liebrock, L.M.
1987-01-01
Simply porting existing parallel programs to a new parallel processor may not achieve the full speedup possible; to achieve the maximum efficiency may require redesigning the parallel algorithms for the specific architecture. The authors discuss here parallel algorithms that were developed first for the HEP processor and then ported to the CRAY X-MP/4, the ELXSI/10, and the Intel iPSC/32. Focus is mainly on the most recent parallel processing results produced, i.e., those on the Intel Hypercube. The applications are simulations of continuum dynamics in which the momentum and stress gradients are important. Examples of these are inertial confinement fusion experiments, severe breaks in the coolant system of a reactor, weapons physics, shock-wave physics. Speedup efficiencies on the Intel iPSC Hypercube are very sensitive to the ratio of communication to computation. Great care must be taken in designing algorithms for this machine to avoid global communication. This is much more critical on the iPSC than it was on the three previous parallel processors
Parallel Computing Strategies for Irregular Algorithms
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Algorithmically specialized parallel computers
Snyder, Lawrence; Gannon, Dennis B
1985-01-01
Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer.This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster
Experiments with parallel algorithms for combinatorial problems
G.A.P. Kindervater (Gerard); H.W.J.M. Trienekens
1985-01-01
textabstractIn the last decade many models for parallel computation have been proposed and many parallel algorithms have been developed. However, few of these models have been realized and most of these algorithms are supposed to run on idealized, unrealistic parallel machines. The parallel machines
Arkin, Ethem; Tekinerdogan, Bedir
2016-01-01
Mapping parallel algorithms to parallel computing platforms requires several activities such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform, the mapping of the algorithm to the logical configuration platform and the implementation of the
Parallel algorithms for mapping pipelined and parallel computations
Nicol, David M.
1988-01-01
Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
Parallel Algorithms for Groebner-Basis Reduction
1987-09-25
22209 ELEMENT NO. NO. NO. ACCESSION NO. 11. TITLE (Include Security Classification) * PARALLEL ALGORITHMS FOR GROEBNER -BASIS REDUCTION 12. PERSONAL...All other editions are obsolete. Productivity Engineering in the UNIXt Environment p Parallel Algorithms for Groebner -Basis Reduction Technical Report
Mapping robust parallel multigrid algorithms to scalable memory architectures
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
Parallelization of TMVA Machine Learning Algorithms
Hajili, Mammad
2017-01-01
This report reflects my work on Parallelization of TMVA Machine Learning Algorithms integrated to ROOT Data Analysis Framework during summer internship at CERN. The report consists of 4 impor- tant part - data set used in training and validation, algorithms that multiprocessing applied on them, parallelization techniques and re- sults of execution time changes due to number of workers.
Parallel External Memory Graph Algorithms
DEFF Research Database (Denmark)
Arge, Lars Allan; Goodrich, Michael T.; Sitchinava, Nodari
2010-01-01
In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one o f the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest...... an optimal speedup of Â¿(P) in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts....
Parallel algorithms for numerical linear algebra
van der Vorst, H
1990-01-01
This is the first in a new series of books presenting research results and developments concerning the theory and applications of parallel computers, including vector, pipeline, array, fifth/future generation computers, and neural computers.All aspects of high-speed computing fall within the scope of the series, e.g. algorithm design, applications, software engineering, networking, taxonomy, models and architectural trends, performance, peripheral devices.Papers in Volume One cover the main streams of parallel linear algebra: systolic array algorithms, message-passing systems, algorithms for p
Improvement of Parallel Algorithm for MATRA Code
International Nuclear Information System (INIS)
Kim, Seong-Jin; Seo, Kyong-Won; Kwon, Hyouk; Hwang, Dae-Hyun
2014-01-01
The feasibility study to parallelize the MATRA code was conducted in KAERI early this year. As a result, a parallel algorithm for the MATRA code has been developed to decrease a considerably required computing time to solve a bigsize problem such as a whole core pin-by-pin problem of a general PWR reactor and to improve an overall performance of the multi-physics coupling calculations. It was shown that the performance of the MATRA code was greatly improved by implementing the parallel algorithm using MPI communication. For problems of a 1/8 core and whole core for SMART reactor, a speedup was evaluated as about 10 when the numbers of used processor were 25. However, it was also shown that the performance deteriorated as the axial node number increased. In this paper, the procedure of a communication between processors is optimized to improve the previous parallel algorithm.. To improve the performance deterioration of the parallelized MATRA code, the communication algorithm between processors was newly presented. It was shown that the speedup was improved and stable regardless of the axial node number
Parallel Algorithms for the Exascale Era
Energy Technology Data Exchange (ETDEWEB)
Robey, Robert W. [Los Alamos National Laboratory
2016-10-19
New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this work has been done by undergraduates and published in leading scientific journals.
Parallelization of a spherical Sn transport theory algorithm
International Nuclear Information System (INIS)
Haghighat, A.
1989-01-01
The work described in this paper derives a parallel algorithm for an R-dependent spherical S N transport theory algorithm and studies its performance by testing different sample problems. The S N transport method is one of the most accurate techniques used to solve the linear Boltzmann equation. Several studies have been done on the vectorization of the S N algorithms; however, very few studies have been performed on the parallelization of this algorithm. Weinke and Hommoto have looked at the parallel processing of the different energy groups, and Azmy recently studied the parallel processing of the inner iterations of an X-Y S N nodal transport theory method. Both studies have reported very encouraging results, which have prompted us to look at the parallel processing of an R-dependent S N spherical geometry algorithm. This geometry was chosen because, in spite of its simplicity, it contains the complications of the curvilinear geometries (i.e., redistribution of neutrons over the discretized angular bins)
Parallel Algorithms for Switching Edges in Heterogeneous Graphs.
Bhuiyan, Hasanuzzaman; Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav
2017-06-01
An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors.
A Novel Parallel Algorithm for Edit Distance Computation
Directory of Open Access Journals (Sweden)
Muhammad Murtaza Yousaf
2018-01-01
Full Text Available The edit distance between two sequences is the minimum number of weighted transformation-operations that are required to transform one string into the other. The weighted transformation-operations are insert, remove, and substitute. Dynamic programming solution to find edit distance exists but it becomes computationally intensive when the lengths of strings become very large. This work presents a novel parallel algorithm to solve edit distance problem of string matching. The algorithm is based on resolving dependencies in the dynamic programming solution of the problem and it is able to compute each row of edit distance table in parallel. In this way, it becomes possible to compute the complete table in min(m,n iterations for strings of size m and n whereas state-of-the-art parallel algorithm solves the problem in max(m,n iterations. The proposed algorithm also increases the amount of parallelism in each of its iteration. The algorithm is also capable of exploiting spatial locality while its implementation. Additionally, the algorithm works in a load balanced way that further improves its performance. The algorithm is implemented for multicore systems having shared memory. Implementation of the algorithm in OpenMP shows linear speedup and better execution time as compared to state-of-the-art parallel approach. Efficiency of the algorithm is also proven better in comparison to its competitor.
Energy Technology Data Exchange (ETDEWEB)
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving`s meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia`s {open_quotes}tiling{close_quotes} dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
An efficient parallel algorithm for matrix-vector multiplication
Energy Technology Data Exchange (ETDEWEB)
Hendrickson, B.; Leland, R.; Plimpton, S.
1993-03-01
The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/[radical]p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
Iterative algorithms for large sparse linear systems on parallel computers
Adams, L. M.
1982-01-01
Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
A Parallel Butterfly Algorithm
Poulson, Jack; Demanet, Laurent; Maxwell, Nicholas; Ying, Lexing
2014-01-01
The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform (Equation Presented.) at large numbers of target points when the kernel, K(x, y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(Nd) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r2Nd logN). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of α and per-process inverse bandwidth of β, executes in at most (Equation Presented.) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x, y) = exp(iΦ(x, y)), where Φ(x, y) is a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms, and an analogue of a three-dimensional generalized Radon transform were, respectively, observed to strong-scale from 1-node/16-cores up to 1024-nodes/16,384-cores with greater than 90% and 82% efficiency, respectively. © 2014 Society for Industrial and Applied Mathematics.
A Parallel Butterfly Algorithm
Poulson, Jack
2014-02-04
The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform (Equation Presented.) at large numbers of target points when the kernel, K(x, y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(Nd) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r2Nd logN). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of α and per-process inverse bandwidth of β, executes in at most (Equation Presented.) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x, y) = exp(iΦ(x, y)), where Φ(x, y) is a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms, and an analogue of a three-dimensional generalized Radon transform were, respectively, observed to strong-scale from 1-node/16-cores up to 1024-nodes/16,384-cores with greater than 90% and 82% efficiency, respectively. © 2014 Society for Industrial and Applied Mathematics.
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
High Performance Parallel Multigrid Algorithms for Unstructured Grids
Frederickson, Paul O.
1996-01-01
We describe a high performance parallel multigrid algorithm for a rather general class of unstructured grid problems in two and three dimensions. The algorithm PUMG, for parallel unstructured multigrid, is related in structure to the parallel multigrid algorithm PSMG introduced by McBryan and Frederickson, for they both obtain a higher convergence rate through the use of multiple coarse grids. Another reason for the high convergence rate of PUMG is its smoother, an approximate inverse developed by Baumgardner and Frederickson.
Parallel image encryption algorithm based on discretized chaotic map
International Nuclear Information System (INIS)
Zhou Qing; Wong Kwokwo; Liao Xiaofeng; Xiang Tao; Hu Yue
2008-01-01
Recently, a variety of chaos-based algorithms were proposed for image encryption. Nevertheless, none of them works efficiently in parallel computing environment. In this paper, we propose a framework for parallel image encryption. Based on this framework, a new algorithm is designed using the discretized Kolmogorov flow map. It fulfills all the requirements for a parallel image encryption algorithm. Moreover, it is secure and fast. These properties make it a good choice for image encryption on parallel computing platforms
Research on parallel algorithm for sequential pattern mining
Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao
2008-03-01
Sequential pattern mining is the mining of frequent sequences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field has not been confined to the business database and has extended to new data sources such as Web and advanced science fields such as DNA analysis. The data of sequential pattern mining has characteristics as follows: mass data amount and distributed storage. Most existing sequential pattern mining algorithms haven't considered the above-mentioned characteristics synthetically. According to the traits mentioned above and combining the parallel theory, this paper puts forward a new distributed parallel algorithm SPP(Sequential Pattern Parallel). The algorithm abides by the principal of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn't generate the candidated sequences, which abates the access time and improves the mining efficiency. Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.
Huang, Fang; Liu, Dingsheng; Tan, Xicheng; Wang, Jian; Chen, Yunping; He, Binbin
2011-04-01
To design and implement an open-source parallel GIS (OP-GIS) based on a Linux cluster, the parallel inverse distance weighting (IDW) interpolation algorithm has been chosen as an example to explore the working model and the principle of algorithm parallel pattern (APP), one of the parallelization patterns for OP-GIS. Based on an analysis of the serial IDW interpolation algorithm of GRASS GIS, this paper has proposed and designed a specific parallel IDW interpolation algorithm, incorporating both single process, multiple data (SPMD) and master/slave (M/S) programming modes. The main steps of the parallel IDW interpolation algorithm are: (1) the master node packages the related information, and then broadcasts it to the slave nodes; (2) each node calculates its assigned data extent along one row using the serial algorithm; (3) the master node gathers the data from all nodes; and (4) iterations continue until all rows have been processed, after which the results are outputted. According to the experiments performed in the course of this work, the parallel IDW interpolation algorithm can attain an efficiency greater than 0.93 compared with similar algorithms, which indicates that the parallel algorithm can greatly reduce processing time and maximize speed and performance.
Parallel algorithms and architecture for computation of manipulator forward dynamics
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n exp 2), and the O(n exp 3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processors bounds are of O(log2/2n) and O(n exp 4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n exp 3) serial algorithms. Parallel computation of the O(n exp 3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation the forward dynamics.
Introduction to parallel algorithms and architectures arrays, trees, hypercubes
Leighton, F Thomson
1991-01-01
Introduction to Parallel Algorithms and Architectures: Arrays Trees Hypercubes provides an introduction to the expanding field of parallel algorithms and architectures. This book focuses on parallel computation involving the most popular network architectures, namely, arrays, trees, hypercubes, and some closely related networks.Organized into three chapters, this book begins with an overview of the simplest architectures of arrays and trees. This text then presents the structures and relationships between the dominant network architectures, as well as the most efficient parallel algorithms for
Parallelization of a blind deconvolution algorithm
Matson, Charles L.; Borelli, Kathy J.
2006-09-01
Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image depending how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.
Graph Transformation and Designing Parallel Sparse Matrix Algorithms beyond Data Dependence Analysis
Directory of Open Access Journals (Sweden)
H.X. Lin
2004-01-01
Full Text Available Algorithms are often parallelized based on data dependence analysis manually or by means of parallel compilers. Some vector/matrix computations such as the matrix-vector products with simple data dependence structures (data parallelism can be easily parallelized. For problems with more complicated data dependence structures, parallelization is less straightforward. The data dependence graph is a powerful means for designing and analyzing parallel algorithms. However, for sparse matrix computations, parallelization based on solely exploiting the existing parallelism in an algorithm does not always give satisfactory results. For example, the conventional Gaussian elimination algorithm for the solution of a tri-diagonal system is inherently sequential, so algorithms specially for parallel computation has to be designed. After briefly reviewing different parallelization approaches, a powerful graph formalism for designing parallel algorithms is introduced. This formalism will be discussed using a tri-diagonal system as an example. Its application to general matrix computations is also discussed. Its power in designing parallel algorithms beyond the ability of data dependence analysis is shown by means of a new algorithm called ACER (Alternating Cyclic Elimination and Reduction algorithm.
Fundamental Parallel Algorithms for Private-Cache Chip Multiprocessors
DEFF Research Database (Denmark)
Arge, Lars Allan; Goodrich, Michael T.; Nelson, Michael
2008-01-01
about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present...... two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks....... In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs....
Parallel data encryption with RSA algorithm
Неретин, А. А.
2016-01-01
In this paper a parallel RSA algorithm with preliminary shuffling of source text was presented.Dependence of an encryption speed on the number of encryption nodes has been analysed, The proposed algorithm was implemented on C# language.
Discrete Hadamard transformation algorithm's parallelism analysis and achievement
Hu, Hui
2009-07-01
With respect to Discrete Hadamard Transformation (DHT) wide application in real-time signal processing while limitation in operation speed of DSP. The article makes DHT parallel research and its parallel performance analysis. Based on multiprocessor platform-TMS320C80 programming structure, the research is carried out to achieve two kinds of parallel DHT algorithms. Several experiments demonstrated the effectiveness of the proposed algorithms.
New Parallel Algorithms for Landscape Evolution Model
Jin, Y.; Zhang, H.; Shi, Y.
2017-12-01
Most landscape evolution models (LEM) developed in the last two decades solve the diffusion equation to simulate the transportation of surface sediments. This numerical approach is difficult to parallelize due to the computation of drainage area for each node, which needs huge amount of communication if run in parallel. In order to overcome this difficulty, we developed two parallel algorithms for LEM with a stream net. One algorithm handles the partition of grid with traditional methods and applies an efficient global reduction algorithm to do the computation of drainage areas and transport rates for the stream net; the other algorithm is based on a new partition algorithm, which partitions the nodes in catchments between processes first, and then partitions the cells according to the partition of nodes. Both methods focus on decreasing communication between processes and take the advantage of massive computing techniques, and numerical experiments show that they are both adequate to handle large scale problems with millions of cells. We implemented the two algorithms in our program based on the widely used finite element library deal.II, so that it can be easily coupled with ASPECT.
Where are the parallel algorithms?
Voigt, R. G.
1985-01-01
Four paradigms that can be useful in developing parallel algorithms are discussed. These include computational complexity analysis, changing the order of computation, asynchronous computation, and divide and conquer. Each is illustrated with an example from scientific computation, and it is shown that computational complexity must be used with great care or an inefficient algorithm may be selected.
Parallel conjugate gradient algorithms for manipulator dynamic simulation
Fijany, Amir; Scheld, Robert E.
1989-01-01
Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n sq) on a serial processor. A conjugate gradient algorithms is presented that provide greater efficiency using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log sub 2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log sub 2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).
Empirical study of parallel LRU simulation algorithms
Carr, Eric; Nicol, David M.
1994-01-01
This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithm are more complex, but have costs that are independent on the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithm implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.
Arkin, Ethem; Tekinerdogan, Bedir; Imre, Kayhan M.
2017-01-01
The need for high-performance computing together with the increasing trend from single processor to parallel computer architectures has leveraged the adoption of parallel computing. To benefit from parallel computing power, usually parallel algorithms are defined that can be mapped and executed
Comparison of multihardware parallel implementations for a phase unwrapping algorithm
Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo
2018-04-01
Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel kind algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architecture such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
Parallel/vector algorithms for the spherical SN transport theory method
International Nuclear Information System (INIS)
Haghighat, A.; Mattis, R.E.
1990-01-01
This paper discusses vector and parallel processing of a 1-D curvilinear (i.e. spherical) S N transport theory algorithm on the Cornell National SuperComputer Facility (CNSF) IBM 3090/600E. Two different vector algorithms were developed and parallelized based on angular decomposition. It is shown that significant speedups are attainable. For example, for problems with large granularity, using 4 processors, the parallel/vector algorithm achieves speedups (for wall-clock time) of more than 4.5 relative to the old serial/scalar algorithm. Furthermore, this work has demonstrated the existing potential for the development of faster processing vector and parallel algorithms for multidimensional curvilinear geometries. (author)
Online Algorithms for Parallel Job Scheduling and Strip Packing
Hurink, Johann L.; Paulus, J.J.
We consider the online scheduling problem of parallel jobs on parallel machines, $P|online{−}list,m_j |C_{max}$. For this problem we present a 6.6623-competitive algorithm. This improves the best known 7-competitive algorithm for this problem. The presented algorithm also applies to the problem
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations
Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw
2005-01-01
A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel e ciency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
Algorithms for computational fluid dynamics n parallel processors
International Nuclear Information System (INIS)
Van de Velde, E.F.
1986-01-01
A study of parallel algorithms for the numerical solution of partial differential equations arising in computational fluid dynamics is presented. The actual implementation on parallel processors of shared and nonshared memory design is discussed. The performance of these algorithms is analyzed in terms of machine efficiency, communication time, bottlenecks and software development costs. For elliptic equations, a parallel preconditioned conjugate gradient method is described, which has been used to solve pressure equations discretized with high order finite elements on irregular grids. A parallel full multigrid method and a parallel fast Poisson solver are also presented. Hyperbolic conservation laws were discretized with parallel versions of finite difference methods like the Lax-Wendroff scheme and with the Random Choice method. Techniques are developed for comparing the behavior of an algorithm on different architectures as a function of problem size and local computational effort. Effective use of these advanced architecture machines requires the use of machine dependent programming. It is shown that the portability problems can be minimized by introducing high level operations on vectors and matrices structured into program libraries
Analysis of a parallel multigrid algorithm
Chan, Tony F.; Tuminaro, Ray S.
1989-01-01
The parallel multigrid algorithm of Frederickson and McBryan (1987) is considered. This algorithm uses multiple coarse-grid problems (instead of one problem) in the hope of accelerating convergence and is found to have a close relationship to traditional multigrid methods. Specifically, the parallel coarse-grid correction operator is identical to a traditional multigrid coarse-grid correction operator, except that the mixing of high and low frequencies caused by aliasing error is removed. Appropriate relaxation operators can be chosen to take advantage of this property. Comparisons between the standard multigrid and the new method are made.
A Parallel Prefix Algorithm for Almost Toeplitz Tridiagonal Systems
Sun, Xian-He; Joslin, Ronald D.
1995-01-01
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study has been conducted to provide a simple truncation formula. Experimental results have been measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. Experimental results show that the simple parallel prefix algorithm is a good algorithm for symmetric, almost symmetric Toeplitz tridiagonal systems and for the compact scheme on high-performance computers.
A parallel 2-opt algorithm for the traveling salesman problem
Verhoeven, M.G.A.; Aarts, E.H.L.; Swinkels, P.C.J.
1995-01-01
We present a scalable parallel local search algorithm based on data parallelism. The concept of distributed neighborhood structures is introduced, and applied to the Traveling Salesman Problem (TSP). Our parallel local search algorithm finds the same quality solutions as the classical 2-opt
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis
Brouwer, Randall Jay
1991-01-01
The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
Exact parallel maximum clique algorithm for general and protein graphs.
Depolli, Matjaž; Konc, Janez; Rozman, Kati; Trobec, Roman; Janežič, Dušanka
2013-09-23
A new exact parallel maximum clique algorithm MaxCliquePara, which finds the maximum clique (the fully connected subgraph) in undirected general and protein graphs, is presented. First, a new branch and bound algorithm for finding a maximum clique on a single computer core, which builds on ideas presented in two published state of the art sequential algorithms is implemented. The new sequential MaxCliqueSeq algorithm is faster than the reference algorithms on both DIMACS benchmark graphs as well as on protein-derived product graphs used for protein structural comparisons. Next, the MaxCliqueSeq algorithm is parallelized by splitting the branch-and-bound search tree to multiple cores, resulting in MaxCliquePara algorithm. The ability to exploit all cores efficiently makes the new parallel MaxCliquePara algorithm markedly superior to other tested algorithms. On a 12-core computer, the parallelization provides up to 2 orders of magnitude faster execution on the large DIMACS benchmark graphs and up to an order of magnitude faster execution on protein product graphs. The algorithms are freely accessible on http://commsys.ijs.si/~matjaz/maxclique.
Efficient sequential and parallel algorithms for record linkage.
Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar
2014-01-01
Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and synthetic datasets ranging in size from 50,000 to 9,000,000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
A class of parallel algorithms for computation of the manipulator inertia matrix
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
New algorithms for parallel MRI
International Nuclear Information System (INIS)
Anzengruber, S; Ramlau, R; Bauer, F; Leitao, A
2008-01-01
Magnetic Resonance Imaging with parallel data acquisition requires algorithms for reconstructing the patient's image from a small number of measured lines of the Fourier domain (k-space). In contrast to well-known algorithms like SENSE and GRAPPA and its flavors we consider the problem as a non-linear inverse problem. However, in order to avoid cost intensive derivatives we will use Landweber-Kaczmarz iteration and in order to improve the overall results some additional sparsity constraints.
Contact-impact algorithms on parallel computers
International Nuclear Information System (INIS)
Zhong Zhihua; Nilsson, Larsgunnar
1994-01-01
Contact-impact algorithms on parallel computers are discussed within the context of explicit finite element analysis. The algorithms concerned include a contact searching algorithm and an algorithm for contact force calculations. The contact searching algorithm is based on the territory concept of the general HITA algorithm. However, no distinction is made between different contact bodies, or between different contact surfaces. All contact segments from contact boundaries are taken as a single set. Hierarchy territories and contact territories are expanded. A three-dimensional bucket sort algorithm is used to sort contact nodes. The defence node algorithm is used in the calculation of contact forces. Both the contact searching algorithm and the defence node algorithm are implemented on the connection machine CM-200. The performance of the algorithms is examined under different circumstances, and numerical results are presented. ((orig.))
Parallel grid generation algorithm for distributed memory computers
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
Parallel algorithms for boundary value problems
Lin, Avi
1991-01-01
A general approach to solve boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step where all the P available processors work in parallel, and the global step where one processor solves a tridiagonal linear system of the order P. The main advantages of this approach are twofold. First, this suggested approach is very flexible, especially in the local step and thus the algorithm can be used with any number of processors and with any of the SIMD or MIMD machines. Secondly, the communication complexity is very small and thus can be used as easily with shared memory machines. Several examples for using this strategy are discussed.
Improved Parallel Three-List Algorithm for the Knapsack Problem without Memory Conflicts
Institute of Scientific and Technical Information of China (English)
Pan Jun; Li Kenli; Li Qinghua
2006-01-01
Based on the two-list algorithm and the parallel three-list algorithm, an improved parallel three-list algorithm for knapsack problem is proposed, in which the method of divide and conquer, and parallel merging without memory conflicts are adopted. To find a solution for the n-element knapsack problem, the proposed algorithm needs O(23n/8) time when O(23n/8) shared memory units and O(2n/4) processors are available. The comparisons between the proposed algorithm and 10 existing algorithms show that the improved parallel three-list algorithm is the first exclusive-read exclusive-write (EREW) parallel algorithm that can solve the knapsack instances in less than O(2n/2) time when the available hardware resource is smaller than O(2n/2), and hence is an improved result over the past researches.
Qin, Cheng-Zhi; Zhan, Lijun
2012-06-01
As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU
An Algorithm for Parallel Sn Sweeps on Unstructured Meshes
International Nuclear Information System (INIS)
Pautz, Shawn D.
2002-01-01
A new algorithm for performing parallel S n sweeps on unstructured meshes is developed. The algorithm uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned mesh. For typical problems and with 'normal' mesh partitionings, nearly linear speedups on up to 126 processors are observed. This is an important and desirable result, since although analyses of structured meshes indicate that parallel sweeps will not scale with normal partitioning approaches, no severe asymptotic degradation in the parallel efficiency is observed with modest (≤100) levels of parallelism. This result is a fundamental step in the development of efficient parallel S n methods
A Parallel Encryption Algorithm Based on Piecewise Linear Chaotic Map
Directory of Open Access Journals (Sweden)
Xizhong Wang
2013-01-01
Full Text Available We introduce a parallel chaos-based encryption algorithm for taking advantage of multicore processors. The chaotic cryptosystem is generated by the piecewise linear chaotic map (PWLCM. The parallel algorithm is designed with a master/slave communication model with the Message Passing Interface (MPI. The algorithm is suitable not only for multicore processors but also for the single-processor architecture. The experimental results show that the chaos-based cryptosystem possesses good statistical properties. The parallel algorithm provides much better performance than the serial ones and would be useful to apply in encryption/decryption file with large size or multimedia.
Parallel GPU implementation of iterative PCA algorithms.
Andrecut, M
2009-11-01
Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets, the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).
The Parallel Algorithm Based on Genetic Algorithm for Improving the Performance of Cognitive Radio
Directory of Open Access Journals (Sweden)
Liu Miao
2018-01-01
Full Text Available The intercarrier interference (ICI problem of cognitive radio (CR is severe. In this paper, the machine learning algorithm is used to obtain the optimal interference subcarriers of an unlicensed user (un-LU. Masking the optimal interference subcarriers can suppress the ICI of CR. Moreover, the parallel ICI suppression algorithm is designed to improve the calculation speed and meet the practical requirement of CR. Simulation results show that the data transmission rate threshold of un-LU can be set, the data transmission quality of un-LU can be ensured, the ICI of a licensed user (LU is suppressed, and the bit error rate (BER performance of LU is improved by implementing the parallel suppression algorithm. The ICI problem of CR is solved well by the new machine learning algorithm. The computing performance of the algorithm is improved by designing a new parallel structure and the communication performance of CR is enhanced.
A Parallel Saturation Algorithm on Shared Memory Architectures
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm
Povitsky, A.
1998-01-01
In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty about two times over the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
International Nuclear Information System (INIS)
Mo Zeyao
2004-11-01
Multiphysics parallel numerical simulations are usually essential to simplify researches on complex physical phenomena in which several physics are tightly coupled. It is very important on how to concatenate those coupled physics for fully scalable parallel simulation. Meanwhile, three objectives should be balanced, the first is efficient data transfer among simulations, the second and the third are efficient parallel executions and simultaneously developments of those simulation codes. Two concatenating algorithms for multiphysics parallel numerical simulations coupling radiation hydrodynamics with neutron transport on unstructured grid are presented. The first algorithm, Fully Loosely Concatenation (FLC), focuses on the independence of code development and the independence running with optimal performance of code. The second algorithm. Two Level Tightly Concatenation (TLTC), focuses on the optimal tradeoffs among above three objectives. Theoretical analyses for communicational complexity and parallel numerical experiments on hundreds of processors on two parallel machines have showed that these two algorithms are efficient and can be generalized to other multiphysics parallel numerical simulations. In especial, algorithm TLTC is linearly scalable and has achieved the optimal parallel performance. (authors)
Massively Parallel Algorithms for Solution of Schrodinger Equation
Fijany, Amir; Barhen, Jacob; Toomerian, Nikzad
1994-01-01
In this paper massively parallel algorithms for solution of Schrodinger equation are developed. Our results clearly indicate that the Crank-Nicolson method, in addition to its excellent numerical properties, is also highly suitable for massively parallel computation.
Combined spatial/angular domain decomposition SN algorithms for shared memory parallel machines
International Nuclear Information System (INIS)
Hunter, M.A.; Haghighat, A.
1993-01-01
Several parallel processing algorithms on the basis of spatial and angular domain decomposition methods are developed and incorporated into a two-dimensional discrete ordinates transport theory code. These algorithms divide the spatial and angular domains into independent subdomains so that the flux calculations within each subdomain can be processed simultaneously. Two spatial parallel algorithms (Block-Jacobi, red-black), one angular parallel algorithm (η-level), and their combinations are implemented on an eight processor CRAY Y-MP. Parallel performances of the algorithms are measured using a series of fixed source RZ geometry problems. Some of the results are also compared with those executed on an IBM 3090/600J machine. (orig.)
A new parallel molecular dynamics algorithm for organic systems
International Nuclear Information System (INIS)
Plimpton, S.; Hendrickson, B.; Heffelfinger, G.
1993-01-01
A new parallel algorithm for simulating bonded molecular systems such as polymers and proteins by molecular dynamics (MD) is presented. In contrast to methods that extract parallelism by breaking the spatial domain into sub-pieces, the new method does not require regular geometries or uniform particle densities to achieve high parallel efficiency. For very large, regular systems spatial methods are often the best choice, but in practice the new method is faster for systems with tens-of-thousands of atoms simulated on large numbers of processors. It is also several times faster than the techniques commonly used for parallelizing bonded MD that assign a subset of atoms to each processor and require all-to-all communication. Implementation of the algorithm in a CHARMm-like MD model with many body forces and constraint dynamics is discussed and timings on the Intel Delta and Paragon machines are given. Example calculations using the algorithm in simulations of polymers and liquid-crystal molecules will also be briefly discussed
On а Recursive-Parallel Algorithm for Solving the Knapsack Problem
Directory of Open Access Journals (Sweden)
Vladimir V. Vasilchikov
2018-01-01
Full Text Available In this paper, we offer an efficient parallel algorithm for solving the NP-complete Knapsack Problem in its basic, so-called 0-1 variant. To find its exact solution, algorithms belonging to the category ”branch and bound methods” have long been used. To speed up the solving with varying degrees of efficiency, various options for parallelizing computations are also used. We propose here an algorithm for solving the problem, based on the paradigm of recursive-parallel computations. We consider it suited well for problems of this kind, when it is difficult to immediately break up the computations into a sufficient number of subtasks that are comparable in complexity, since they appear dynamically at run time. We used the RPM ParLib library, developed by the author, as the main tool to program the algorithm. This library allows us to develop effective applications for parallel computing on a local network in the .NET Framework. Such applications have the ability to generate parallel branches of computation directly during program execution and dynamically redistribute work between computing modules. Any language with support for the .NET Framework can be used as a programming language in conjunction with this library. For our experiments, we developed some C# applications using this library. The main purpose of these experiments was to study the acceleration achieved by recursive-parallel computing. A detailed description of the algorithm and its testing, as well as the results obtained, are also given in the paper.
Parallel algorithms on the ASTRA SIMD machine
International Nuclear Information System (INIS)
Odor, G.; Rohrbach, F.; Vesztergombi, G.; Varga, G.; Tatrai, F.
1996-01-01
In view of the tremendous computing power jump of modern RISC processors the interest in parallel computing seems to be thinning out. Why use a complicated system of parallel processors, if the problem can be solved by a single powerful micro-chip. It is a general law, however, that exponential growth will always end by some kind of a saturation, and then parallelism will again become a hot topic. We try to prepare ourselves for this eventuality. The MPPC project started in 1990 in the keydeys of parallelism and produced four ASTRA machines (presented at CHEP's 92) with 4k processors (which are expandable to 16k) based on yesterday's chip-technology (chip presented at CHEP'91). These machines now provide excellent test-beds for algorithmic developments in a complete, real environment. We are developing for example fast-pattern recognition algorithms which could be used in high-energy physics experiments at the LHC (planned to be operational after 2004 at CERN) for triggering and data reduction. The basic feature of our ASP (Associate String Processor) approach is to use extremely simple (thus very cheap) processor elements but in huge quantities (up to millions of processors) connected together by a very simple string-like communication chain. In this paper we present powerful algorithms based on this architecture indicating the performance perspectives if the hardware quality reaches present or even future technology levels. (author)
Parallel algorithms for nuclear reactor analysis via domain decomposition method
International Nuclear Information System (INIS)
Kim, Yong Hee
1995-02-01
In this thesis, the neutron diffusion equation in reactor physics is discretized by the finite difference method and is solved on a parallel computer network which is composed of T-800 transputers. T-800 transputer is a message-passing type MIMD (multiple instruction streams and multiple data streams) architecture. A parallel variant of Schwarz alternating procedure for overlapping subdomains is developed with domain decomposition. The thesis provides convergence analysis and improvement of the convergence of the algorithm. The convergence of the parallel Schwarz algorithms with DN(or ND), DD, NN, and mixed pseudo-boundary conditions(a weighted combination of Dirichlet and Neumann conditions) is analyzed for both continuous and discrete models in two-subdomain case and various underlying features are explored. The analysis shows that the convergence rate of the algorithm highly depends on the pseudo-boundary conditions and the theoretically best one is the mixed boundary conditions(MM conditions). Also it is shown that there may exist a significant discrepancy between continuous model analysis and discrete model analysis. In order to accelerate the convergence of the parallel Schwarz algorithm, relaxation in pseudo-boundary conditions is introduced and the convergence analysis of the algorithm for two-subdomain case is carried out. The analysis shows that under-relaxation of the pseudo-boundary conditions accelerates the convergence of the parallel Schwarz algorithm if the convergence rate without relaxation is negative, and any relaxation(under or over) decelerates convergence if the convergence rate without relaxation is positive. Numerical implementation of the parallel Schwarz algorithm on an MIMD system requires multi-level iterations: two levels for fixed source problems, three levels for eigenvalue problems. Performance of the algorithm turns out to be very sensitive to the iteration strategy. In general, multi-level iterations provide good performance when
Parallel algorithms and cluster computing
Hoffmann, Karl Heinz
2007-01-01
This book presents major advances in high performance computing as well as major advances due to high performance computing. It contains a collection of papers in which results achieved in the collaboration of scientists from computer science, mathematics, physics, and mechanical engineering are presented. From the science problems to the mathematical algorithms and on to the effective implementation of these algorithms on massively parallel and cluster computers we present state-of-the-art methods and technology as well as exemplary results in these fields. This book shows that problems which seem superficially distinct become intimately connected on a computational level.
Adapting algorithms to massively parallel hardware
Sioulas, Panagiotis
2016-01-01
In the recent years, the trend in computing has shifted from delivering processors with faster clock speeds to increasing the number of cores per processor. This marks a paradigm shift towards parallel programming in which applications are programmed to exploit the power provided by multi-cores. Usually there is gain in terms of the time-to-solution and the memory footprint. Specifically, this trend has sparked an interest towards massively parallel systems that can provide a large number of processors, and possibly computing nodes, as in the GPUs and MPPAs (Massively Parallel Processor Arrays). In this project, the focus was on two distinct computing problems: k-d tree searches and track seeding cellular automata. The goal was to adapt the algorithms to parallel systems and evaluate their performance in different cases.
A scalable method for parallelizing sampling-based motion planning algorithms
Jacobs, Sam Ade; Manavi, Kasra; Burgos, Juan; Denny, Jory; Thomas, Shawna; Amato, Nancy M.
2012-01-01
This paper describes a scalable method for parallelizing sampling-based motion planning algorithms. It subdivides configuration space (C-space) into (possibly overlapping) regions and independently, in parallel, uses standard (sequential) sampling-based planners to construct roadmaps in each region. Next, in parallel, regional roadmaps in adjacent regions are connected to form a global roadmap. By subdividing the space and restricting the locality of connection attempts, we reduce the work and inter-processor communication associated with nearest neighbor calculation, a critical bottleneck for scalability in existing parallel motion planning methods. We show that our method is general enough to handle a variety of planning schemes, including the widely used Probabilistic Roadmap (PRM) and Rapidly-exploring Random Trees (RRT) algorithms. We compare our approach to two other existing parallel algorithms and demonstrate that our approach achieves better and more scalable performance. Our approach achieves almost linear scalability on a 2400 core LINUX cluster and on a 153,216 core Cray XE6 petascale machine. © 2012 IEEE.
A scalable method for parallelizing sampling-based motion planning algorithms
Jacobs, Sam Ade
2012-05-01
This paper describes a scalable method for parallelizing sampling-based motion planning algorithms. It subdivides configuration space (C-space) into (possibly overlapping) regions and independently, in parallel, uses standard (sequential) sampling-based planners to construct roadmaps in each region. Next, in parallel, regional roadmaps in adjacent regions are connected to form a global roadmap. By subdividing the space and restricting the locality of connection attempts, we reduce the work and inter-processor communication associated with nearest neighbor calculation, a critical bottleneck for scalability in existing parallel motion planning methods. We show that our method is general enough to handle a variety of planning schemes, including the widely used Probabilistic Roadmap (PRM) and Rapidly-exploring Random Trees (RRT) algorithms. We compare our approach to two other existing parallel algorithms and demonstrate that our approach achieves better and more scalable performance. Our approach achieves almost linear scalability on a 2400 core LINUX cluster and on a 153,216 core Cray XE6 petascale machine. © 2012 IEEE.
Parallel algorithms for computation of the manipulator inertia matrix
Amin-Javaheri, Masoud; Orin, David E.
1989-01-01
The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.
A simple and efficient parallel FFT algorithm using the BSP model
Bisseling, R.H.; Inda, M.A.
2000-01-01
In this paper we present a new parallel radix FFT algorithm based on the BSP model Our parallel algorithm uses the groupcyclic distribution family which makes it simple to understand and easy to implement We show how to reduce the com munication cost of the algorithm by a factor of three in the case
A parallel algorithm for the non-symmetric eigenvalue problem
International Nuclear Information System (INIS)
Sidani, M.M.
1991-01-01
An algorithm is presented for the solution of the non-symmetric eigenvalue problem. The algorithm is based on a divide-and-conquer procedure that provides initial approximations to the eigenpairs, which are then refined using Newton iterations. Since the smaller subproblems can be solved independently, and since Newton iterations with different initial guesses can be started simultaneously, the algorithm - unlike the standard QR method - is ideal for parallel computers. The author also reports on his investigation of deflation methods designed to obtain further eigenpairs if needed. Numerical results from implementations on a host of parallel machines (distributed and shared-memory) are presented
Parallel clustering algorithm for large-scale biological data sets.
Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang
2014-01-01
Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Out of Core Sorting Algorithms for the Parallel Disks Model.
Kundeti, Vamsi; Rajasekaran, Sanguthevar
2011-11-01
In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However many of these merge based algorithms have large underlying constants in the time bounds, because they suffer from the lack of read parallelism on PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called the dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/O's required to sort the input close to the lower bound of [Formula: see text]. We experimentally verify our dirty sequence idea with the standard R-Way merge and show that our idea can reduce the number of parallel I/Os to sort on PDM significantly.
Efficient Parallel Algorithm For Direct Numerical Simulation of Turbulent Flows
Moitra, Stuti; Gatski, Thomas B.
1997-01-01
A distributed algorithm for a high-order-accurate finite-difference approach to the direct numerical simulation (DNS) of transition and turbulence in compressible flows is described. This work has two major objectives. The first objective is to demonstrate that parallel and distributed-memory machines can be successfully and efficiently used to solve computationally intensive and input/output intensive algorithms of the DNS class. The second objective is to show that the computational complexity involved in solving the tridiagonal systems inherent in the DNS algorithm can be reduced by algorithm innovations that obviate the need to use a parallelized tridiagonal solver.
Parallel simulated annealing algorithms for cell placement on hypercube multiprocessors
Banerjee, Prithviraj; Jones, Mark Howard; Sargent, Jeff S.
1990-01-01
Two parallel algorithms for standard cell placement using simulated annealing are developed to run on distributed-memory message-passing hypercube multiprocessors. The cells can be mapped in a two-dimensional area of a chip onto processors in an n-dimensional hypercube in two ways, such that both small and large cell exchange and displacement moves can be applied. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support the parallel cost evaluation. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. A dynamic parallel annealing schedule estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control.
Optimal Design of Passive Power Filters Based on Pseudo-parallel Genetic Algorithm
Li, Pei; Li, Hongbo; Gao, Nannan; Niu, Lin; Guo, Liangfeng; Pei, Ying; Zhang, Yanyan; Xu, Minmin; Chen, Kerui
2017-05-01
The economic costs together with filter efficiency are taken as targets to optimize the parameter of passive filter. Furthermore, the method of combining pseudo-parallel genetic algorithm with adaptive genetic algorithm is adopted in this paper. In the early stages pseudo-parallel genetic algorithm is introduced to increase the population diversity, and adaptive genetic algorithm is used in the late stages to reduce the workload. At the same time, the migration rate of pseudo-parallel genetic algorithm is improved to change with population diversity adaptively. Simulation results show that the filter designed by the proposed method has better filtering effect with lower economic cost, and can be used in engineering.
Computation of watersheds based on parallel graph algorithms
Meijster, A.; Roerdink, J.B.T.M.; Maragos, P; Schafer, RW; Butt, MA
1996-01-01
In this paper the implementation of a parallel watershed algorithm is described. The algorithm has been implemented on a Cray J932, which is a shared memory architecture with 32 processors. The watershed transform has generally been considered to be inherently sequential, but recently a few research
Optimization approaches to mpi and area merging-based parallel buffer algorithm
Directory of Open Access Journals (Sweden)
Junfu Fan
Full Text Available On buffer zone construction, the rasterization-based dilation method inevitably introduces errors, and the double-sided parallel line method involves a series of complex operations. In this paper, we proposed a parallel buffer algorithm based on area merging and MPI (Message Passing Interface to improve the performances of buffer analyses on processing large datasets. Experimental results reveal that there are three major performance bottlenecks which significantly impact the serial and parallel buffer construction efficiencies, including the area merging strategy, the task load balance method and the MPI inter-process results merging strategy. Corresponding optimization approaches involving tree-like area merging strategy, the vertex number oriented parallel task partition method and the inter-process results merging strategy were suggested to overcome these bottlenecks. Experiments were carried out to examine the performance efficiency of the optimized parallel algorithm. The estimation results suggested that the optimization approaches could provide high performance and processing ability for buffer construction in a cluster parallel environment. Our method could provide insights into the parallelization of spatial analysis algorithm.
An Alternative Algorithm for Computing Watersheds on Shared Memory Parallel Computers
Meijster, A.; Roerdink, J.B.T.M.
1995-01-01
In this paper a parallel implementation of a watershed algorithm is proposed. The algorithm can easily be implemented on shared memory parallel computers. The watershed transform is generally considered to be inherently sequential since the discrete watershed of an image is defined using recursion.
Comparative efficiencies of three parallel algorithms for nonlinear ...
Indian Academy of Sciences (India)
R. Narasimhan (Krishtel eMaging) 1461 1996 Oct 15 13:05:22
This algorithm is better suited for large size problems on coarse ... and reliable time integration algorithms for solving the second-order dynamic equilibrium equations that arise due ... Programming models required to take advantage of the parallel and distributed ..... In addition, MPI added the concept of a 'virtual topology'.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Directory of Open Access Journals (Sweden)
Jinwei Wang
2014-01-01
Full Text Available The active appearance model (AAM is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA on the Nvidia’s GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Parallel asynchronous systems and image processing algorithms
Coon, D. D.; Perera, A. G. U.
1989-01-01
A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to evey pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.
Parallel preconditioned conjugate gradient algorithm applied to neutron diffusion problem
International Nuclear Information System (INIS)
Majumdar, A.; Martin, W.R.
1992-01-01
Numerical solution of the neutron diffusion problem requires solving a linear system of equations such as Ax = b, where A is an n x n symmetric positive definite (SPD) matrix; x and b are vectors with n components. The preconditioned conjugate gradient (PCG) algorithm is an efficient iterative method for solving such a linear system of equations. In this paper, the authors describe the implementation of a parallel PCG algorithm on a shared memory machine (BBN TC2000) and on a distributed workstation (IBM RS6000) environment created by the parallel virtual machine parallelization software
A Parallel Compact Multi-Dimensional Numerical Algorithm with Aeroacoustics Applications
Povitsky, Alex; Morris, Philip J.
1999-01-01
In this study we propose a novel method to parallelize high-order compact numerical algorithms for the solution of three-dimensional PDEs (Partial Differential Equations) in a space-time domain. For this numerical integration most of the computer time is spent in computation of spatial derivatives at each stage of the Runge-Kutta temporal update. The most efficient direct method to compute spatial derivatives on a serial computer is a version of Gaussian elimination for narrow linear banded systems known as the Thomas algorithm. In a straightforward pipelined implementation of the Thomas algorithm processors are idle due to the forward and backward recurrences of the Thomas algorithm. To utilize processors during this time, we propose to use them for either non-local data independent computations, solving lines in the next spatial direction, or local data-dependent computations by the Runge-Kutta method. To achieve this goal, control of processor communication and computations by a static schedule is adopted. Thus, our parallel code is driven by a communication and computation schedule instead of the usual "creative, programming" approach. The obtained parallelization speed-up of the novel algorithm is about twice as much as that for the standard pipelined algorithm and close to that for the explicit DRP algorithm.
Interactive animation of fault-tolerant parallel algorithms
Energy Technology Data Exchange (ETDEWEB)
Apgar, S.W.
1992-02-01
Animation of algorithms makes understanding them intuitively easier. This paper describes the software tool Raft (Robust Animator of Fault Tolerant Algorithms). The Raft system allows the user to animate a number of parallel algorithms which achieve fault tolerant execution. In particular, we use it to illustrate the key Write-All problem. It has an extensive user-interface which allows a choice of the number of processors, the number of elements in the Write-All array, and the adversary to control the processor failures. The novelty of the system is that the interface allows the user to create new on-line adversaries as the algorithm executes.
Acoustic simulation in architecture with parallel algorithm
Li, Xiaohong; Zhang, Xinrong; Li, Dan
2004-03-01
In allusion to complexity of architecture environment and Real-time simulation of architecture acoustics, a parallel radiosity algorithm was developed. The distribution of sound energy in scene is solved with this method. And then the impulse response between sources and receivers at frequency segment, which are calculated with multi-process, are combined into whole frequency response. The numerical experiment shows that parallel arithmetic can improve the acoustic simulating efficiency of complex scene.
Comparison Of Hybrid Sorting Algorithms Implemented On Different Parallel Hardware Platforms
Directory of Open Access Journals (Sweden)
Dominik Zurek
2013-01-01
Full Text Available Sorting is a common problem in computer science. There are lot of well-known sorting algorithms created for sequential execution on a single processor. Recently, hardware platforms enable to create wide parallel algorithms. We have standard processors consist of multiple cores and hardware accelerators like GPU. The graphic cards with their parallel architecture give new possibility to speed up many algorithms. In this paper we describe results of implementation of a few different sorting algorithms on GPU cards and multicore processors. Then hybrid algorithm will be presented which consists of parts executed on both platforms, standard CPU and GPU.
Parallel optimization of IDW interpolation algorithm on multicore platform
Guan, Xuefeng; Wu, Huayi
2009-10-01
Due to increasing power consumption, heat dissipation, and other physical issues, the architecture of central processing unit (CPU) has been turning to multicore rapidly in recent years. Multicore processor is packaged with multiple processor cores in the same chip, which not only offers increased performance, but also presents significant challenges to application developers. As a matter of fact, in GIS field most of current GIS algorithms were implemented serially and could not best exploit the parallelism potential on such multicore platforms. In this paper, we choose Inverse Distance Weighted spatial interpolation algorithm (IDW) as an example to study how to optimize current serial GIS algorithms on multicore platform in order to maximize performance speedup. With the help of OpenMP, threading methodology is introduced to split and share the whole interpolation work among processor cores. After parallel optimization, execution time of interpolation algorithm is greatly reduced and good performance speedup is achieved. For example, performance speedup on Intel Xeon 5310 is 1.943 with 2 execution threads and 3.695 with 4 execution threads respectively. An additional output comparison between pre-optimization and post-optimization is carried out and shows that parallel optimization does to affect final interpolation result.
Massively parallel red-black algorithms for x-y-z response matrix equations
International Nuclear Information System (INIS)
Hanebutte, U.R.; Laurin-Kovitz, K.; Lewis, E.E.
1992-01-01
Recently, both discrete ordinates and spherical harmonic (S n and P n ) methods have been cast in the form of response matrices. In x-y geometry, massively parallel algorithms have been developed to solve the resulting response matrix equations on the Connection Machine family of parallel computers, the CM-2, CM-200, and CM-5. These algorithms utilize two-cycle iteration on a red-black checkerboard. In this work we examine the use of massively parallel red-black algorithms to solve response matric equations in three dimensions. This longer term objective is to utilize massively parallel algorithms to solve S n and/or P n response matrix problems. In this exploratory examination, however, we consider the simple 6 x 6 response matrices that are derivable from fine-mesh diffusion approximations in three dimensions
Optimal parallel algorithms for problems modeled by a family of intervals
Olariu, Stephan; Schwing, James L.; Zhang, Jingyuan
1992-01-01
A family of intervals on the real line provides a natural model for a vast number of scheduling and VLSI problems. Recently, a number of parallel algorithms to solve a variety of practical problems on such a family of intervals have been proposed in the literature. Computational tools are developed, and it is shown how they can be used for the purpose of devising cost-optimal parallel algorithms for a number of interval-related problems including finding a largest subset of pairwise nonoverlapping intervals, a minimum dominating subset of intervals, along with algorithms to compute the shortest path between a pair of intervals and, based on the shortest path, a parallel algorithm to find the center of the family of intervals. More precisely, with an arbitrary family of n intervals as input, all algorithms run in O(log n) time using O(n) processors in the EREW-PRAM model of computation.
Optical flow optimization using parallel genetic algorithm
Zavala-Romero, Olmo; Botella, Guillermo; Meyer-Bäse, Anke; Meyer Base, Uwe
2011-06-01
A new approach to optimize the parameters of a gradient-based optical flow model using a parallel genetic algorithm (GA) is proposed. The main characteristics of the optical flow algorithm are its bio-inspiration and robustness against contrast, static patterns and noise, besides working consistently with several optical illusions where other algorithms fail. This model depends on many parameters which conform the number of channels, the orientations required, the length and shape of the kernel functions used in the convolution stage, among many more. The GA is used to find a set of parameters which improve the accuracy of the optical flow on inputs where the ground-truth data is available. This set of parameters helps to understand which of them are better suited for each type of inputs and can be used to estimate the parameters of the optical flow algorithm when used with videos that share similar characteristics. The proposed implementation takes into account the embarrassingly parallel nature of the GA and uses the OpenMP Application Programming Interface (API) to speedup the process of estimating an optimal set of parameters. The information obtained in this work can be used to dynamically reconfigure systems, with potential applications in robotics, medical imaging and tracking.
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer
Jones, Mark Howard
1987-01-01
A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Parallel Algorithm for Wireless Data Compression and Encryption
Directory of Open Access Journals (Sweden)
Qin Jiancheng
2017-01-01
Full Text Available As the wireless network has limited bandwidth and insecure shared media, the data compression and encryption are very useful for the broadcasting transportation of big data in IoT (Internet of Things. However, the traditional techniques of compression and encryption are neither competent nor efficient. In order to solve this problem, this paper presents a combined parallel algorithm named “CZ algorithm” which can compress and encrypt the big data efficiently. CZ algorithm uses a parallel pipeline, mixes the coding of compression and encryption, and supports the data window up to 1 TB (or larger. Moreover, CZ algorithm can encrypt the big data as a chaotic cryptosystem which will not decrease the compression speed. Meanwhile, a shareware named “ComZip” is developed based on CZ algorithm. The experiment results show that ComZip in 64 b system can get better compression ratio than WinRAR and 7-zip, and it can be faster than 7-zip in the big data compression. In addition, ComZip encrypts the big data without extra consumption of computing resources.
The Primordial Soup Algorithm : a systematic approach to the specification of parallel parsers
Janssen, Wil; Janssen, W.P.M.; Poel, Mannes; Sikkel, Nicolaas; Zwiers, Jakob
1992-01-01
A general framework for parallel parsing is presented, which allows for a unitied, systematic approach to parallel parsing. The Primordial Soup Algorithm creates trees by allowing partial parse trees to combine arbitrarily. By adding constraints to the general algorithm, a large, class of parallel
Parallel Evolutionary Optimization Algorithms for Peptide-Protein Docking
Poluyan, Sergey; Ershov, Nikolay
2018-02-01
In this study we examine the possibility of using evolutionary optimization algorithms in protein-peptide docking. We present the main assumptions that reduce the docking problem to a continuous global optimization problem and provide a way of using evolutionary optimization algorithms. The Rosetta all-atom force field was used for structural representation and energy scoring. We describe the parallelization scheme and MPI/OpenMP realization of the considered algorithms. We demonstrate the efficiency and the performance for some algorithms which were applied to a set of benchmark tests.
Parallelized event chain algorithm for dense hard sphere and polymer systems
International Nuclear Information System (INIS)
Kampmann, Tobias A.; Boltz, Horst-Holger; Kierfeld, Jan
2015-01-01
We combine parallelization and cluster Monte Carlo for hard sphere systems and present a parallelized event chain algorithm for the hard disk system in two dimensions. For parallelization we use a spatial partitioning approach into simulation cells. We find that it is crucial for correctness to ensure detailed balance on the level of Monte Carlo sweeps by drawing the starting sphere of event chains within each simulation cell with replacement. We analyze the performance gains for the parallelized event chain and find a criterion for an optimal degree of parallelization. Because of the cluster nature of event chain moves massive parallelization will not be optimal. Finally, we discuss first applications of the event chain algorithm to dense polymer systems, i.e., bundle-forming solutions of attractive semiflexible polymers
Kinetic-Monte-Carlo-Based Parallel Evolution Simulation Algorithm of Dust Particles
Directory of Open Access Journals (Sweden)
Xiaomei Hu
2014-01-01
Full Text Available The evolution simulation of dust particles provides an important way to analyze the impact of dust on the environment. KMC-based parallel algorithm is proposed to simulate the evolution of dust particles. In the parallel evolution simulation algorithm of dust particles, data distribution way and communication optimizing strategy are raised to balance the load of every process and reduce the communication expense among processes. The experimental results show that the simulation of diffusion, sediment, and resuspension of dust particles in virtual campus is realized and the simulation time is shortened by parallel algorithm, which makes up for the shortage of serial computing and makes the simulation of large-scale virtual environment possible.
A Two-Pass Exact Algorithm for Selection on Parallel Disk Systems.
Mi, Tian; Rajasekaran, Sanguthevar
2013-07-01
Numerous OLAP queries process selection operations of "top N", median, "top 5%", in data warehousing applications. Selection is a well-studied problem that has numerous applications in the management of data and databases since, typically, any complex data query can be reduced to a series of basic operations such as sorting and selection. The parallel selection has also become an important fundamental operation, especially after parallel databases were introduced. In this paper, we present a deterministic algorithm Recursive Sampling Selection (RSS) to solve the exact out-of-core selection problem, which we show needs no more than (2 + ε ) passes ( ε being a very small fraction). We have compared our RSS algorithm with two other algorithms in the literature, namely, the Deterministic Sampling Selection and QuickSelect on the Parallel Disks Systems. Our analysis shows that DSS is a (2 + ε )-pass algorithm when the total number of input elements N is a polynomial in the memory size M (i.e., N = M c for some constant c ). While, our proposed algorithm RSS runs in (2 + ε ) passes without any assumptions. Experimental results indicate that both RSS and DSS outperform QuickSelect on the Parallel Disks Systems. Especially, the proposed algorithm RSS is more scalable and robust to handle big data when the input size is far greater than the core memory size, including the case of N ≫ M c .
A parallel version of a multigrid algorithm for isotropic transport equations
International Nuclear Information System (INIS)
Manteuffel, T.; McCormick, S.; Yang, G.; Morel, J.; Oliveira, S.
1994-01-01
The focus of this paper is on a parallel algorithm for solving the transport equations in a slab geometry using multigrid. The spatial discretization scheme used is a finite element method called the modified linear discontinuous (MLD) scheme. The MLD scheme represents a lumped version of the standard linear discontinuous (LD) scheme. The parallel algorithm was implemented on the Connection Machine 2 (CM2). Convergence rates and timings for this algorithm on the CM2 and Cray-YMP are shown
A Robust Parallel Algorithm for Combinatorial Compressed Sensing
Mendoza-Smith, Rodrigo; Tanner, Jared W.; Wechsung, Florian
2018-04-01
In previous work two of the authors have shown that a vector $x \\in \\mathbb{R}^n$ with at most $k Parallel-$\\ell_0$ decoding algorithm, where $\\mathrm{nnz}(A)$ denotes the number of nonzero entries in $A \\in \\mathbb{R}^{m \\times n}$. In this paper we present the Robust-$\\ell_0$ decoding algorithm, which robustifies Parallel-$\\ell_0$ when the sketch $Ax$ is corrupted by additive noise. This robustness is achieved by approximating the asymptotic posterior distribution of values in the sketch given its corrupted measurements. We provide analytic expressions that approximate these posteriors under the assumptions that the nonzero entries in the signal and the noise are drawn from continuous distributions. Numerical experiments presented show that Robust-$\\ell_0$ is superior to existing greedy and combinatorial compressed sensing algorithms in the presence of small to moderate signal-to-noise ratios in the setting of Gaussian signals and Gaussian additive noise.
Parallel computation of nondeterministic algorithms in VLSI
Energy Technology Data Exchange (ETDEWEB)
Hortensius, P D
1987-01-01
This work examines parallel VLSI implementations of nondeterministic algorithms. It is demonstrated that conventional pseudorandom number generators are unsuitable for highly parallel applications. Efficient parallel pseudorandom sequence generation can be accomplished using certain classes of elementary one-dimensional cellular automata. The pseudorandom numbers appear in parallel on each clock cycle. Extensive study of the properties of these new pseudorandom number generators is made using standard empirical random number tests, cycle length tests, and implementation considerations. Furthermore, it is shown these particular cellular automata can form the basis of efficient VLSI architectures for computations involved in the Monte Carlo simulation of both the percolation and Ising models from statistical mechanics. Finally, a variation on a Built-In Self-Test technique based upon cellular automata is presented. These Cellular Automata-Logic-Block-Observation (CALBO) circuits improve upon conventional design for testability circuitry.
When do evolutionary algorithms optimize separable functions in parallel?
DEFF Research Database (Denmark)
Doerr, Benjamin; Sudholt, Dirk; Witt, Carsten
2013-01-01
is that evolutionary algorithms make progress on all subfunctions in parallel, so that optimizing a separable function does not take not much longer than optimizing the hardest subfunction-subfunctions are optimized "in parallel." We show that this is only partially true, already for the simple (1+1) evolutionary...... algorithm ((1+1) EA). For separable functions composed of k Boolean functions indeed the optimization time is the maximum optimization time of these functions times a small O(log k) overhead. More generally, for sums of weighted subfunctions that each attain non-negative integer values less than r = o(log1...
A backtracking algorithm for the stream AND-parallel execution of logic programs
Energy Technology Data Exchange (ETDEWEB)
Somogyi, Z.; Ramamohanarao, K.; Vaghani, J. (Univ. of Melbourne, Parkville (Australia))
1988-06-01
The authors present the first backtracking algorithm for stream AND-parallel logic programs. It relies on compile-time knowledge of the data flow graph of each clause to let it figure out efficiently which goals to kill or restart when a goal fails. This crucial information, which they derive from mode declarations, was not available at compile-time in any previous stream AND-parallel system. They show that modes can increase the precision of the backtracking algorithm, though their algorithm allows this precision to be traded off against overhead on a procedure-by-procedure and call-by-call basis. The modes also allow their algorithm to handle efficiently programs that manipulate partially instantiated data structures and an important class of programs with circular dependency graphs. On code that does not need backtracking, the efficiency of their algorithm approaches that of the committed-choice languages; on code that does need backtracking its overhead is comparable to that of the independent AND-parallel backtracking algorithms.
Fast parallel algorithms for the x-ray transform and its adjoint.
Gao, Hao
2012-11-01
Iterative reconstruction methods often offer better imaging quality and allow for reconstructions with lower imaging dose than classical methods in computed tomography. However, the computational speed is a major concern for these iterative methods, for which the x-ray transform and its adjoint are two most time-consuming components. The speed issue becomes even notable for the 3D imaging such as cone beam scans or helical scans, since the x-ray transform and its adjoint are frequently computed as there is usually not enough computer memory to save the corresponding system matrix. The purpose of this paper is to optimize the algorithm for computing the x-ray transform and its adjoint, and their parallel computation. The fast and highly parallelizable algorithms for the x-ray transform and its adjoint are proposed for the infinitely narrow beam in both 2D and 3D. The extension of these fast algorithms to the finite-size beam is proposed in 2D and discussed in 3D. The CPU and GPU codes are available at https://sites.google.com/site/fastxraytransform. The proposed algorithm is faster than Siddon's algorithm for computing the x-ray transform. In particular, the improvement for the parallel computation can be an order of magnitude. The authors have proposed fast and highly parallelizable algorithms for the x-ray transform and its adjoint, which are extendable for the finite-size beam. The proposed algorithms are suitable for parallel computing in the sense that the computational cost per parallel thread is O(1).
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
2014-01-01
It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M(x)M(y)N(2)). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.
A GPU-paralleled implementation of an enhanced face recognition algorithm
Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo
2013-03-01
Face recognition algorithm based on compressed sensing and sparse representation is hotly argued in these years. The scheme of this algorithm increases recognition rate as well as anti-noise capability. However, the computational cost is expensive and has become a main restricting factor for real world applications. In this paper, we introduce a GPU-accelerated hybrid variant of face recognition algorithm named parallel face recognition algorithm (pFRA). We describe here how to carry out parallel optimization design to take full advantage of many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample size. Finally, Our pFRA, implemented with NVIDIA GPU and Computer Unified Device Architecture (CUDA) programming model, achieves a significant speedup over the traditional CPU implementations.
A new scheduling algorithm for parallel sparse LU factorization with static pivoting
Energy Technology Data Exchange (ETDEWEB)
Grigori, Laura; Li, Xiaoye S.
2002-08-20
In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of L' and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU{_}DIST are reported after applying this algorithm on real world application matrices on an IBM SP RS/6000 distributed memory machine.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Parallel genetic algorithms with migration for the hybrid flow shop scheduling problem
Directory of Open Access Journals (Sweden)
K. Belkadi
2006-01-01
Full Text Available This paper addresses scheduling problems in hybrid flow shop-like systems with a migration parallel genetic algorithm (PGA_MIG. This parallel genetic algorithm model allows genetic diversity by the application of selection and reproduction mechanisms nearer to nature. The space structure of the population is modified by dividing it into disjoined subpopulations. From time to time, individuals are exchanged between the different subpopulations (migration. Influence of parameters and dedicated strategies are studied. These parameters are the number of independent subpopulations, the interconnection topology between subpopulations, the choice/replacement strategy of the migrant individuals, and the migration frequency. A comparison between the sequential and parallel version of genetic algorithm (GA is provided. This comparison relates to the quality of the solution and the execution time of the two versions. The efficiency of the parallel model highly depends on the parameters and especially on the migration frequency. In the same way this parallel model gives a significant improvement of computational time if it is implemented on a parallel architecture which offers an acceptable number of processors (as many processors as subpopulations.
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine
Lee, C. S. G.; Lin, C. T.
1989-01-01
The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.
A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor
Rao, Hariprasad Nannapaneni
1989-01-01
The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.
Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.
2017-12-01
This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
General upper bounds on the runtime of parallel evolutionary algorithms.
Lässig, Jörg; Sudholt, Dirk
2014-01-01
We present a general method for analyzing the runtime of parallel evolutionary algorithms with spatially structured populations. Based on the fitness-level method, it yields upper bounds on the expected parallel runtime. This allows for a rigorous estimate of the speedup gained by parallelization. Tailored results are given for common migration topologies: ring graphs, torus graphs, hypercubes, and the complete graph. Example applications for pseudo-Boolean optimization show that our method is easy to apply and that it gives powerful results. In our examples the performance guarantees improve with the density of the topology. Surprisingly, even sparse topologies such as ring graphs lead to a significant speedup for many functions while not increasing the total number of function evaluations by more than a constant factor. We also identify which number of processors lead to the best guaranteed speedups, thus giving hints on how to parameterize parallel evolutionary algorithms.
Innovative Software Algorithms and Tools parallel sessions summary
International Nuclear Information System (INIS)
Gaines, Irwin
2001-01-01
A variety of results were presented in the poster and 5 parallel sessions of the Innovative Software, Algorithms and Tools (ISAT) sessions. I will briefly summarize these presentations and attempt to identify some unifying trends
A New Approach of Parallelism and Load Balance for the Apriori Algorithm
Directory of Open Access Journals (Sweden)
BOLINA, A. C.
2013-06-01
Full Text Available The main goal of data mining is to discover relevant information on digital content. The Apriori algorithm is widely used to this objective, but its sequential version has a low performance when execu- ted over large volumes of data. Among the solutions for this problem is the parallel implementation of the algorithm, and among the parallel implementations presented in the literature that based on Apriori, it highlights the DPA (Distributed Parallel Apriori [10]. This paper presents the DMTA (Distributed Multithread Apriori algorithm, which is based on DPA and exploits the parallelism level of threads in order to increase the performance. Besides, DMTA can be executed over heterogeneous hardware platform, using different number of cores. The results showed that DMTA outperforms DPA, presents load balance among processes and threads, and it is effective in current multicore architectures.
Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che
2014-01-16
To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high
An Optimal Parallel Algorithm for the Knapsack Problem Based on EREW
Institute of Scientific and Technical Information of China (English)
李肯立; 蒋盛益; 王卉; 李庆华
2003-01-01
A new parallel algorithm is proposed for the knapsack problem where the method of divide and conquer is adopted. Based on an EREW-SIMD machine with shared memory, the proposed algorithm utilizes O(2n/4)1-ε processors, 0≤ε≤1, and O(2n/2) memory to find a solution for the n-element knapsack problem in time O(2n/4(2n/4)ε). The cost of the proposed parallel algorithm is O(2n/2), which is an optimal method for solving the knapsack problem without memory conflicts and an improved result over the past researches.
On the Optimization and Parallelizing Little Algorithm for Solving the Traveling Salesman Problem
Directory of Open Access Journals (Sweden)
V. V. Vasilchikov
2016-01-01
Full Text Available The paper describes some ways to accelerate solving the NP-complete Traveling Salesman Problem. The classic Little algorithm belonging to the category of ”branch and bound methods” can solve it both for directed and undirected graphs. However, for undirected graphs its operation can be accelerated by eliminating the consideration of branches examined earlier. The paper proposes changes to be made in the key operations of the algorithm to speed up its execution. It also describes the results of an experiment that demonstrated a significant acceleration of solving the problem by using an advanced algorithm. Another way to speed up the work is to parallelize the algorithm. For problems of this kind it is difficult to break the task into a sufficient number of subtasks having comparable complexity. Their parallelism arises dynamically during the execution. For such problems, it seems reasonable to use parallel-recursive algorithms. In our case the use of the library RPM ParLib developed by the author was a good choice. It allows us to develop effective applications for parallel computing on a local network using any .NET-compatible programming language. We used C# to develop the programs. Parallel applications were developed as for basic and modified algorithms, the comparing of their speed was made. Experiments were performed for the graphs with the number of vertexes up to 45 and with the number of network computers up to 16. We also investigated the acceleration that can be achieved by parallelizing the basic Little algorithm for directed graphs. The results of these experiments are also presented in the paper.
Constraint treatment techniques and parallel algorithms for multibody dynamic analysis. Ph.D. Thesis
Chiou, Jin-Chern
1990-01-01
Computational procedures for kinematic and dynamic analysis of three-dimensional multibody dynamic (MBD) systems are developed from the differential-algebraic equations (DAE's) viewpoint. Constraint violations during the time integration process are minimized and penalty constraint stabilization techniques and partitioning schemes are developed. The governing equations of motion, a two-stage staggered explicit-implicit numerical algorithm, are treated which takes advantage of a partitioned solution procedure. A robust and parallelizable integration algorithm is developed. This algorithm uses a two-stage staggered central difference algorithm to integrate the translational coordinates and the angular velocities. The angular orientations of bodies in MBD systems are then obtained by using an implicit algorithm via the kinematic relationship between Euler parameters and angular velocities. It is shown that the combination of the present solution procedures yields a computationally more accurate solution. To speed up the computational procedures, parallel implementation of the present constraint treatment techniques, the two-stage staggered explicit-implicit numerical algorithm was efficiently carried out. The DAE's and the constraint treatment techniques were transformed into arrowhead matrices to which Schur complement form was derived. By fully exploiting the sparse matrix structural analysis techniques, a parallel preconditioned conjugate gradient numerical algorithm is used to solve the systems equations written in Schur complement form. A software testbed was designed and implemented in both sequential and parallel computers. This testbed was used to demonstrate the robustness and efficiency of the constraint treatment techniques, the accuracy of the two-stage staggered explicit-implicit numerical algorithm, and the speed up of the Schur-complement-based parallel preconditioned conjugate gradient algorithm on a parallel computer.
A scalable parallel algorithm for multiple objective linear programs
Wiecek, Malgorzata M.; Zhang, Hong
1994-01-01
This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives specially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.
A parallel algorithm for 3D particle tracking and Lagrangian trajectory reconstruction
International Nuclear Information System (INIS)
Barker, Douglas; Zhang, Yuanhui; Lifflander, Jonathan; Arya, Anshu
2012-01-01
Particle-tracking methods are widely used in fluid mechanics and multi-target tracking research because of their unique ability to reconstruct long trajectories with high spatial and temporal resolution. Researchers have recently demonstrated 3D tracking of several objects in real time, but as the number of objects is increased, real-time tracking becomes impossible due to data transfer and processing bottlenecks. This problem may be solved by using parallel processing. In this paper, a parallel-processing framework has been developed based on frame decomposition and is programmed using the asynchronous object-oriented Charm++ paradigm. This framework can be a key step in achieving a scalable Lagrangian measurement system for particle-tracking velocimetry and may lead to real-time measurement capabilities. The parallel tracking algorithm was evaluated with three data sets including the particle image velocimetry standard 3D images data set #352, a uniform data set for optimal parallel performance and a computational-fluid-dynamics-generated non-uniform data set to test trajectory reconstruction accuracy, consistency with the sequential version and scalability to more than 500 processors. The algorithm showed strong scaling up to 512 processors and no inherent limits of scalability were seen. Ultimately, up to a 200-fold speedup is observed compared to the serial algorithm when 256 processors were used. The parallel algorithm is adaptable and could be easily modified to use any sequential tracking algorithm, which inputs frames of 3D particle location data and outputs particle trajectories
Efficient Parallel Algorithms for Unsteady Incompressible Flows
Guermond, Jean-Luc; Minev, Peter D.
2013-01-01
The objective of this paper is to give an overview of recent developments on splitting schemes for solving the time-dependent incompressible Navier–Stokes equations and to discuss possible extensions to the variable density/viscosity case. A particular attention is given to algorithms that can be implemented efficiently on large parallel clusters.
Parallel Algorithms for Graph Optimization using Tree Decompositions
Energy Technology Data Exchange (ETDEWEB)
Sullivan, Blair D [ORNL; Weerapurage, Dinesh P [ORNL; Groer, Christopher S [ORNL
2012-06-01
Although many $\\cal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
Parallel-vector algorithms for particle simulations on shared-memory multiprocessors
International Nuclear Information System (INIS)
Nishiura, Daisuke; Sakaguchi, Hide
2011-01-01
Over the last few decades, the computational demands of massive particle-based simulations for both scientific and industrial purposes have been continuously increasing. Hence, considerable efforts are being made to develop parallel computing techniques on various platforms. In such simulations, particles freely move within a given space, and so on a distributed-memory system, load balancing, i.e., assigning an equal number of particles to each processor, is not guaranteed. However, shared-memory systems achieve better load balancing for particle models, but suffer from the intrinsic drawback of memory access competition, particularly during (1) paring of contact candidates from among neighboring particles and (2) force summation for each particle. Here, novel algorithms are proposed to overcome these two problems. For the first problem, the key is a pre-conditioning process during which particle labels are sorted by a cell label in the domain to which the particles belong. Then, a list of contact candidates is constructed by pairing the sorted particle labels. For the latter problem, a table comprising the list indexes of the contact candidate pairs is created and used to sum the contact forces acting on each particle for all contacts according to Newton's third law. With just these methods, memory access competition is avoided without additional redundant procedures. The parallel efficiency and compatibility of these two algorithms were evaluated in discrete element method (DEM) simulations on four types of shared-memory parallel computers: a multicore multiprocessor computer, scalar supercomputer, vector supercomputer, and graphics processing unit. The computational efficiency of a DEM code was found to be drastically improved with our algorithms on all but the scalar supercomputer. Thus, the developed parallel algorithms are useful on shared-memory parallel computers with sufficient memory bandwidth.
Parallel algorithms for testing finite state machines:Generating UIO sequences
Hierons, RM; Turker, UC
2016-01-01
This paper describes an efficient parallel algorithm that uses many-core GPUs for automatically deriving Unique Input Output sequences (UIOs) from Finite State Machines. The proposed algorithm uses the global scope of the GPU's global memory through coalesced memory access and minimises the transfer between CPU and GPU memory. The results of experiments indicate that the proposed method yields considerably better results compared to a single core UIO construction algorithm. Our algorithm is s...
Options for Parallelizing a Planning and Scheduling Algorithm
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin D.
2011-01-01
Space missions have a growing interest in putting multi-core processors onboard spacecraft. For many missions processing power significantly slows operations. We investigate how continual planning and scheduling algorithms can exploit multi-core processing and outline different potential design decisions for a parallelized planning architecture. This organization of choices and challenges helps us with an initial design for parallelizing the CASPER planning system for a mesh multi-core processor. This work extends that presented at another workshop with some preliminary results.
International Nuclear Information System (INIS)
Niknam, Mehdi; Thulasiraman, Parimala; Camorlinga, Sergio
2010-01-01
Connected component labelling is an essential step in image processing. We provide a parallel version of Suzuki's sequential connected component algorithm in order to speed up the labelling process. Also, we modify the algorithm to enable labelling gray-scale images. Due to the data dependencies in the algorithm we used a method similar to pipeline to exploit parallelism. The parallel algorithm method achieved a speedup of 2.5 for image size of 256 x 256 pixels using 4 processing threads.
Luke, Edward Allen
1993-01-01
Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Parallel Global Optimization with the Particle Swarm Algorithm (Preprint)
National Research Council Canada - National Science Library
Schutte, J. F; Reinbolt, J. A; Fregly, B. J; Haftka, R. T; George, A. D
2004-01-01
.... To obtain enhanced computational throughput and global search capability, we detail the coarse-grained parallelization of an increasingly popular global search method, the Particle Swarm Optimization (PSO) algorithm...
Efficient sequential and parallel algorithms for planted motif search.
Nicolae, Marius; Rajasekaran, Sanguthevar
2014-01-31
Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state of the art algorithms have large runtimes even for moderately challenging instances. This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/. We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can solve instances not solved by any previous algorithms.
Multirate-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform.
Tao, Liang; Kwan, Hon Keung
2012-07-01
Novel algorithms for the multirate and fast parallel implementation of the 2-D discrete Hartley transform (DHT)-based real-valued discrete Gabor transform (RDGT) and its inverse transform are presented in this paper. A 2-D multirate-based analysis convolver bank is designed for the 2-D RDGT, and a 2-D multirate-based synthesis convolver bank is designed for the 2-D inverse RDGT. The parallel channels in each of the two convolver banks have a unified structure and can apply the 2-D fast DHT algorithm to speed up their computations. The computational complexity of each parallel channel is low and is independent of the Gabor oversampling rate. All the 2-D RDGT coefficients of an image are computed in parallel during the analysis process and can be reconstructed in parallel during the synthesis process. The computational complexity and time of the proposed parallel algorithms are analyzed and compared with those of the existing fastest algorithms for 2-D discrete Gabor transforms. The results indicate that the proposed algorithms are the fastest, which make them attractive for real-time image processing.
Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh
2015-07-01
This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for current class of multicore architecture. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism however, when it comes to 3D data migration, as the data size increases the resource requirement of the algorithm also increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute intensive part of Kirchhoff depth migration algorithm is the calculation of traveltime tables due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series of supercomputers. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
Implementation of PHENIX trigger algorithms on massively parallel computers
International Nuclear Information System (INIS)
Petridis, A.N.; Wohn, F.K.
1995-01-01
The event selection requirements of contemporary high energy and nuclear physics experiments are met by the introduction of on-line trigger algorithms which identify potentially interesting events and reduce the data acquisition rate to levels that are manageable by the electronics. Such algorithms being parallel in nature can be simulated off-line using massively parallel computers. The PHENIX experiment intends to investigate the possible existence of a new phase of matter called the quark gluon plasma which has been theorized to have existed in very early stages of the evolution of the universe by studying collisions of heavy nuclei at ultra-relativistic energies. Such interactions can also reveal important information regarding the structure of the nucleus and mandate a thorough investigation of the simpler proton-nucleus collisions at the same energies. The complexity of PHENIX events and the need to analyze and also simulate them at rates similar to the data collection ones imposes enormous computation demands. This work is a first effort to implement PHENIX trigger algorithms on parallel computers and to study the feasibility of using such machines to run the complex programs necessary for the simulation of the PHENIX detector response. Fine and coarse grain approaches have been studied and evaluated. Depending on the application the performance of a massively parallel computer can be much better or much worse than that of a serial workstation. A comparison between single instruction and multiple instruction computers is also made and possible applications of the single instruction machines to high energy and nuclear physics experiments are outlined. copyright 1995 American Institute of Physics
A parallel algorithm for filtering gravitational waves from coalescing binaries
International Nuclear Information System (INIS)
Sathyaprakash, B.S.; Dhurandhar, S.V.
1992-10-01
Coalescing binary stars are perhaps the most promising sources for the observation of gravitational waves with laser interferometric gravity wave detectors. The waveform from these sources can be predicted with sufficient accuracy for matched filtering techniques to be applied. In this paper we present a parallel algorithm for detecting signals from coalescing compact binaries by the method of matched filtering. We also report the details of its implementation on a 256-node connection machine consisting of a network of transputers. The results of our analysis indicate that parallel processing is a promising approach to on-line analysis of data from gravitational wave detectors to filter out coalescing binary signals. The algorithm described is quite general in that the kernel of the algorithm is applicable to any set of matched filters. (author). 15 refs, 4 figs
BitPAl: a bit-parallel, general integer-scoring sequence alignment algorithm.
Loving, Joshua; Hernandez, Yozen; Benson, Gary
2014-11-15
Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high-efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations composed of AND, OR, XOR, complement, shift and addition. Bit-parallelism has been successfully applied to the longest common subsequence (LCS) and edit-distance problems, producing fast algorithms in practice. We have developed BitPAl, a bit-parallel algorithm for general, integer-scoring global alignment. Integer-scoring schemes assign integer weights for match, mismatch and insertion/deletion. The BitPAl method uses structural properties in the relationship between adjacent scores in the scoring matrix to construct classes of efficient algorithms, each designed for a particular set of weights. In timed tests, we show that BitPAl runs 7-25 times faster than a standard iterative algorithm. Source code is freely available for download at http://lobstah.bu.edu/BitPAl/BitPAl.html. BitPAl is implemented in C and runs on all major operating systems. jloving@bu.edu or yhernand@bu.edu or gbenson@bu.edu Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
A hybrid algorithm for parallel molecular dynamics simulations
Mangiardi, Chris M.; Meyer, R.
2017-10-01
This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
Performance modeling of parallel algorithms for solving neutron diffusion problems
International Nuclear Information System (INIS)
Azmy, Y.Y.; Kirk, B.L.
1995-01-01
Neutron diffusion calculations are the most common computational methods used in the design, analysis, and operation of nuclear reactors and related activities. Here, mathematical performance models are developed for the parallel algorithm used to solve the neutron diffusion equation on message passing and shared memory multiprocessors represented by the Intel iPSC/860 and the Sequent Balance 8000, respectively. The performance models are validated through several test problems, and these models are used to estimate the performance of each of the two considered architectures in situations typical of practical applications, such as fine meshes and a large number of participating processors. While message passing computers are capable of producing speedup, the parallel efficiency deteriorates rapidly as the number of processors increases. Furthermore, the speedup fails to improve appreciably for massively parallel computers so that only small- to medium-sized message passing multiprocessors offer a reasonable platform for this algorithm. In contrast, the performance model for the shared memory architecture predicts very high efficiency over a wide range of number of processors reasonable for this architecture. Furthermore, the model efficiency of the Sequent remains superior to that of the hypercube if its model parameters are adjusted to make its processors as fast as those of the iPSC/860. It is concluded that shared memory computers are better suited for this parallel algorithm than message passing computers
Fast parallel molecular algorithms for DNA-based computation: factoring integers.
Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui
2005-06-01
The RSA public-key cryptosystem is an algorithm that converts input data to an unrecognizable encryption and converts the unrecognizable data back into its original decryption form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that the cryptosystems using public-key are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris
2000-01-01
Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and Os, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency
Parallel algorithms for network routing problems and recurrences
International Nuclear Information System (INIS)
Wisniewski, J.A.; Sameh, A.H.
1982-01-01
In this paper, we consider the parallel solution of recurrences, and linear systems in the regular algebra of Carre. These problems are equivalent to solving the shortest path problem in graph theory, and they also arise in the analysis of Fortran programs. Our methods for solving linear systems in the regular algebra are analogues of well-known methods for solving systems of linear algebraic equations. A parallel version of Dijkstra's method, which has no linear algebraic analogue, is presented. Considerations for choosing an algorithm when the problem is large and sparse are also discussed
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.
Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F
2018-03-01
Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.
Research in Parallel Algorithms and Software for Computational Aerosciences
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Medical Image Retrieval Based On the Parallelization of the Cluster Sampling Algorithm
Ali, Hesham Arafat; Attiya, Salah; El-henawy, Ibrahim
2017-01-01
In this paper we develop parallel cluster sampling algorithms and show that a multi-chain version is embarrassingly parallel and can be used efficiently for medical image retrieval among other applications.
A new asynchronous parallel algorithm for inferring large-scale gene regulatory networks.
Directory of Open Access Journals (Sweden)
Xiangyun Xiao
Full Text Available The reconstruction of gene regulatory networks (GRNs from high-throughput experimental data has been considered one of the most important issues in systems biology research. With the development of high-throughput technology and the complexity of biological problems, we need to reconstruct GRNs that contain thousands of genes. However, when many existing algorithms are used to handle these large-scale problems, they will encounter two important issues: low accuracy and high computational cost. To overcome these difficulties, the main goal of this study is to design an effective parallel algorithm to infer large-scale GRNs based on high-performance parallel computing environments. In this study, we proposed a novel asynchronous parallel framework to improve the accuracy and lower the time complexity of large-scale GRN inference by combining splitting technology and ordinary differential equation (ODE-based optimization. The presented algorithm uses the sparsity and modularity of GRNs to split whole large-scale GRNs into many small-scale modular subnetworks. Through the ODE-based optimization of all subnetworks in parallel and their asynchronous communications, we can easily obtain the parameters of the whole network. To test the performance of the proposed approach, we used well-known benchmark datasets from Dialogue for Reverse Engineering Assessments and Methods challenge (DREAM, experimentally determined GRN of Escherichia coli and one published dataset that contains more than 10 thousand genes to compare the proposed approach with several popular algorithms on the same high-performance computing environments in terms of both accuracy and time complexity. The numerical results demonstrate that our parallel algorithm exhibits obvious superiority in inferring large-scale GRNs.
A new asynchronous parallel algorithm for inferring large-scale gene regulatory networks.
Xiao, Xiangyun; Zhang, Wei; Zou, Xiufen
2015-01-01
The reconstruction of gene regulatory networks (GRNs) from high-throughput experimental data has been considered one of the most important issues in systems biology research. With the development of high-throughput technology and the complexity of biological problems, we need to reconstruct GRNs that contain thousands of genes. However, when many existing algorithms are used to handle these large-scale problems, they will encounter two important issues: low accuracy and high computational cost. To overcome these difficulties, the main goal of this study is to design an effective parallel algorithm to infer large-scale GRNs based on high-performance parallel computing environments. In this study, we proposed a novel asynchronous parallel framework to improve the accuracy and lower the time complexity of large-scale GRN inference by combining splitting technology and ordinary differential equation (ODE)-based optimization. The presented algorithm uses the sparsity and modularity of GRNs to split whole large-scale GRNs into many small-scale modular subnetworks. Through the ODE-based optimization of all subnetworks in parallel and their asynchronous communications, we can easily obtain the parameters of the whole network. To test the performance of the proposed approach, we used well-known benchmark datasets from Dialogue for Reverse Engineering Assessments and Methods challenge (DREAM), experimentally determined GRN of Escherichia coli and one published dataset that contains more than 10 thousand genes to compare the proposed approach with several popular algorithms on the same high-performance computing environments in terms of both accuracy and time complexity. The numerical results demonstrate that our parallel algorithm exhibits obvious superiority in inferring large-scale GRNs.
Iterative schemes for parallel Sn algorithms in a shared-memory computing environment
International Nuclear Information System (INIS)
Haghighat, A.; Hunter, M.A.; Mattis, R.E.
1995-01-01
Several two-dimensional spatial domain partitioning S n transport theory algorithms are developed on the basis of different iterative schemes. These algorithms are incorporated into TWOTRAN-II and tested on the shared-memory CRAY Y-MP C90 computer. For a series of fixed-source r-z geometry homogeneous problems, it is demonstrated that the concurrent red-black algorithms may result in large parallel efficiencies (>60%) on C90. It is also demonstrated that for a realistic shielding problem, the use of the negative flux fixup causes high load imbalance, which results in a significant loss of parallel efficiency
A Fast parallel tridiagonal algorithm for a class of CFD applications
Moitra, Stuti; Sun, Xian-He
1996-01-01
The parallel diagonal dominant (PDD) algorithm is an efficient tridiagonal solver. This paper presents for study a variation of the PDD algorithm, the reduced PDD algorithm. The new algorithm maintains the minimum communication provided by the PDD algorithm, but has a reduced operation count. The PDD algorithm also has a smaller operation count than the conventional sequential algorithm for many applications. Accuracy analysis is provided for the reduced PDD algorithm for symmetric Toeplitz tridiagonal (STT) systems. Implementation results on Langley's Intel Paragon and IBM SP2 show that both the PDD and reduced PDD algorithms are efficient and scalable.
A parallel row-based algorithm for standard cell placement with integrated error control
Sargent, Jeff S.; Banerjee, Prith
1989-01-01
A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel approaches to control error in parallel cell-placement algorithms: (1) Heuristic Cell-Coloring; (2) Adaptive Sequence Length Control.
A Hybrid Parallel Preconditioning Algorithm For CFD
Barth,Timothy J.; Tang, Wei-Pai; Kwak, Dochan (Technical Monitor)
1995-01-01
A new hybrid preconditioning algorithm will be presented which combines the favorable attributes of incomplete lower-upper (ILU) factorization with the favorable attributes of the approximate inverse method recently advocated by numerous researchers. The quality of the preconditioner is adjustable and can be increased at the cost of additional computation while at the same time the storage required is roughly constant and approximately equal to the storage required for the original matrix. In addition, the preconditioning algorithm suggests an efficient and natural parallel implementation with reduced communication. Sample calculations will be presented for the numerical solution of multi-dimensional advection-diffusion equations. The matrix solver has also been embedded into a Newton algorithm for solving the nonlinear Euler and Navier-Stokes equations governing compressible flow. The full paper will show numerous examples in CFD to demonstrate the efficiency and robustness of the method.
A parallel ILP algorithm that incorporates incremental batch learning
Nuno Fonseca; Rui Camacho; Fernado Silva
2003-01-01
In this paper we tackle the problems of eciency and scala-bility faced by Inductive Logic Programming (ILP) systems. We proposethe use of parallelism to improve eciency and the use of an incrementalbatch learning to address the scalability problem. We describe a novelparallel algorithm that incorporates into ILP the method of incremen-tal batch learning. The theoretical complexity of the algorithm indicatesthat a linear speedup can be achieved.
A Globally Convergent Parallel SSLE Algorithm for Inequality Constrained Optimization
Directory of Open Access Journals (Sweden)
Zhijun Luo
2014-01-01
Full Text Available A new parallel variable distribution algorithm based on interior point SSLE algorithm is proposed for solving inequality constrained optimization problems under the condition that the constraints are block-separable by the technology of sequential system of linear equation. Each iteration of this algorithm only needs to solve three systems of linear equations with the same coefficient matrix to obtain the descent direction. Furthermore, under certain conditions, the global convergence is achieved.
Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan
2016-01-01
A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network's initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data.
Parallelization of the model-based iterative reconstruction algorithm DIRA
International Nuclear Information System (INIS)
Oertenberg, A.; Sandborg, M.; Alm Carlsson, G.; Malusek, A.; Magnusson, M.
2016-01-01
New paradigms for parallel programming have been devised to simplify software development on multi-core processors and many-core graphical processing units (GPU). Despite their obvious benefits, the parallelization of existing computer programs is not an easy task. In this work, the use of the Open Multiprocessing (OpenMP) and Open Computing Language (OpenCL) frameworks is considered for the parallelization of the model-based iterative reconstruction algorithm DIRA with the aim to significantly shorten the code's execution time. Selected routines were parallelized using OpenMP and OpenCL libraries; some routines were converted from MATLAB to C and optimised. Parallelization of the code with the OpenMP was easy and resulted in an overall speedup of 15 on a 16-core computer. Parallelization with OpenCL was more difficult owing to differences between the central processing unit and GPU architectures. The resulting speedup was substantially lower than the theoretical peak performance of the GPU; the cause was explained. (authors)
International Nuclear Information System (INIS)
Roche-Lima, Abiel; Thulasiram, Ruppa K
2012-01-01
Finite automata, in which each transition is augmented with an output label in addition to the familiar input label, are considered finite-state transducers. Transducers have been used to analyze some fundamental issues in bioinformatics. Weighted finite-state transducers have been proposed to pairwise alignments of DNA and protein sequences; as well as to develop kernels for computational biology. Machine learning algorithms for conditional transducers have been implemented and used for DNA sequence analysis. Transducer learning algorithms are based on conditional probability computation. It is calculated by using techniques, such as pair-database creation, normalization (with Maximum-Likelihood normalization) and parameters optimization (with Expectation-Maximization - EM). These techniques are intrinsically costly for computation, even worse when are applied to bioinformatics, because the databases sizes are large. In this work, we describe a parallel implementation of an algorithm to learn conditional transducers using these techniques. The algorithm is oriented to bioinformatics applications, such as alignments, phylogenetic trees, and other genome evolution studies. Indeed, several experiences were developed using the parallel and sequential algorithm on Westgrid (specifically, on the Breeze cluster). As results, we obtain that our parallel algorithm is scalable, because execution times are reduced considerably when the data size parameter is increased. Another experience is developed by changing precision parameter. In this case, we obtain smaller execution times using the parallel algorithm. Finally, number of threads used to execute the parallel algorithm on the Breezy cluster is changed. In this last experience, we obtain as result that speedup is considerably increased when more threads are used; however there is a convergence for number of threads equal to or greater than 16.
Algorithms for parallel flow solvers on message passing architectures
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those
Fast parallel DNA-based algorithms for molecular computation: the set-partition problem.
Chang, Weng-Long
2007-12-01
This paper demonstrates that basic biological operations can be used to solve the set-partition problem. In order to achieve this, we propose three DNA-based algorithms, a signed parallel adder, a signed parallel subtractor and a signed parallel comparator, that formally verify our designed molecular solutions for solving the set-partition problem.
Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms
Fidel, Adam; Jacobs, Sam Ade; Sharma, Shishir; Amato, Nancy M.; Rauchwerger, Lawrence
2014-01-01
Motion planning, which is the problem of computing feasible paths in an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, as planning time depends on region characteristics and, for most problems, the heterogeneity of the sub problems increases as the number of processors increases. In this work, we introduce two techniques to address load imbalance in the parallelization of sampling-based motion planning algorithms: an adaptive work stealing approach and bulk-synchronous redistribution. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on more than 3,000 cores. © 2014 IEEE.
Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms
Fidel, Adam
2014-05-01
Motion planning, which is the problem of computing feasible paths in an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, as planning time depends on region characteristics and, for most problems, the heterogeneity of the sub problems increases as the number of processors increases. In this work, we introduce two techniques to address load imbalance in the parallelization of sampling-based motion planning algorithms: an adaptive work stealing approach and bulk-synchronous redistribution. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on more than 3,000 cores. © 2014 IEEE.
High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm
Directory of Open Access Journals (Sweden)
Dieter Hendricks
2016-02-01
Full Text Available We implement a master-slave parallel genetic algorithm with a bespoke log-likelihood fitness function to identify emergent clusters within price evolutions. We use graphics processing units (GPUs to implement a parallel genetic algorithm and visualise the results using disjoint minimal spanning trees. We demonstrate that our GPU parallel genetic algorithm, implemented on a commercially available general purpose GPU, is able to recover stock clusters in sub-second speed, based on a subset of stocks in the South African market. This approach represents a pragmatic choice for low-cost, scalable parallel computing and is significantly faster than a prototype serial implementation in an optimised C-based fourth-generation programming language, although the results are not directly comparable because of compiler differences. Combined with fast online intraday correlation matrix estimation from high frequency data for cluster identification, the proposed implementation offers cost-effective, near-real-time risk assessment for financial practitioners.
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
On the impact of communication complexity in the design of parallel numerical algorithms
Gannon, D.; Vanrosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs
Directory of Open Access Journals (Sweden)
Vaughn Matthew
2010-11-01
Full Text Available Abstract Background Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ messages (Σ being the size of the alphabet. Results In this paper we present a Θ(n/p time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(nlog(n/BBlog(M/B (M being the main memory size and B being the size of the disk block. We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster - both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. Conclusions The bi
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.
Kundeti, Vamsi K; Rajasekaran, Sanguthevar; Dinh, Hieu; Vaughn, Matthew; Thapar, Vishal
2010-11-15
Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(nlog(n/B)Blog(M/B)) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster--both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. The bi-directed de Bruijn graph is a fundamental data structure for
Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation
Cai, Xiao-Chuan; Gropp, William D.; Keyes, David E.; Melvin, Robin G.; Young, David P.
1996-01-01
We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding continuation strategy for problems with weak shocks, is robust and, economical for this class of mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory parallel computer.
International Nuclear Information System (INIS)
Schleier, W.; Besold, G.; Heinz, K.
1992-01-01
The authors study the applicability of parallelized/vectorized Monte Carlo (MC) algorithms to the simulation of domain growth in two-dimensional lattice gas models undergoing an ordering process after a rapid quench below an order-disorder transition temperature. As examples they consider models with 2 x 1 and c(2 x 2) equilibrium superstructures on the square and rectangular lattices, respectively. They also study the case of phase separation ('1 x 1' islands) on the square lattice. A generalized parallel checkerboard algorithm for Kawasaki dynamics is shown to give rise to artificial spatial correlations in all three models. However, only if superstructure domains evolve do these correlations modify the kinetics by influencing the nucleation process and result in a reduced growth exponent compared to the value from the conventional heat bath algorithm with random single-site updates. In order to overcome these artificial modifications, two MC algorithms with a reduced degree of parallelism ('hybrid' and 'mask' algorithms, respectively) are presented and applied. As the results indicate, these algorithms are suitable for the simulation of superstructure domain growth on parallel/vector computers. 60 refs., 10 figs., 1 tab
International Nuclear Information System (INIS)
Plimpton, Steven J.; Hendrickson, Bruce; Burns, Shawn P.; McLendon, William III; Rauchwerger, Lawrence
2005-01-01
The method of discrete ordinates is commonly used to solve the Boltzmann transport equation. The solution in each ordinate direction is most efficiently computed by sweeping the radiation flux across the computational grid. For unstructured grids this poses many challenges, particularly when implemented on distributed-memory parallel machines where the grid geometry is spread across processors. We present several algorithms relevant to this approach: (a) an asynchronous message-passing algorithm that performs sweeps simultaneously in multiple ordinate directions, (b) a simple geometric heuristic to prioritize the computational tasks that a processor works on, (c) a partitioning algorithm that creates columnar-style decompositions for unstructured grids, and (d) an algorithm for detecting and eliminating cycles that sometimes exist in unstructured grids and can prevent sweeps from successfully completing. Algorithms (a) and (d) are fully parallel; algorithms (b) and (c) can be used in conjunction with (a) to achieve higher parallel efficiencies. We describe our message-passing implementations of these algorithms within a radiation transport package. Performance and scalability results are given for unstructured grids with up to 3 million elements (500 million unknowns) running on thousands of processors of Sandia National Laboratories' Intel Tflops machine and DEC-Alpha CPlant cluster
An intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-09-01
Poisson disk sampling has excellent spatial and spectral properties, and plays an important role in a variety of visual computing. Although many promising algorithms have been proposed for multidimensional sampling in euclidean space, very few studies have been reported with regard to the problem of generating Poisson disks on surfaces due to the complicated nature of the surface. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. In sharp contrast to the conventional parallel approaches, our method neither partitions the given surface into small patches nor uses any spatial data structure to maintain the voids in the sampling domain. Instead, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. Our algorithm guarantees that the generated Poisson disks are uniformly and randomly distributed without bias. It is worth noting that our method is intrinsic and independent of the embedding space. This intrinsic feature allows us to generate Poisson disk patterns on arbitrary surfaces in IR(n). To our knowledge, this is the first intrinsic, parallel, and accurate algorithm for surface Poisson disk sampling. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
Feed-forward volume rendering algorithm for moderately parallel MIMD machines
Yagel, Roni
1993-01-01
Algorithms for direct volume rendering on parallel and vector processors are investigated. Volumes are transformed efficiently on parallel processors by dividing the data into slices and beams of voxels. Equal sized sets of slices along one axis are distributed to processors. Parallelism is achieved at two levels. Because each slice can be transformed independently of others, processors transform their assigned slices with no communication, thus providing maximum possible parallelism at the first level. Within each slice, consecutive beams are incrementally transformed using coherency in the transformation computation. Also, coherency across slices can be exploited to further enhance performance. This coherency yields the second level of parallelism through the use of the vector processing or pipelining. Other ongoing efforts include investigations into image reconstruction techniques, load balancing strategies, and improving performance.
A Parallel Adaptive Particle Swarm Optimization Algorithm for Economic/Environmental Power Dispatch
Directory of Open Access Journals (Sweden)
Jinchao Li
2012-01-01
Full Text Available A parallel adaptive particle swarm optimization algorithm (PAPSO is proposed for economic/environmental power dispatch, which can overcome the premature characteristic, the slow-speed convergence in the late evolutionary phase, and lacking good direction in particles’ evolutionary process. A search population is randomly divided into several subpopulations. Then for each subpopulation, the optimal solution is searched synchronously using the proposed method, and thus parallel computing is realized. To avoid converging to a local optimum, a crossover operator is introduced to exchange the information among the subpopulations and the diversity of population is sustained simultaneously. Simulation results show that the proposed algorithm can effectively solve the economic/environmental operation problem of hydropower generating units. Performance comparisons show that the solution from the proposed method is better than those from the conventional particle swarm algorithm and other optimization algorithms.
Application of the DMRG in two dimensions: a parallel tempering algorithm
Hu, Shijie; Zhao, Jize; Zhang, Xuefeng; Eggert, Sebastian
The Density Matrix Renormalization Group (DMRG) is known to be a powerful algorithm for treating one-dimensional systems. When the DMRG is applied in two dimensions, however, the convergence becomes much less reliable and typically ''metastable states'' may appear, which are unfortunately quite robust even when keeping a very high number of DMRG states. To overcome this problem we have now successfully developed a parallel tempering DMRG algorithm. Similar to parallel tempering in quantum Monte Carlo, this algorithm allows the systematic switching of DMRG states between different model parameters, which is very efficient for solving convergence problems. Using this method we have figured out the phase diagram of the xxz model on the anisotropic triangular lattice which can be realized by hardcore bosons in optical lattices. SFB Transregio 49 of the Deutsche Forschungsgemeinschaft (DFG) and the Allianz fur Hochleistungsrechnen Rheinland-Pfalz (AHRP).
A parallel algorithm for 3D dislocation dynamics
International Nuclear Information System (INIS)
Wang Zhiqiang; Ghoniem, Nasr; Swaminarayan, Sriram; LeSar, Richard
2006-01-01
Dislocation dynamics (DD), a discrete dynamic simulation method in which dislocations are the fundamental entities, is a powerful tool for investigation of plasticity, deformation and fracture of materials at the micron length scale. However, severe computational difficulties arising from complex, long-range interactions between these curvilinear line defects limit the application of DD in the study of large-scale plastic deformation. We present here the development of a parallel algorithm for accelerated computer simulations of DD. By representing dislocations as a 3D set of dislocation particles, we show here that the problem of an interacting ensemble of dislocations can be converted to a problem of a particle ensemble, interacting with a long-range force field. A grid using binary space partitioning is constructed to keep track of node connectivity across domains. We demonstrate the computational efficiency of the parallel micro-plasticity code and discuss how O(N) methods map naturally onto the parallel data structure. Finally, we present results from applications of the parallel code to deformation in single crystal fcc metals
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms
Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel
2016-04-01
Advances in graphics processing units' technology towards encompassing parallel architectures [1], comprised of thousands of cores and multiples of parallel threads, provide the foundation in terms of hardware for the rapid processing of various parallel applications regarding seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another can be performed in parallel narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel comparatively, are the: fuzzy k-means clustering with expert knowledge [7] in assigning overall clusters' number; density-based clustering [8]; and a selves-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
Efficient sequential and parallel algorithms for finding edit distance based motifs.
Pal, Soumitra; Xiao, Peng; Rajasekaran, Sanguthevar
2016-08-18
Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable and there is a pressing need to develop efficient, exact and approximation algorithms to solve this problem. In this paper, we present several novel, exact, sequential and parallel algorithms for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l,d and n biological strings, find all strings of length l that appear in each input string with atmost d errors of types substitution, insertion and deletion. One popular technique to solve the problem is to explore for each input string the set of all possible l-mers that belong to the d-neighborhood of any substring of the input string and output those which are common for all input strings. We introduce a novel and provably efficient neighborhood exploration technique. We show that it is enough to consider the candidates in neighborhood which are at a distance exactly d. We compactly represent these candidate motifs using wildcard characters and efficiently explore them with very few repetitions. Our sequential algorithm uses a trie based data structure to efficiently store and sort the candidate motifs. Our parallel algorithm in a multi-core shared memory setting uses arrays for storing and a novel modification of radix-sort for sorting the candidate motifs. The algorithms for EMS are customarily evaluated on several challenging instances such as (8,1), (12,2), (16,3), (20,4), and so on. The best previously known algorithm, EMS1, is sequential and in estimated 3 days solves up to instance (16,3). Our sequential algorithms are more than 20 times faster on (16,3). On other hard instances such as (9,2), (11,3), (13,4), our algorithms are much faster. Our parallel algorithm has more than 600 % scaling performance while using 16 threads. Our algorithms have pushed up the state-of-the-art of EMS solvers and we believe that the techniques introduced in
An intelligent allocation algorithm for parallel processing
Carroll, Chester C.; Homaifar, Abdollah; Ananthram, Kishan G.
1988-01-01
The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of interprocessor communication network, influence the allocation. To achieve realistic estimations of the executive durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration, varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network.
Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm
Mavriplis, Dimitri J.
1999-01-01
The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model
Hamam, Alwaleed A.
2017-03-13
Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it\\'s time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.
Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model
Hamam, Alwaleed A.; Khan, Ayaz H.
2017-01-01
Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it's time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.
Optimization Algorithms for Calculation of the Joint Design Point in Parallel Systems
DEFF Research Database (Denmark)
Enevoldsen, I.; Sørensen, John Dalsgaard
1992-01-01
In large structures it is often necessary to estimate the reliability of the system by use of parallel systems. Optimality criteria-based algorithms for calculation of the joint design point in a parallel system are described and efficient active set strategies are developed. Three possible...
Eigenvalues calculation algorithms for {lambda}-modes determination. Parallelization approach
Energy Technology Data Exchange (ETDEWEB)
Vidal, V. [Universidad Politecnica de Valencia (Spain). Departamento de Sistemas Informaticos y Computacion; Verdu, G.; Munoz-Cobo, J.L. [Universidad Politecnica de Valencia (Spain). Departamento de Ingenieria Quimica y Nuclear; Ginestart, D. [Universidad Politecnica de Valencia (Spain). Departamento de Matematica Aplicada
1997-03-01
In this paper, we review two methods to obtain the {lambda}-modes of a nuclear reactor, Subspace Iteration method and Arnoldi`s method, which are popular methods to solve the partial eigenvalue problem for a given matrix. In the developed application for the neutron diffusion equation we include improved acceleration techniques for both methods. Also, we propose two parallelization approaches for these methods, a coarse grain parallelization and a fine grain one. We have tested the developed algorithms with two realistic problems, focusing on the efficiency of the methods according to the CPU times. (author).
Implementation of a parallel algorithm for spherical SN calculations on the IBM 3090
International Nuclear Information System (INIS)
Haghighat, A.; Lawrence, R.D.
1989-01-01
Parallel S N algorithms based on domain decomposition in angle are straightforward to develop in Cartesian geometry because the computation of the angular fluxes for a specific discrete ordinate can be performed independently of all other angles. This is not the case for curvilinear geometries, where the angular redistribution component of the discretized streaming operator results in coupling between angular fluxes along adjacent discrete ordinates. Previously, the authors developed a parallel algorithm for S N calculations in spherical geometry and examined its iterative convergence for criticality and detector problems with differing scattering/absorption ratios. In this paper, the authors describe the implementation of the algorithm on an IBM 3090 Model 400 (four processors) and present computational results illustrating the efficiency of the algorithm relative to serial execution
A structured representation for parallel algorithm design on multicomputers
International Nuclear Information System (INIS)
Sun, Xian-He; Ni, L.M.
1991-01-01
Traditionally, parallel algorithms have been designed by brute force methods and fine-tuned on each architecture to achieve high performance. Rather than studying the design case by case, a systematic approach is proposed. A notation is first developed. Using this notation, most of the frequently used scientific and engineering applications can be presented by simple formulas. The formulas constitute the structured representation of the corresponding applications. The structured representation is simple, adequate and easy to understand. They also contain sufficient information about uneven allocation and communication latency degradations. With the structured representation, applications can be compared, classified and partitioned. Some of the basic building blocks, called computation models, of frequently used applications are identified and studied. Most applications are combinations of some computation models. The structured representation relates general applications to computation models. Studying computation models leads to a guideline for efficient parallel algorithm design for general applications. 6 refs., 7 figs
Algorithm comparison and benchmarking using a parallel spectra transform shallow water model
Energy Technology Data Exchange (ETDEWEB)
Worley, P.H. [Oak Ridge National Lab., TN (United States); Foster, I.T.; Toonen, B. [Argonne National Lab., IL (United States)
1995-04-01
In recent years, a number of computer vendors have produced supercomputers based on a massively parallel processing (MPP) architecture. These computers have been shown to be competitive in performance with conventional vector supercomputers for some applications. As spectral weather and climate models are heavy users of vector supercomputers, it is interesting to determine how these models perform on MPPS, and which MPPs are best suited to the execution of spectral models. The benchmarking of MPPs is complicated by the fact that different algorithms may be more efficient on different architectures. Hence, a comprehensive benchmarking effort must answer two related questions: which algorithm is most efficient on each computer and how do the most efficient algorithms compare on different computers. In general, these are difficult questions to answer because of the high cost associated with implementing and evaluating a range of different parallel algorithms on each MPP platform.
Parallel algorithm for determining motion vectors in ice floe images by matching edge features
Manohar, M.; Ramapriyan, H. K.; Strong, J. P.
1988-01-01
A parallel algorithm is described to determine motion vectors of ice floes using time sequences of images of the Arctic ocean obtained from the Synthetic Aperture Radar (SAR) instrument flown on-board the SEASAT spacecraft. Researchers describe a parallel algorithm which is implemented on the MPP for locating corresponding objects based on their translationally and rotationally invariant features. The algorithm first approximates the edges in the images by polygons or sets of connected straight-line segments. Each such edge structure is then reduced to a seed point. Associated with each seed point are the descriptions (lengths, orientations and sequence numbers) of the lines constituting the corresponding edge structure. A parallel matching algorithm is used to match packed arrays of such descriptions to identify corresponding seed points in the two images. The matching algorithm is designed such that fragmentation and merging of ice floes are taken into account by accepting partial matches. The technique has been demonstrated to work on synthetic test patterns and real image pairs from SEASAT in times ranging from .5 to 0.7 seconds for 128 x 128 images.
Parallel algorithms for 2-D cylindrical transport equations of Eigenvalue problem
International Nuclear Information System (INIS)
Wei, J.; Yang, S.
2013-01-01
In this paper, aimed at the neutron transport equations of eigenvalue problem under 2-D cylindrical geometry on unstructured grid, the discrete scheme of Sn discrete ordinate and discontinuous finite is built, and the parallel computation for the scheme is realized on MPI systems. Numerical experiments indicate that the designed parallel algorithm can reach perfect speedup, it has good practicality and scalability. (authors)
Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.
Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias
2011-01-01
The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E.coli, Shigella and S.pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.
A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison.
Martins, W S; Del Cuvillo, J B; Useche, F J; Theobald, K B; Gao, G R
2001-01-01
This paper discusses the issues involved in implementing a dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform based on a fine-grain event-driven multithreaded program execution model. Fine-grain multithreading permits efficient parallelism exploitation in this application both by taking advantage of asynchronous point-to-point synchronizations and communication with low overheads and by effectively tolerating latency through the overlapping of computation and communication. We have implemented our scheme on EARTH, a fine-grain event-driven multithreaded execution and architecture model which has been ported to a number of parallel machines with off-the-shelf processors. Our experimental results show that the dynamic programming algorithm can be efficiently implemented on EARTH systems with high performance (e.g., speedup of 90 on 120 nodes), good programmability and reasonable cost.
An efficient parallel algorithm for the calculation of canonical MP2 energies.
Baker, Jon; Pulay, Peter
2002-09-01
We present the parallel version of a previous serial algorithm for the efficient calculation of canonical MP2 energies (Pulay, P.; Saebo, S.; Wolinski, K. Chem Phys Lett 2001, 344, 543). It is based on the Saebo-Almlöf direct-integral transformation, coupled with an efficient prescreening of the AO integrals. The parallel algorithm avoids synchronization delays by spawning a second set of slaves during the bin-sort prior to the second half-transformation. Results are presented for systems with up to 2000 basis functions. MP2 energies for molecules with 400-500 basis functions can be routinely calculated to microhartree accuracy on a small number of processors (6-8) in a matter of minutes with modern PC-based parallel computers. Copyright 2002 Wiley Periodicals, Inc. J Comput Chem 23: 1150-1156, 2002
Directory of Open Access Journals (Sweden)
Zhiteng Wang
2014-01-01
Full Text Available Service oriented modeling and simulation are hot issues in the field of modeling and simulation, and there is need to call service resources when simulation task workflow is running. How to optimize the service resource allocation to ensure that the task is complete effectively is an important issue in this area. In military modeling and simulation field, it is important to improve the probability of success and timeliness in simulation task workflow. Therefore, this paper proposes an optimization algorithm for multipath service resource parallel allocation, in which multipath service resource parallel allocation model is built and multiple chains coding scheme quantum optimization algorithm is used for optimization and solution. The multiple chains coding scheme quantum optimization algorithm is to extend parallel search space to improve search efficiency. Through the simulation experiment, this paper investigates the effect for the probability of success in simulation task workflow from different optimization algorithm, service allocation strategy, and path number, and the simulation result shows that the optimization algorithm for multipath service resource parallel allocation is an effective method to improve the probability of success and timeliness in simulation task workflow.
An efficient parallel algorithm for the solution of a tridiagonal linear system of equations
Stone, H. S.
1971-01-01
Tridiagonal linear systems of equations are solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computations on computers of the ILLIAC IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log sub 2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
Parallel Algorithm for Incremental Betweenness Centrality on Large Graphs
Jamour, Fuad Tarek
2017-10-17
Betweenness centrality quantifies the importance of nodes in a graph in many applications, including network analysis, community detection and identification of influential users. Typically, graphs in such applications evolve over time. Thus, the computation of betweenness centrality should be performed incrementally. This is challenging because updating even a single edge may trigger the computation of all-pairs shortest paths in the entire graph. Existing approaches cannot scale to large graphs: they either require excessive memory (i.e., quadratic to the size of the input graph) or perform unnecessary computations rendering them prohibitively slow. We propose iCentral; a novel incremental algorithm for computing betweenness centrality in evolving graphs. We decompose the graph into biconnected components and prove that processing can be localized within the affected components. iCentral is the first algorithm to support incremental betweeness centrality computation within a graph component. This is done efficiently, in linear space; consequently, iCentral scales to large graphs. We demonstrate with real datasets that the serial implementation of iCentral is up to 3.7 times faster than existing serial methods. Our parallel implementation that scales to large graphs, is an order of magnitude faster than the state-of-the-art parallel algorithm, while using an order of magnitude less computational resources.
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
In part 1 architecture of NETRA is presented. A performance evaluation of NETRA using several common vision algorithms is also presented. Performance of algorithms when they are mapped on one cluster is described. It is shown that SIMD, MIMD, and systolic algorithms can be easily mapped onto processor clusters, and almost linear speedups are possible. For some algorithms, analytical performance results are compared with implementation performance results. It is observed that the analysis is very accurate. Performance analysis of parallel algorithms when mapped across clusters is presented. Mappings across clusters illustrate the importance and use of shared as well as distributed memory in achieving high performance. The parameters for evaluation are derived from the characteristics of the parallel algorithms, and these parameters are used to evaluate the alternative communication strategies in NETRA. Furthermore, the effect of communication interference from other processors in the system on the execution of an algorithm is studied. Using the analysis, performance of many algorithms with different characteristics is presented. It is observed that if communication speeds are matched with the computation speeds, good speedups are possible when algorithms are mapped across clusters.
Directory of Open Access Journals (Sweden)
SILVA JUNIOR,J. B.
2016-12-01
Full Text Available The Intrusion Detection System (IDS needs to compare the contents of all packets arriving at the network interface with a set of signatures for indicating possible attacks, a task that consumes much CPU processing time. In order to alleviate this problem, some researchers have tried to parallelize the IDS's comparison engine, transferring execution from the CPU to GPU. This paper identifies and maps the parallelization features of the Aho-Corasick algorithm, which is used in Snort to compare patterns, in order to show this algorithm's implementation and execution issues, as well as optimization techniques for the Aho-Corasick machine. We have found 147 papers from important computer science publications databases, and have mapped them. We selected 22 and analyzed them in order to find our results. Our analysis of the papers showed, among other results, that parallelization of the AC algorithm is a new task and the authors have focused on the State Transition Table as the most common way to implement the algorithm on the GPU. Furthermore, we found that some techniques speed up the algorithm and reduce the required machine storage space are highly used, such as the algorithm running on the fastest memories and mechanisms for reducing the number of nodes and bit maping.
Parallel Quasi Newton Algorithms for Large Scale Non Linear Unconstrained Optimization
International Nuclear Information System (INIS)
Rahman, M. A.; Basarudin, T.
1997-01-01
This paper discusses about Quasi Newton (QN) method to solve non-linear unconstrained minimization problems. One of many important of QN method is choice of matrix Hk. to be positive definite and satisfies to QN method. Our interest here is the parallel QN methods which will suite for the solution of large-scale optimization problems. The QN methods became less attractive in large-scale problems because of the storage and computational requirements. How ever, it is often the case that the Hessian is space matrix. In this paper we include the mechanism of how to reduce the Hessian update and hold the Hessian properties.One major reason of our research is that the QN method may be good in solving certain type of minimization problems, but it is efficiency degenerate when is it applied to solve other category of problems. For this reason, we use an algorithm containing several direction strategies which are processed in parallel. We shall attempt to parallelized algorithm by exploring different search directions which are generated by various QN update during the minimization process. The different line search strategies will be employed simultaneously in the process of locating the minimum along each direction.The code of algorithm will be written in Occam language 2 which is run on the transputer machine
Parallelizing Gene Expression Programming Algorithm in Enabling Large-Scale Classification
Directory of Open Access Journals (Sweden)
Lixiong Xu
2017-01-01
Full Text Available As one of the most effective function mining algorithms, Gene Expression Programming (GEP algorithm has been widely used in classification, pattern recognition, prediction, and other research fields. Based on the self-evolution, GEP is able to mine an optimal function for dealing with further complicated tasks. However, in big data researches, GEP encounters low efficiency issue due to its long time mining processes. To improve the efficiency of GEP in big data researches especially for processing large-scale classification tasks, this paper presents a parallelized GEP algorithm using MapReduce computing model. The experimental results show that the presented algorithm is scalable and efficient for processing large-scale classification tasks.
Parallel-Vector Algorithm For Rapid Structural Anlysis
Agarwal, Tarun R.; Nguyen, Duc T.; Storaasli, Olaf O.
1993-01-01
New algorithm developed to overcome deficiency of skyline storage scheme by use of variable-band storage scheme. Exploits both parallel and vector capabilities of modern high-performance computers. Gives engineers and designers opportunity to include more design variables and constraints during optimization of structures. Enables use of more refined finite-element meshes to obtain improved understanding of complex behaviors of aerospace structures leading to better, safer designs. Not only attractive for current supercomputers but also for next generation of shared-memory supercomputers.
Implementation of a Monte Carlo algorithm for neutron transport on a massively parallel SIMD machine
International Nuclear Information System (INIS)
Baker, R.S.
1992-01-01
We present some results from the recent adaptation of a vectorized Monte Carlo algorithm to a massively parallel architecture. The performance of the algorithm on a single processor Cray Y-MP and a Thinking Machine Corporations CM-2 and CM-200 is compared for several test problems. The results show that significant speedups are obtainable for vectorized Monte Carlo algorithms on massively parallel machines, even when the algorithms are applied to realistic problems which require extensive variance reduction. However, the architecture of the Connection Machine does place some limitations on the regime in which the Monte Carlo algorithm may be expected to perform well
Implementation of a Monte Carlo algorithm for neutron transport on a massively parallel SIMD machine
International Nuclear Information System (INIS)
Baker, R.S.
1993-01-01
We present some results from the recent adaptation of a vectorized Monte Carlo algorithm to a massively parallel architecture. The performance of the algorithm on a single processor Cray Y-MP and a Thinking Machine Corporations CM-2 and CM-200 is compared for several test problems. The results show that significant speedups are obtainable for vectorized Monte Carlo algorithms on massively parallel machines, even when the algorithms are applied to realistic problems which require extensive variance reduction. However, the architecture of the Connection Machine does place some limitations on the regime in which the Monte Carlo algorithm may be expected to perform well. (orig.)
High-speed parallel implementation of a modified PBR algorithm on DSP-based EH topology
Rajan, K.; Patnaik, L. M.; Ramakrishna, J.
1997-08-01
Algebraic Reconstruction Technique (ART) is an age-old method used for solving the problem of three-dimensional (3-D) reconstruction from projections in electron microscopy and radiology. In medical applications, direct 3-D reconstruction is at the forefront of investigation. The simultaneous iterative reconstruction technique (SIRT) is an ART-type algorithm with the potential of generating in a few iterations tomographic images of a quality comparable to that of convolution backprojection (CBP) methods. Pixel-based reconstruction (PBR) is similar to SIRT reconstruction, and it has been shown that PBR algorithms give better quality pictures compared to those produced by SIRT algorithms. In this work, we propose a few modifications to the PBR algorithms. The modified algorithms are shown to give better quality pictures compared to PBR algorithms. The PBR algorithm and the modified PBR algorithms are highly compute intensive, Not many attempts have been made to reconstruct objects in the true 3-D sense because of the high computational overhead. In this study, we have developed parallel two-dimensional (2-D) and 3-D reconstruction algorithms based on modified PBR. We attempt to solve the two problems encountered by the PBR and modified PBR algorithms, i.e., the long computational time and the large memory requirements, by parallelizing the algorithm on a multiprocessor system. We investigate the possible task and data partitioning schemes by exploiting the potential parallelism in the PBR algorithm subject to minimizing the memory requirement. We have implemented an extended hypercube (EH) architecture for the high-speed execution of the 3-D reconstruction algorithm using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs) and dual-port random access memories (DPR) as channels between the PEs. We discuss and compare the performances of the PBR algorithm on an IBM 6000 RISC workstation, on a Silicon
Solving the Flood Propagation Problem with Newton Algorithm on Parallel Systems
Directory of Open Access Journals (Sweden)
Chefi Triki
2012-04-01
Full Text Available In this paper we propose a parallel implementation for the flood propagation method Flo2DH. The model is built on a finite element spatial approximation combined with a Newton algorithm that uses a direct LU linear solver. The parallel implementation has been developed by using the standard MPI protocol and has been tested on a set of real world problems.
International Nuclear Information System (INIS)
Taraglio, S.; Massaioli, F.
1995-08-01
A parallel implementation of a library to build and train Multi Layer Perceptrons via the Back Propagation algorithm is presented. The target machine is the SIMD massively parallel supercomputer Quadrics. Performance measures are provided on three different machines with different number of processors, for two network examples. A sample source code is given
An Improved Parallel DNA Algorithm of 3-SAT
Directory of Open Access Journals (Sweden)
Wei Liu
2007-09-01
Full Text Available There are many large-size and difficult computational problems in mathematics and computer science. For many of these problems, traditional computers cannot handle the mass of data in acceptable timeframes, which we call an NP problem. DNA computing is a means of solving a class of intractable computational problems in which the computing time grows exponentially with problem size. This paper proposes a parallel algorithm model for the universal 3-SAT problem based on the Adleman-Lipton model and applies biological operations to handling the mass of data in solution space. In this manner, we can control the run time of the algorithm to be finite and approximately constant.
Predicting mining activity with parallel genetic algorithms
Talaie, S.; Leigh, R.; Louis, S.J.; Raines, G.L.; Beyer, H.G.; O'Reilly, U.M.; Banzhaf, Arnold D.; Blum, W.; Bonabeau, C.; Cantu-Paz, E.W.; ,; ,
2005-01-01
We explore several different techniques in our quest to improve the overall model performance of a genetic algorithm calibrated probabilistic cellular automata. We use the Kappa statistic to measure correlation between ground truth data and data predicted by the model. Within the genetic algorithm, we introduce a new evaluation function sensitive to spatial correctness and we explore the idea of evolving different rule parameters for different subregions of the land. We reduce the time required to run a simulation from 6 hours to 10 minutes by parallelizing the code and employing a 10-node cluster. Our empirical results suggest that using the spatially sensitive evaluation function does indeed improve the performance of the model and our preliminary results also show that evolving different rule parameters for different regions tends to improve overall model performance. Copyright 2005 ACM.
PARALLEL ADAPTIVE MULTILEVEL SAMPLING ALGORITHMS FOR THE BAYESIAN ANALYSIS OF MATHEMATICAL MODELS
Prudencio, Ernesto; Cheung, Sai Hung
2012-01-01
In recent years, Bayesian model updating techniques based on measured data have been applied to many engineering and applied science problems. At the same time, parallel computational platforms are becoming increasingly more powerful and are being used more frequently by the engineering and scientific communities. Bayesian techniques usually require the evaluation of multi-dimensional integrals related to the posterior probability density function (PDF) of uncertain model parameters. The fact that such integrals cannot be computed analytically motivates the research of stochastic simulation methods for sampling posterior PDFs. One such algorithm is the adaptive multilevel stochastic simulation algorithm (AMSSA). In this paper we discuss the parallelization of AMSSA, formulating the necessary load balancing step as a binary integer programming problem. We present a variety of results showing the effectiveness of load balancing on the overall performance of AMSSA in a parallel computational environment.
Directory of Open Access Journals (Sweden)
Long-Hua Ma
2011-08-01
Full Text Available A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Different from existing algorithms, the updating rates of the coning and sculling compensations are unrelated with the number of the gyro incremental angle samples and the number of the accelerometer incremental velocity samples. When the output sampling rate of inertial sensors remains constant, this algorithm allows increasing the updating rate of the coning and sculling compensation, yet with more numbers of gyro incremental angle and accelerometer incremental velocity in order to improve the accuracy of system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of algorithm to meet the real-time and high precision requirements of system on the high dynamic environment, relative to the existing implemented on the DSP platform.
A parallel algorithm for transient solid dynamics simulations with contact detection
International Nuclear Information System (INIS)
Attaway, S.; Hendrickson, B.; Plimpton, S.; Gardner, D.; Vaughan, C.; Heinstein, M.; Peery, J.
1996-01-01
Solid dynamics simulations with Lagrangian finite elements are used to model a wide variety of problems, such as the calculation of impact damage to shipping containers for nuclear waste and the analysis of vehicular crashes. Using parallel computers for these simulations has been hindered by the difficulty of searching efficiently for material surface contacts in parallel. A new parallel algorithm for calculation of arbitrary material contacts in finite element simulations has been developed and implemented in the PRONTO3D transient solid dynamics code. This paper will explore some of the issues involved in developing efficient, portable, parallel finite element models for nonlinear transient solid dynamics simulations. The contact-detection problem poses interesting challenges for efficient implementation of a solid dynamics simulation on a parallel computer. The finite element mesh is typically partitioned so that each processor owns a localized region of the finite element mesh. This mesh partitioning is optimal for the finite element portion of the calculation since each processor must communicate only with the few connected neighboring processors that share boundaries with the decomposed mesh. However, contacts can occur between surfaces that may be owned by any two arbitrary processors. Hence, a global search across all processors is required at every time step to search for these contacts. Load-imbalance can become a problem since the finite element decomposition divides the volumetric mesh evenly across processors but typically leaves the surface elements unevenly distributed. In practice, these complications have been limiting factors in the performance and scalability of transient solid dynamics on massively parallel computers. In this paper the authors present a new parallel algorithm for contact detection that overcomes many of these limitations
Parallelization of MCNP4 code by using simple FORTRAN algorithms
International Nuclear Information System (INIS)
Yazid, P.I.; Takano, Makoto; Masukawa, Fumihiro; Naito, Yoshitaka.
1993-12-01
Simple FORTRAN algorithms, that rely only on open, close, read and write statements, together with disk files and some UNIX commands have been applied to parallelization of MCNP4. The code, named MCNPNFS, maintains almost all capabilities of MCNP4 in solving shielding problems. It is able to perform parallel computing on a set of any UNIX workstations connected by a network, regardless of the heterogeneity in hardware system, provided that all processors produce a binary file in the same format. Further, it is confirmed that MCNPNFS can be executed also on Monte-4 vector-parallel computer. MCNPNFS has been tested intensively by executing 5 photon-neutron benchmark problems, a spent fuel cask problem and 17 sample problems included in the original code package of MCNP4. Three different workstations, connected by a network, have been used to execute MCNPNFS in parallel. By measuring CPU time, the parallel efficiency is determined to be 58% to 99% and 86% in average. On Monte-4, MCNPNFS has been executed using 4 processors concurrently and has achieved the parallel efficiency of 79% in average. (author)
Fijany, Amir
1993-01-01
In this paper, parallel O(log n) algorithms for computation of rigid multibody dynamics are developed. These parallel algorithms are derived by parallelization of new O(n) algorithms for the problem. The underlying feature of these O(n) algorithms is a drastically different strategy for decomposition of interbody force which leads to a new factorization of the mass matrix (M). Specifically, it is shown that a factorization of the inverse of the mass matrix in the form of the Schur Complement is derived as M(exp -1) = C - B(exp *)A(exp -1)B, wherein matrices C, A, and B are block tridiagonal matrices. The new O(n) algorithm is then derived as a recursive implementation of this factorization of M(exp -1). For the closed-chain systems, similar factorizations and O(n) algorithms for computation of Operational Space Mass Matrix lambda and its inverse lambda(exp -1) are also derived. It is shown that these O(n) algorithms are strictly parallel, that is, they are less efficient than other algorithms for serial computation of the problem. But, to our knowledge, they are the only known algorithms that can be parallelized and that lead to both time- and processor-optimal parallel algorithms for the problem, i.e., parallel O(log n) algorithms with O(n) processors. The developed parallel algorithms, in addition to their theoretical significance, are also practical from an implementation point of view due to their simple architectural requirements.
Parallel algorithm for dominant points correspondences in robot binocular stereo vision
Al-Tammami, A.; Singh, B.
1993-01-01
This paper presents an algorithm to find the correspondences of points representing dominant feature in robot stereo vision. The algorithm consists of two main steps: dominant point extraction and dominant point matching. In the feature extraction phase, the algorithm utilizes the widely used Moravec Interest Operator and two other operators: the Prewitt Operator and a new operator called Gradient Angle Variance Operator. The Interest Operator in the Moravec algorithm was used to exclude featureless areas and simple edges which are oriented in the vertical, horizontal, and two diagonals. It was incorrectly detecting points on edges which are not on the four main directions (vertical, horizontal, and two diagonals). The new algorithm uses the Prewitt operator to exclude featureless areas, so that the Interest Operator is applied only on the edges to exclude simple edges and to leave interesting points. This modification speeds-up the extraction process by approximately 5 times. The Gradient Angle Variance (GAV), an operator which calculates the variance of the gradient angle in a window around the point under concern, is then applied on the interesting points to exclude the redundant ones and leave the actual dominant ones. The matching phase is performed after the extraction of the dominant points in both stereo images. The matching starts with dominant points in the left image and does a local search, looking for corresponding dominant points in the right image. The search is geometrically constrained the epipolar line of the parallel-axes stereo geometry and the maximum disparity of the application environment. If one dominant point in the right image lies in the search areas, then it is the corresponding point of the reference dominant point in the left image. A parameter provided by the GAV is thresholded and used as a rough similarity measure to select the corresponding dominant point if there is more than one point the search area. The correlation is used as
Parallel Algorithm for Solving TOV Equations for Sequence of Cold and Dense Nuclear Matter Models
Ayriyan, Alexander; Buša, Ján; Grigorian, Hovik; Poghosyan, Gevorg
2018-04-01
We have introduced parallel algorithm simulation of neutron star configurations for set of equation of state models. The performance of the parallel algorithm has been investigated for testing set of EoS models on two computational systems. It scales when using with MPI on modern CPUs and this investigation allowed us also to compare two different types of computational nodes.
Algorithms for a parallel implementation of Hidden Markov Models with a small state space
DEFF Research Database (Denmark)
Nielsen, Jesper; Sand, Andreas
2011-01-01
Two of the most important algorithms for Hidden Markov Models are the forward and the Viterbi algorithms. We show how formulating these using linear algebra naturally lends itself to parallelization. Although the obtained algorithms are slow for Hidden Markov Models with large state spaces...
D'Angelo, Gianni; Rampone, Salvatore
2014-01-01
The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it, makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However the memory and the execution time required by the running are of O(n(3)) and of O(n(5)) order, respectively, and so, the algorithm is unaffordable for huge data sets. We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of
Optimization of tokamak plasma equilibrium shape using parallel genetic algorithms
International Nuclear Information System (INIS)
Zhulin An; Bin Wu; Lijian Qiu
2006-01-01
In the device of non-circular cross sectional tokamaks, the plasma equilibrium shape has a strong influence on the confinement and MHD stability. The plasma equilibrium shape is determined by the configuration of the poloidal field (PF) system. Usually there are many PF systems that could support the specified plasma equilibrium, the differences are the number of coils used, their positions, sizes and currents. It is necessary to find the optimal choice that meets the engineering constrains, which is often done by a constrained optimization. The Genetic Algorithms (GAs) based method has been used to solve the problem of the optimization, but the time complexity limits the algorithms to become widely used. Due to the large search space that the optimization has, it takes several hours to get a nice result. The inherent parallelism in GAs can be exploited to enhance their search efficiency. In this paper, we introduce a parallel genetic algorithms (PGAs) based approach which can reduce the computational time. The algorithm has a master-slave structure, the slave explore the search space separately and return the results to the master. A program is also developed, and it can be running on any computers which support massage passing interface. Both the algorithm and the program are detailed discussed in the paper. We also include an application that uses the program to determine the positions and currents of PF coils in EAST. The program reach the target value within half an hour and yield a speedup rate of 5.21 on 8 CPUs. (author)
International Nuclear Information System (INIS)
Bastiens, K.; Lemahieu, I.
1994-01-01
The application of a maximum entropy reconstruction algorithm to PET images requires a lot of computing resources. A parallel implementation could seriously reduce the execution time. However, programming a parallel application is still a non trivial task, needing specialized people. In this paper a programming environment based on a visual programming language is used for a parallel implementation of the reconstruction algorithm. This programming environment allows less experienced programmers to use the performance of multiprocessor systems. (authors)
Parallel Algorithm of Geometrical Hashing Based on NumPy Package and Processes Pool
Directory of Open Access Journals (Sweden)
Klyachin Vladimir Aleksandrovich
2015-10-01
Full Text Available The article considers the problem of multi-dimensional geometric hashing. The paper describes a mathematical model of geometric hashing and considers an example of its use in localization problems for the point. A method of constructing the corresponding hash matrix by parallel algorithm is considered. In this paper an algorithm of parallel geometric hashing using a development pattern «pool processes» is proposed. The implementation of the algorithm is executed using the Python programming language and NumPy package for manipulating multidimensional data. To implement the process pool it is proposed to use a class Process Pool Executor imported from module concurrent.futures, which is included in the distribution of the interpreter Python since version 3.2. All the solutions are presented in the paper by corresponding UML class diagrams. Designed GeomNash package includes classes Data, Result, GeomHash, Job. The results of the developed program presents the corresponding graphs. Also, the article presents the theoretical justification for the application process pool for the implementation of parallel algorithms. It is obtained condition t2 > (p/(p-1*t1 of the appropriateness of process pool. Here t1 - the time of transmission unit of data between processes, and t2 - the time of processing unit data by one processor.
Cao, Jianfang; Chen, Lichao; Wang, Min; Tian, Yun
2018-01-01
The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach system speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance.
Parallel pipeline algorithm of real time star map preprocessing
Wang, Hai-yong; Qin, Tian-mu; Liu, Jia-qi; Li, Zhi-feng; Li, Jian-hua
2016-03-01
To improve the preprocessing speed of star map and reduce the resource consumption of embedded system of star tracker, a parallel pipeline real-time preprocessing algorithm is presented. The two characteristics, the mean and the noise standard deviation of the background gray of a star map, are firstly obtained dynamically by the means that the intervene of the star image itself to the background is removed in advance. The criterion on whether or not the following noise filtering is needed is established, then the extraction threshold value is assigned according to the level of background noise, so that the centroiding accuracy is guaranteed. In the processing algorithm, as low as two lines of pixel data are buffered, and only 100 shift registers are used to record the connected domain label, by which the problems of resources wasting and connected domain overflow are solved. The simulating results show that the necessary data of the selected bright stars could be immediately accessed in a delay time as short as 10us after the pipeline processing of a 496×496 star map in 50Mb/s is finished, and the needed memory and registers resource total less than 80kb. To verify the accuracy performance of the algorithm proposed, different levels of background noise are added to the processed ideal star map, and the statistic centroiding error is smaller than 1/23 pixel under the condition that the signal to noise ratio is greater than 1. The parallel pipeline algorithm of real time star map preprocessing helps to increase the data output speed and the anti-dynamic performance of star tracker.
PARALLEL ALGORITHM FOR THREE-DIMENSIONAL STOKES FLOW SIMULATION USING BOUNDARY ELEMENT METHOD
Directory of Open Access Journals (Sweden)
D. G. Pribytok
2016-01-01
Full Text Available Parallel computing technique for modeling three-dimensional viscous flow (Stokes flow using direct boundary element method is presented. The problem is solved in three phases: sampling and construction of system of linear algebraic equations (SLAE, its decision and finding the velocity of liquid at predetermined points. For construction of the system and finding the velocity, the parallel algorithms using graphics CUDA cards programming technology have been developed and implemented. To solve the system of linear algebraic equations the implemented software libraries are used. A comparison of time consumption for three main algorithms on the example of calculation of viscous fluid motion in three-dimensional cavity is performed.
Energy Technology Data Exchange (ETDEWEB)
Bastiens, K; Lemahieu, I [University of Ghent - ELIS Department, St. Pietersnieuwstraat 41, B-9000 Ghent (Belgium)
1994-12-31
The application of a maximum entropy reconstruction algorithm to PET images requires a lot of computing resources. A parallel implementation could seriously reduce the execution time. However, programming a parallel application is still a non trivial task, needing specialized people. In this paper a programming environment based on a visual programming language is used for a parallel implementation of the reconstruction algorithm. This programming environment allows less experienced programmers to use the performance of multiprocessor systems. (authors). 8 refs, 3 figs, 1 tab.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
Lohn, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris; Norvig, Peter (Technical Monitor)
2000-01-01
We describe a parallel genetic algorithm (GA) that automatically generates circuit designs using evolutionary search. A circuit-construction programming language is introduced and we show how evolution can generate practical analog circuit designs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. We present experimental results as applied to analog filter and amplifier design tasks.
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws
Cooke, Daniel; Rushton, Nelson
2013-01-01
With the introduction of new parallel architectures like the cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for highend computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language that is, a programming language that is closer to a human s way of thinking than to a machine s. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequen tial/singlecore code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify- Produce (CSP) and Normalize-Trans - pose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less
International Nuclear Information System (INIS)
Waintraub, Marcel; Pereira, Claudio M.N.A.; Baptista, Rafael P.
2005-01-01
This work presents the development of a distributed parallel genetic algorithm applied to a nuclear reactor core design optimization. In the implementation of the parallelism, a 'Message Passing Interface' (MPI) library, standard for parallel computation in distributed memory platforms, has been used. Another important characteristic of MPI is its portability for various architectures. The main objectives of this paper are: validation of the results obtained by the application of this algorithm in a nuclear reactor core optimization problem, through comparisons with previous results presented by Pereira et al.; and performance test of the Brazilian Nuclear Engineering Institute (IEN) cluster in reactors physics optimization problems. The experiments demonstrated that the developed parallel genetic algorithm using the MPI library presented significant gains in the obtained results and an accentuated reduction of the processing time. Such results ratify the use of the parallel genetic algorithms for the solution of nuclear reactor core optimization problems. (author)
Sargent, Jeff Scott
1988-01-01
A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel new approaches to controlling error in parallel cell-placement algorithms; Heuristic Cell-Coloring and Adaptive (Parallel Move) Sequence Control. Heuristic Cell-Coloring identifies sets of noninteracting cells that can be moved repeatedly, and in parallel, with no buildup of error in the placement cost. Adaptive Sequence Control allows multiple parallel cell moves to take place between global cell-position updates. This feedback mechanism is based on an error bound derived analytically from the traditional annealing move-acceptance profile. Placement results are presented for real industry circuits and the performance is summarized of an implementation on the Intel iPSC/2 Hypercube. The runtime of this algorithm is 5 to 16 times faster than a previous program developed for the Hypercube, while producing equivalent quality placement. An integrated place and route program for the Intel iPSC/2 Hypercube is currently being developed.
Variation in efficiency of parallel algorithms. [for study of stiffness matrices in planar trusses
Hayashi, A.; Melosh, R. J.; Utku, S.; Salama, M.
1985-01-01
The present study has the objective to investigate some iterative parallel-processor linear equation solving algorithms with respect to efficiency for analyses of typical linear engineering systems. Attention is given to a set of n linear equations, Ku = p, where K = an n x n positive definite, sparsely populated, symmetric matrix, u = an n x 1 vector of unknown responses, and p = an n x 1 vector of prescribed constants. This study is concerned with a hybrid method in which iteration is used to solve the problem, while a direct method is used on the local processor level. Variations in the efficiency of parallel algorithms are explored. Measures of the efficiency are based on computer experiments regarding the algorithms. For all the algorithms, the wall clock time is found to decrease as the number of processors increases.
An Intrinsic Algorithm for Parallel Poisson Disk Sampling on Arbitrary Surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-03-08
Poisson disk sampling plays an important role in a variety of visual computing, due to its useful statistical property in distribution and the absence of aliasing artifacts. While many effective techniques have been proposed to generate Poisson disk distribution in Euclidean space, relatively few work has been reported to the surface counterpart. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. We propose a new technique for parallelizing the dart throwing. Rather than the conventional approaches that explicitly partition the spatial domain to generate the samples in parallel, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. It is worth noting that our algorithm is accurate as the generated Poisson disks are uniformly and randomly distributed without bias. Our method is intrinsic in that all the computations are based on the intrinsic metric and are independent of the embedding space. This intrinsic feature allows us to generate Poisson disk distributions on arbitrary surfaces. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
A Parallel, Finite-Volume Algorithm for Large-Eddy Simulation of Turbulent Flows
Bui, Trong T.
1999-01-01
A parallel, finite-volume algorithm has been developed for large-eddy simulation (LES) of compressible turbulent flows. This algorithm includes piecewise linear least-square reconstruction, trilinear finite-element interpolation, Roe flux-difference splitting, and second-order MacCormack time marching. Parallel implementation is done using the message-passing programming model. In this paper, the numerical algorithm is described. To validate the numerical method for turbulence simulation, LES of fully developed turbulent flow in a square duct is performed for a Reynolds number of 320 based on the average friction velocity and the hydraulic diameter of the duct. Direct numerical simulation (DNS) results are available for this test case, and the accuracy of this algorithm for turbulence simulations can be ascertained by comparing the LES solutions with the DNS results. The effects of grid resolution, upwind numerical dissipation, and subgrid-scale dissipation on the accuracy of the LES are examined. Comparison with DNS results shows that the standard Roe flux-difference splitting dissipation adversely affects the accuracy of the turbulence simulation. For accurate turbulence simulations, only 3-5 percent of the standard Roe flux-difference splitting dissipation is needed.
Ferrucci, Filomena; Salza, Pasquale; Sarro, Federica
2017-06-29
The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store
Massively parallel algorithms for trace-driven cache simulations
Nicol, David M.; Greenberg, Albert G.; Lubachevsky, Boris D.
1991-01-01
Trace driven cache simulation is central to computer design. A trace is a very long sequence of reference lines from main memory. At the t(exp th) instant, reference x sub t is hashed into a set of cache locations, the contents of which are then compared with x sub t. If at the t sup th instant x sub t is not present in the cache, then it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x sub t present for the (t+1) sup st instant. The problem of parallel simulation of a subtrace of N references directed to a C line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least Recently Used (LRU) policy, which regradless of the set size C runs in time O(log N) using N processors on the exclusive read, exclusive write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. Timings are presented of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference based line replacement policies are considered, which includes LRU as well as the Least Frequently Used and Random replacement policies. A simulation method is presented for any such policy that on any trace of length N directed to a C line set runs in the O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
International Nuclear Information System (INIS)
Azmy, Y.Y.; Kirk, B.L.
1990-01-01
Modern parallel computer architectures offer an enormous potential for reducing CPU and wall-clock execution times of large-scale computations commonly performed in various applications in science and engineering. Recently, several authors have reported their efforts in developing and implementing parallel algorithms for solving the neutron diffusion equation on a variety of shared- and distributed-memory parallel computers. Testing of these algorithms for a variety of two- and three-dimensional meshes showed significant speedup of the computation. Even for very large problems (i.e., three-dimensional fine meshes) executed concurrently on a few nodes in serial (nonvector) mode, however, the measured computational efficiency is very low (40 to 86%). In this paper, the authors present a highly efficient (∼85 to 99.9%) algorithm for solving the two-dimensional nodal diffusion equations on the Sequent Balance 8000 parallel computer. Also presented is a model for the performance, represented by the efficiency, as a function of problem size and the number of participating processors. The model is validated through several tests and then extrapolated to larger problems and more processors to predict the performance of the algorithm in more computationally demanding situations
Energy Technology Data Exchange (ETDEWEB)
Carey, G.F.; Young, D.M.
1993-12-31
The program outlined here is directed to research on methods, algorithms, and software for distributed parallel supercomputers. Of particular interest are finite element methods and finite difference methods together with sparse iterative solution schemes for scientific and engineering computations of very large-scale systems. Both linear and nonlinear problems will be investigated. In the nonlinear case, applications with bifurcation to multiple solutions will be considered using continuation strategies. The parallelizable numerical methods of particular interest are a family of partitioning schemes embracing domain decomposition, element-by-element strategies, and multi-level techniques. The methods will be further developed incorporating parallel iterative solution algorithms with associated preconditioners in parallel computer software. The schemes will be implemented on distributed memory parallel architectures such as the CRAY MPP, Intel Paragon, the NCUBE3, and the Connection Machine. We will also consider other new architectures such as the Kendall-Square (KSQ) and proposed machines such as the TERA. The applications will focus on large-scale three-dimensional nonlinear flow and reservoir problems with strong convective transport contributions. These are legitimate grand challenge class computational fluid dynamics (CFD) problems of significant practical interest to DOE. The methods developed and algorithms will, however, be of wider interest.
First massively parallel algorithm to be implemented in Apollo-II code
International Nuclear Information System (INIS)
Stankovski, Z.
1994-01-01
The collision probability (CP) method in neutron transport, as applied to arbitrary 2D XY geometries, like the TDT module in APOLLO-II, is very time consuming. Consequently RZ or 3D extensions became prohibitive. Fortunately, this method is very suitable for parallelization. Massively parallel computer architectures, especially MIMD machines, bring a new breath to this method. In this paper we present a CM5 implementation of the CP method. Parallelization is applied to the energy groups, using the CMMD message passing library. In our case we use 32 processors for the standard 99-group APOLLIB-II library. The real advantage of this algorithm will appear in the calculation of the future fine multigroup library (about 8000 groups) of the SAPHYR project with a massively parallel computer (to the order of hundreds of processors). (author). 3 tabs., 4 figs., 4 refs
First massively parallel algorithm to be implemented in APOLLO-II code
International Nuclear Information System (INIS)
Stankovski, Z.
1994-01-01
The collision probability method in neutron transport, as applied to arbitrary 2-dimensional geometries, like the two dimensional transport module in APOLLO-II is very time consuming. Consequently 3-dimensional extension became prohibitive. Fortunately, this method is very suitable for parallelization. Massively parallel computer architectures, especially MIMD machines, bring a new breath to this method. In this paper we present a CM5 implementation of the collision probability method. Parallelization is applied to the energy groups, using the CMMD massage passing library. In our case we used 32 processors for the standard 99-group APOLLIB-II library. The real advantage of this algorithm will appear in the calculation of the future multigroup library (about 8000 groups) of the SAPHYR project with a massively parallel computer (to the order of hundreds of processors). (author). 4 refs., 4 figs., 3 tabs
A parallel algorithm for solving the integral form of the discrete ordinates equations
International Nuclear Information System (INIS)
Zerr, R. J.; Azmy, Y. Y.
2009-01-01
The integral form of the discrete ordinates equations involves a system of equations that has a large, dense coefficient matrix. The serial construction methodology is presented and properties that affect the execution times to construct and solve the system are evaluated. Two approaches for massively parallel implementation of the solution algorithm are proposed and the current results of one of these are presented. The system of equations May be solved using two parallel solvers-block Jacobi and conjugate gradient. Results indicate that both methods can reduce overall wall-clock time for execution. The conjugate gradient solver exhibits better performance to compete with the traditional source iteration technique in terms of execution time and scalability. The parallel conjugate gradient method is synchronous, hence it does not increase the number of iterations for convergence compared to serial execution, and the efficiency of the algorithm demonstrates an apparent asymptotic decline. (authors)
A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC
Directory of Open Access Journals (Sweden)
Yun-gang Xue
2017-01-01
Full Text Available We propose a highly parallel and scalable motion estimation algorithm, named multilevel resolution motion estimation (MLRME for short, by combining the advantages of local full search and downsampling. By subsampling a video frame, a large amount of computation is saved. While using the local full-search method, it can exploit massive parallelism and make full use of the powerful modern many-core accelerators, such as GPU and Intel Xeon Phi. We implanted the proposed MLRME into HM12.0, and the experimental results showed that the encoding quality of the MLRME method is close to that of the fast motion estimation in HEVC, which declines by less than 1.5%. We also implemented the MLRME with CUDA, which obtained 30–60x speed-up compared to the serial algorithm on single CPU. Specifically, the parallel implementation of MLRME on a GTX 460 GPU can meet the real-time coding requirement with about 25 fps for the 2560×1600 video format, while, for 832×480, the performance is more than 100 fps.
Directory of Open Access Journals (Sweden)
Wensheng Guo
Full Text Available In biological systems, the dynamic analysis method has gained increasing attention in the past decade. The Boolean network is the most common model of a genetic regulatory network. The interactions of activation and inhibition in the genetic regulatory network are modeled as a set of functions of the Boolean network, while the state transitions in the Boolean network reflect the dynamic property of a genetic regulatory network. A difficult problem for state transition analysis is the finding of attractors. In this paper, we modeled the genetic regulatory network as a Boolean network and proposed a solving algorithm to tackle the attractor finding problem. In the proposed algorithm, we partitioned the Boolean network into several blocks consisting of the strongly connected components according to their gradients, and defined the connection between blocks as decision node. Based on the solutions calculated on the decision nodes and using a satisfiability solving algorithm, we identified the attractors in the state transition graph of each block. The proposed algorithm is benchmarked on a variety of genetic regulatory networks. Compared with existing algorithms, it achieved similar performance on small test cases, and outperformed it on larger and more complex ones, which happens to be the trend of the modern genetic regulatory network. Furthermore, while the existing satisfiability-based algorithms cannot be parallelized due to their inherent algorithm design, the proposed algorithm exhibits a good scalability on parallel computing architectures.
Fast parallel tracking algorithm for the muon detector of the CBM experiment at FAIR
International Nuclear Information System (INIS)
Lebedev, A.; Hoehne, C.; Kisel', I.; Ososkov, G.
2010-01-01
Particle trajectory recognition is an important and challenging task in the Compressed Baryonic Matter (CBM) experiment at the future FAIR accelerator at Darmstadt. The tracking algorithms have to process terabytes of input data produced in particle collisions. Therefore, the speed of the tracking software is extremely important for data analysis. In this contribution, a fast parallel track reconstruction algorithm, which uses available features of modern processors is presented. These features comprise a SIMD instruction set (SSE) and multithreading. The first allows one to pack several data items into one register and to operate on all of them in parallel thus achieving more operations per cycle. The second feature enables the routines to exploit all available CPU cores and hardware threads. This parallel version of the tracking algorithm has been compared to the initial serial scalar version which uses a similar approach for tracking. A speed-upfactor of 487 was achieved (from 730 to 1.5 ms/event) for a computer with 2 x Intel Core 17 processors at 2.66 GHz
Directory of Open Access Journals (Sweden)
Helio Yochihiro Fuchigami
2014-08-01
Full Text Available This article addresses the problem of minimizing makespan on two parallel flow shops with proportional processing and setup times. The setup times are separated and sequence-independent. The parallel flow shop scheduling problem is a specific case of well-known hybrid flow shop, characterized by a multistage production system with more than one machine working in parallel at each stage. This situation is very common in various kinds of companies like chemical, electronics, automotive, pharmaceutical and food industries. This work aimed to propose six Simulated Annealing algorithms, their perturbation schemes and an algorithm for initial sequence generation. This study can be classified as “applied research” regarding the nature, “exploratory” about the objectives and “experimental” as to procedures, besides the “quantitative” approach. The proposed algorithms were effective regarding the solution and computationally efficient. Results of Analysis of Variance (ANOVA revealed no significant difference between the schemes in terms of makespan. It’s suggested the use of PS4 scheme, which moves a subsequence of jobs, for providing the best percentage of success. It was also found that there is a significant difference between the results of the algorithms for each value of the proportionality factor of the processing and setup times of flow shops.
From Massively Parallel Algorithms and Fluctuating Time Horizons to Nonequilibrium Surface Growth
International Nuclear Information System (INIS)
Korniss, G.; Toroczkai, Z.; Novotny, M. A.; Rikvold, P. A.
2000-01-01
We study the asymptotic scaling properties of a massively parallel algorithm for discrete-event simulations where the discrete events are Poisson arrivals. The evolution of the simulated time horizon is analogous to a nonequilibrium surface. Monte Carlo simulations and a coarse-grained approximation indicate that the macroscopic landscape in the steady state is governed by the Edwards-Wilkinson Hamiltonian. Since the efficiency of the algorithm corresponds to the density of local minima in the associated surface, our results imply that the algorithm is asymptotically scalable. (c) 2000 The American Physical Society
O'Keeffe, C J; Ren, Ruichao; Orkoulas, G
2007-11-21
Spatial updating grand canonical Monte Carlo algorithms are generalizations of random and sequential updating algorithms for lattice systems to continuum fluid models. The elementary steps, insertions or removals, are constructed by generating points in space either at random (random updating) or in a prescribed order (sequential updating). These algorithms have previously been developed only for systems of impenetrable spheres for which no particle overlap occurs. In this work, spatial updating grand canonical algorithms are generalized to continuous, soft-core potentials to account for overlapping configurations. Results on two- and three-dimensional Lennard-Jones fluids indicate that spatial updating grand canonical algorithms, both random and sequential, converge faster than standard grand canonical algorithms. Spatial algorithms based on sequential updating not only exhibit the fastest convergence but also are ideal for parallel implementation due to the absence of strict detailed balance and the nature of the updating that minimizes interprocessor communication. Parallel simulation results for three-dimensional Lennard-Jones fluids show a substantial reduction of simulation time for systems of moderate and large size. The efficiency improvement by parallel processing through domain decomposition is always in addition to the efficiency improvement by sequential updating.
Zhang, Lei; Yang, Fengbao; Ji, Linna; Lv, Sheng
2018-01-01
Diverse image fusion methods perform differently. Each method has advantages and disadvantages compared with others. One notion is that the advantages of different image methods can be effectively combined. A multiple-algorithm parallel fusion method based on algorithmic complementarity and synergy is proposed. First, in view of the characteristics of the different algorithms and difference-features among images, an index vector-based feature-similarity is proposed to define the degree of complementarity and synergy. This proposed index vector is a reliable evidence indicator for algorithm selection. Second, the algorithms with a high degree of complementarity and synergy are selected. Then, the different degrees of various features and infrared intensity images are used as the initial weights for the nonnegative matrix factorization (NMF). This avoids randomness of the NMF initialization parameter. Finally, the fused images of different algorithms are integrated using the NMF because of its excellent data fusing performance on independent features. Experimental results demonstrate that the visual effect and objective evaluation index of the fused images obtained using the proposed method are better than those obtained using traditional methods. The proposed method retains all the advantages that individual fusion algorithms have.
Parallel Algorithm for Adaptive Numerical Integration
International Nuclear Information System (INIS)
Sujatmiko, M.; Basarudin, T.
1997-01-01
This paper presents an automation algorithm for integration using adaptive trapezoidal method. The interval is adaptively divided where the width of sub interval are different and fit to the behavior of its function. For a function f, an integration on interval [a,b] can be obtained, with maximum tolerance ε, using estimation (f, a, b, ε). The estimated solution is valid if the error is still in a reasonable range, fulfil certain criteria. If the error is big, however, the problem is solved by dividing it into to similar and independent sub problem on to separate [a, (a+b)/2] and [(a+b)/2, b] interval, i. e. ( f, a, (a+b)/2, ε/2) and (f, (a+b)/2, b, ε/2) estimations. The problems are solved in two different kinds of processor, root processor and worker processor. Root processor function ti divide a main problem into sub problems and distribute them to worker processor. The division mechanism may go further until all of the sub problem are resolved. The solution of each sub problem is then submitted to the root processor such that the solution for the main problem can be obtained. The algorithm is implemented on C-programming-base distributed computer networking system under parallel virtual machine platform
Massively parallel performance of neutron transport response matrix algorithms
International Nuclear Information System (INIS)
Hanebutte, U.R.; Lewis, E.E.
1993-01-01
Massively parallel red/black response matrix algorithms for the solution of within-group neutron transport problems are implemented on the Connection Machines-2, 200 and 5. The response matrices are dericed from the diamond-differences and linear-linear nodal discrete ordinate and variational nodal P 3 approximations. The unaccelerated performance of the iterative procedure is examined relative to the maximum rated performances of the machines. The effects of processor partitions size, of virtual processor ratio and of problems size are examined in detail. For the red/black algorithm, the ratio of inter-node communication to computing times is found to be quite small, normally of the order of ten percent or less. Performance increases with problems size and with virtual processor ratio, within the memeory per physical processor limitation. Algorithm adaptation to courser grain machines is straight-forward, with total computing time being virtually inversely proportional to the number of physical processors. (orig.)
A new parallelization algorithm of ocean model with explicit scheme
Fu, X. D.
2017-08-01
This paper will focus on the parallelization of ocean model with explicit scheme which is one of the most commonly used schemes in the discretization of governing equation of ocean model. The characteristic of explicit schema is that calculation is simple, and that the value of the given grid point of ocean model depends on the grid point at the previous time step, which means that one doesn’t need to solve sparse linear equations in the process of solving the governing equation of the ocean model. Aiming at characteristics of the explicit scheme, this paper designs a parallel algorithm named halo cells update with tiny modification of original ocean model and little change of space step and time step of the original ocean model, which can parallelize ocean model by designing transmission module between sub-domains. This paper takes the GRGO for an example to implement the parallelization of GRGO (Global Reduced Gravity Ocean model) with halo update. The result demonstrates that the higher speedup can be achieved at different problem size.
A novel highly parallel algorithm for linearly unmixing hyperspectral images
Guerra, Raúl; López, Sebastián.; Callico, Gustavo M.; López, Jose F.; Sarmiento, Roberto
2014-10-01
Endmember extraction and abundances calculation represent critical steps within the process of linearly unmixing a given hyperspectral image because of two main reasons. The first one is due to the need of computing a set of accurate endmembers in order to further obtain confident abundance maps. The second one refers to the huge amount of operations involved in these time-consuming processes. This work proposes an algorithm to estimate the endmembers of a hyperspectral image under analysis and its abundances at the same time. The main advantage of this algorithm is its high parallelization degree and the mathematical simplicity of the operations implemented. This algorithm estimates the endmembers as virtual pixels. In particular, the proposed algorithm performs the descent gradient method to iteratively refine the endmembers and the abundances, reducing the mean square error, according with the linear unmixing model. Some mathematical restrictions must be added so the method converges in a unique and realistic solution. According with the algorithm nature, these restrictions can be easily implemented. The results obtained with synthetic images demonstrate the well behavior of the algorithm proposed. Moreover, the results obtained with the well-known Cuprite dataset also corroborate the benefits of our proposal.
International Nuclear Information System (INIS)
Pedron, Antoine
2013-01-01
This thesis work is placed between the scientific domain of ultrasound non-destructive testing and algorithm-architecture adequation. Ultrasound non-destructive testing includes a group of analysis techniques used in science and industry to evaluate the properties of a material, component, or system without causing damage. In order to characterise possible defects, determining their position, size and shape, imaging and reconstruction tools have been developed at CEA-LIST, within the CIVA software platform. Evolution of acquisition sensors implies a continuous growth of datasets and consequently more and more computing power is needed to maintain interactive reconstructions. General purpose processors (GPP) evolving towards parallelism and emerging architectures such as GPU allow large acceleration possibilities than can be applied to these algorithms. The main goal of the thesis is to evaluate the acceleration than can be obtained for two reconstruction algorithms on these architectures. These two algorithms differ in their parallelization scheme. The first one can be properly parallelized on GPP whereas on GPU, an intensive use of atomic instructions is required. Within the second algorithm, parallelism is easier to express, but loop ordering on GPP, as well as thread scheduling and a good use of shared memory on GPU are necessary in order to obtain efficient results. Different API or libraries, such as OpenMP, CUDA and OpenCL are evaluated through chosen benchmarks. An integration of both algorithms in the CIVA software platform is proposed and different issues related to code maintenance and durability are discussed. (author) [fr
Parallel SN algorithms in shared- and distributed-memory environments
International Nuclear Information System (INIS)
Haghighat, Alireza; Hunter, Melissa A.; Mattis, Ronald E.
1995-01-01
Different 2-D spatial domain partitioning Sn transport theory algorithms have been developed on the basis of the Block-Jacobi iterative scheme. These algorithms have been incorporated into TWOTRAN-II, and tested on a shared-memory CRAY Y-MP C90 and a distributed-memory IBM SP1. For a series of fixed source r-z geometry homogeneous problems, parallel efficiencies in a range of 50-90% are achieved on the C90 with 6 processors, and lower values (20-60%) are obtained on the SP1. It is demonstrated that better performance is attainable if one addresses issues such as convergence rate, load-balancing, and granularity for both architectures, as well as message passing (network bandwidth and latency) for SP1. (author). 17 refs, 4 figs
Parallel algorithms for interactive manipulation of digital terrain models
Davis, E. W.; Mcallister, D. F.; Nagaraj, V.
1988-01-01
Interactive three-dimensional graphics applications, such as terrain data representation and manipulation, require extensive arithmetic processing. Massively parallel machines are attractive for this application since they offer high computational rates, and grid connected architectures provide a natural mapping for grid based terrain models. Presented here are algorithms for data movement on the massive parallel processor (MPP) in support of pan and zoom functions over large data grids. It is an extension of earlier work that demonstrated real-time performance of graphics functions on grids that were equal in size to the physical dimensions of the MPP. When the dimensions of a data grid exceed the processing array size, data is packed in the array memory. Windows of the total data grid are interactively selected for processing. Movement of packed data is needed to distribute items across the array for efficient parallel processing. Execution time for data movement was found to exceed that for arithmetic aspects of graphics functions. Performance figures are given for routines written in MPP Pascal.
Parallel algorithms for finding cliques in a graph
International Nuclear Information System (INIS)
Szabo, S
2011-01-01
A clique is a subgraph in a graph that is complete in the sense that each two of its nodes are connected by an edge. Finding cliques in a given graph is an important procedure in discrete mathematical modeling. The paper will show how concepts such as splitting partitions, quasi coloring, node and edge dominance are related to clique search problems. In particular we will discuss the connection with parallel clique search algorithms. These concepts also suggest practical guide lines to inspect a given graph before starting a large scale search.
International Nuclear Information System (INIS)
Chen, C.M.; Lee, S.Y.
1995-01-01
The EM algorithm promises an estimated image with the maximal likelihood for 3D PET image reconstruction. However, due to its long computation time, the EM algorithm has not been widely used in practice. While several parallel implementations of the EM algorithm have been developed to make the EM algorithm feasible, they do not guarantee an optimal parallelization efficiency. In this paper, the authors propose a new parallel EM algorithm which maximizes the performance by optimizing data replication on a mesh-connected message-passing multiprocessor. To optimize data replication, the authors have formally derived the optimal allocation of shared data, group sizes, integration and broadcasting of replicated data as well as the scheduling of shared data accesses. The proposed parallel EM algorithm has been implemented on an iPSC/860 with 16 PEs. The experimental and theoretical results, which are consistent with each other, have shown that the proposed parallel EM algorithm could improve performance substantially over those using unoptimized data replication
Parallel Implementation and Scaling of an Adaptive Mesh Discrete Ordinates Algorithm for Transport
International Nuclear Information System (INIS)
Howell, L H
2004-01-01
Block-structured adaptive mesh refinement (AMR) uses a mesh structure built up out of locally-uniform rectangular grids. In the BoxLib parallel framework used by the Raptor code, each processor operates on one or more of these grids at each refinement level. The decomposition of the mesh into grids and the distribution of these grids among processors may change every few timesteps as a calculation proceeds. Finer grids use smaller timesteps than coarser grids, requiring additional work to keep the system synchronized and ensure conservation between different refinement levels. In a paper for NECDC 2002 I presented preliminary results on implementation of parallel transport sweeps on the AMR mesh, conjugate gradient acceleration, accuracy of the AMR solution, and scalar speedup of the AMR algorithm compared to a uniform fully-refined mesh. This paper continues with a more in-depth examination of the parallel scaling properties of the scheme, both in single-level and multi-level calculations. Both sweeping and setup costs are considered. The algorithm scales with acceptable performance to several hundred processors. Trends suggest, however, that this is the limit for efficient calculations with traditional transport sweeps, and that modifications to the sweep algorithm will be increasingly needed as job sizes in the thousands of processors become common
Directory of Open Access Journals (Sweden)
GORGUNOGLU, S.
2014-05-01
Full Text Available In analysis of minutiae based fingerprint systems, fingerprints needs to be pre-processed. The pre-processing is carried out to enhance the quality of the fingerprint and to obtain more accurate minutiae points. Reducing the pre-processing time is important for identification and verification in real time systems and especially for databases holding large fingerprints information. Parallel processing and parallel CPU computing can be considered as distribution of processes over multi core processor. This is done by using parallel programming techniques. Reducing the execution time is the main objective in parallel processing. In this study, pre-processing of minutiae based fingerprint system is implemented by parallel processing on multi core computers using OpenMP and on graphics processor using CUDA to improve execution time. The execution times and speedup ratios are compared with the one that of single core processor. The results show that by using parallel processing, execution time is substantially improved. The improvement ratios obtained for different pre-processing algorithms allowed us to make suggestions on the more suitable approaches for parallelization.
Parallel implementation of DNA sequences matching algorithms using PWM on GPU architecture.
Sharma, Rahul; Gupta, Nitin; Narang, Vipin; Mittal, Ankush
2011-01-01
Positional Weight Matrices (PWMs) are widely used in representation and detection of Transcription Factor Of Binding Sites (TFBSs) on DNA. We implement online PWM search algorithm over parallel architecture. A large PWM data can be processed on Graphic Processing Unit (GPU) systems in parallel which can help in matching sequences at a faster rate. Our method employs extensive usage of highly multithreaded architecture and shared memory of multi-cored GPU. An efficient use of shared memory is required to optimise parallel reduction in CUDA. Our optimised method has a speedup of 230-280x over linear implementation on GPU named GeForce GTX 280.
Creating IRT-Based Parallel Test Forms Using the Genetic Algorithm Method
Sun, Koun-Tem; Chen, Yu-Jen; Tsai, Shu-Yen; Cheng, Chien-Fen
2008-01-01
In educational measurement, the construction of parallel test forms is often a combinatorial optimization problem that involves the time-consuming selection of items to construct tests having approximately the same test information functions (TIFs) and constraints. This article proposes a novel method, genetic algorithm (GA), to construct parallel…
Fast parallel algorithms that compute transitive closure of a fuzzy relation
Kreinovich, Vladik YA.
1993-01-01
The notion of a transitive closure of a fuzzy relation is very useful for clustering in pattern recognition, for fuzzy databases, etc. The original algorithm proposed by L. Zadeh (1971) requires the computation time O(n(sup 4)), where n is the number of elements in the relation. In 1974, J. C. Dunn proposed a O(n(sup 2)) algorithm. Since we must compute n(n-1)/2 different values s(a, b) (a not equal to b) that represent the fuzzy relation, and we need at least one computational step to compute each of these values, we cannot compute all of them in less than O(n(sup 2)) steps. So, Dunn's algorithm is in this sense optimal. For small n, it is ok. However, for big n (e.g., for big databases), it is still a lot, so it would be desirable to decrease the computation time (this problem was formulated by J. Bezdek). Since this decrease cannot be done on a sequential computer, the only way to do it is to use a computer with several processors working in parallel. We show that on a parallel computer, transitive closure can be computed in time O((log(sub 2)(n))2).
Computational experience with a parallel algorithm for tetrangle inequality bound smoothing.
Rajan, K; Deo, N
1999-09-01
Determining molecular structure from interatomic distances is an important and challenging problem. Given a molecule with n atoms, lower and upper bounds on interatomic distances can usually be obtained only for a small subset of the 2(n(n-1)) atom pairs, using NMR. Given the bounds so obtained on the distances between some of the atom pairs, it is often useful to compute tighter bounds on all the 2(n(n-1)) pairwise distances. This process is referred to as bound smoothing. The initial lower and upper bounds for the pairwise distances not measured are usually assumed to be 0 and infinity. One method for bound smoothing is to use the limits imposed by the triangle inequality. The distance bounds so obtained can often be tightened further by applying the tetrangle inequality--the limits imposed on the six pairwise distances among a set of four atoms (instead of three for the triangle inequalities). The tetrangle inequality is expressed by the Cayley-Menger determinants. For every quadruple of atoms, each pass of the tetrangle inequality bound smoothing procedure finds upper and lower limits on each of the six distances in the quadruple. Applying the tetrangle inequalities to each of the (4n) quadruples requires O(n4) time. Here, we propose a parallel algorithm for bound smoothing employing the tetrangle inequality. Each pass of our algorithm requires O(n3 log n) time on a REW PRAM (Concurrent Read Exclusive Write Parallel Random Access Machine) with O(log(n)n) processors. An implementation of this parallel algorithm on the Intel Paragon XP/S and its performance are also discussed.
Parallel algorithm of real-time infrared image restoration based on total variation theory
Zhu, Ran; Li, Miao; Long, Yunli; Zeng, Yaoyuan; An, Wei
2015-10-01
Image restoration is a necessary preprocessing step for infrared remote sensing applications. Traditional methods allow us to remove the noise but penalize too much the gradients corresponding to edges. Image restoration techniques based on variational approaches can solve this over-smoothing problem for the merits of their well-defined mathematical modeling of the restore procedure. The total variation (TV) of infrared image is introduced as a L1 regularization term added to the objective energy functional. It converts the restoration process to an optimization problem of functional involving a fidelity term to the image data plus a regularization term. Infrared image restoration technology with TV-L1 model exploits the remote sensing data obtained sufficiently and preserves information at edges caused by clouds. Numerical implementation algorithm is presented in detail. Analysis indicates that the structure of this algorithm can be easily implemented in parallelization. Therefore a parallel implementation of the TV-L1 filter based on multicore architecture with shared memory is proposed for infrared real-time remote sensing systems. Massive computation of image data is performed in parallel by cooperating threads running simultaneously on multiple cores. Several groups of synthetic infrared image data are used to validate the feasibility and effectiveness of the proposed parallel algorithm. Quantitative analysis of measuring the restored image quality compared to input image is presented. Experiment results show that the TV-L1 filter can restore the varying background image reasonably, and that its performance can achieve the requirement of real-time image processing.
International Nuclear Information System (INIS)
Doster, J.M.; Sills, E.D.
1986-01-01
Current efforts are under way to develop and evaluate numerical algorithms for the parallel solution of the large sparse matrix equations associated with the finite difference representation of the macroscopic Navier-Stokes equations. Previous work has shown that these equations can be cast into smaller coupled matrix equations suitable for solution utilizing multiple computer processors operating in parallel. The individual processors themselves may exhibit parallelism through the use of vector pipelines. This wor, has concentrated on the one-dimensional drift flux form of the Navier-Stokes equations. Direct and iterative algorithms that may be suitable for implementation on parallel computer architectures are evaluated in terms of accuracy and overall execution speed. This work has application to engineering and training simulations, on-line process control systems, and engineering workstations where increased computational speeds are required
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.
Catarinucci, Luca; Tarricone, Luciano
2009-01-01
The finite difference time domain method (FDTD) is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems and, among them, those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high space resolution adopted, so that strong memory and central processing unit power requirements have to be satisfied. To better afford the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly-efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
Multiscale Architectures and Parallel Algorithms for Video Object Tracking
2011-10-01
larger number of cores using the IBM QS22 Blade for handling higher video processing workloads (but at higher cost per core), low power consumption and...Cell/B.E. Blade processors which have a lot more main memory but also higher power consumption . More detailed performance figures for HD and SD video...Parallelism in Algorithms and Architectures, pages 289–298, 2007. [3] S. Ali and M. Shah. COCOA - Tracking in aerial imagery. In Daniel J. Henry
Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST Databases.
Mata-Montero, Manrique; Shalaby, Nabil; Sheppard, Bradley
2013-01-01
Obtaining unique oligos from an EST database is a problem of great importance in bioinformatics, particularly in the discovery of new genes and the mapping of the human genome. Many algorithms have been developed to find unique oligos, many of which are much less time consuming than the traditional brute force approach. An algorithm was presented by Zheng et al. (2004) which finds the solution of the unique oligos search problem efficiently. We implement this algorithm as well as several new algorithms based on some theorems included in this paper. We demonstrate how, with these new algorithms, we can obtain unique oligos much faster than with previous ones. We parallelize these new algorithms to further improve the time of finding unique oligos. All algorithms are run on ESTs obtained from a Barley EST database.
Directory of Open Access Journals (Sweden)
Vasiliy Yu. Meltsov
2012-05-01
Full Text Available This paper presents the results of the development of one of the modules of the system verification of parallel algorithms that are used to verify the inference engine. This module is designed to build the specification requirements, the feasibility of which on the algorithm is necessary to prove (test.
International Nuclear Information System (INIS)
Al-Hallaq, A.; Amin, S.
1998-01-01
This paper introduces a new parallel algorithm and its simulation on a hypercube simulator for the low pass digital image filtering using a systolic array. This new algorithm is faster than the old one (Amin, 1988). This is due to the the fact that the old algorithm carries out the addition operations in a sequential mode. But in our new design these addition operations are divided into tow groups, which can be performed in parallel. One group will be performed on one half of the systolic array and the other on the second half, that is, by folding. This parallelism reduces the time required for the whole process by almost quarter the time of the old algorithm.(authors). 18 refs., 3 figs
International Nuclear Information System (INIS)
Kirk, B.L.; Azmy, Y.Y.
1992-01-01
In this paper the one-group, steady-state neutron diffusion equation in two-dimensional Cartesian geometry is solved using the nodal integral method. The discrete variable equations comprise loosely coupled sets of equations representing the nodal balance of neutrons, as well as neutron current continuity along rows or columns of computational cells. An iterative algorithm that is more suitable for solving large problems concurrently is derived based on the decomposition of the spatial domain and is accelerated using successive overrelaxation. This algorithm is very well suited for parallel computers, especially since the spatial domain decomposition occurs naturally, so that the number of iterations required for convergence does not depend on the number of processors participating in the calculation. Implementation of the authors' algorithm on the Intel iPSC/2 hypercube and Sequent Balance 8000 parallel computer is presented, and measured speedup and efficiency for test problems are reported. The results suggest that the efficiency of the hypercube quickly deteriorates when many processors are used, while the Sequent Balance retains very high efficiency for a comparable number of participating processors. This leads to the conjecture that message-passing parallel computers are not as well suited for this algorithm as shared-memory machines
Jiang, Y.; Xing, H. L.
2016-12-01
Micro-seismic events induced by water injection, mining activity or oil/gas extraction are quite informative, the interpretation of which can be applied for the reconstruction of underground stress and monitoring of hydraulic fracturing progress in oil/gas reservoirs. The source characterises and locations are crucial parameters that required for these purposes, which can be obtained through the waveform matching inversion (WMI) method. Therefore it is imperative to develop a WMI algorithm with high accuracy and convergence speed. Heuristic algorithm, as a category of nonlinear method, possesses a very high convergence speed and good capacity to overcome local minimal values, and has been well applied for many areas (e.g. image processing, artificial intelligence). However, its effectiveness for micro-seismic WMI is still poorly investigated; very few literatures exits that addressing this subject. In this research an advanced heuristic algorithm, gravitational search algorithm (GSA) , is proposed to estimate the focal mechanism (angle of strike, dip and rake) and source locations in three dimension. Unlike traditional inversion methods, the heuristic algorithm inversion does not require the approximation of green function. The method directly interacts with a CPU parallelized finite difference forward modelling engine, and updating the model parameters under GSA criterions. The effectiveness of this method is tested with synthetic data form a multi-layered elastic model; the results indicate GSA can be well applied on WMI and has its unique advantages. Keywords: Micro-seismicity, Waveform matching inversion, gravitational search algorithm, parallel computation
Parallel algorithms for online trackfinding at PANDA
Energy Technology Data Exchange (ETDEWEB)
Bianchi, Ludovico; Ritman, James; Stockmanns, Tobias [IKP, Forschungszentrum Juelich GmbH (Germany); Herten, Andreas [JSC, Forschungszentrum Juelich GmbH (Germany); Collaboration: PANDA-Collaboration
2016-07-01
The PANDA experiment, one of the four scientific pillars of the FAIR facility currently in construction in Darmstadt, is a next-generation particle detector that will study collisions of antiprotons with beam momenta of 1.5-15 GeV/c on a fixed proton target. Because of the broad physics scope and the similar signature of signal and background events, PANDA's strategy for data acquisition is to continuously record data from the whole detector and use this global information to perform online event reconstruction and filtering. A real-time rejection factor of up to 1000 must be achieved to match the incoming data rate for offline storage, making all components of the data processing system computationally very challenging. Online particle track identification and reconstruction is an essential step, since track information is used as input in all following phases. Online tracking algorithms must ensure a delicate balance between high tracking efficiency and quality, and minimal computational footprint. For this reason, a massively parallel solution exploiting multiple Graphic Processing Units (GPUs) is under investigation. The talk presents the core concepts of the algorithms being developed for primary trackfinding, along with details of their implementation on GPUs.
Automatic mesh refinement and parallel load balancing for Fokker-Planck-DSMC algorithm
Küchlin, Stephan; Jenny, Patrick
2018-06-01
Recently, a parallel Fokker-Planck-DSMC algorithm for rarefied gas flow simulation in complex domains at all Knudsen numbers was developed by the authors. Fokker-Planck-DSMC (FP-DSMC) is an augmentation of the classical DSMC algorithm, which mitigates the near-continuum deficiencies in terms of computational cost of pure DSMC. At each time step, based on a local Knudsen number criterion, the discrete DSMC collision operator is dynamically switched to the Fokker-Planck operator, which is based on the integration of continuous stochastic processes in time, and has fixed computational cost per particle, rather than per collision. In this contribution, we present an extension of the previous implementation with automatic local mesh refinement and parallel load-balancing. In particular, we show how the properties of discrete approximations to space-filling curves enable an efficient implementation. Exemplary numerical studies highlight the capabilities of the new code.
Energy Technology Data Exchange (ETDEWEB)
Ellison, C. Leland [PPPL; Finn, J. M. [LANL; Qin, H. [PPPL; Tang, William M. [PPPL
2014-10-01
Structure-preserving algorithms obtained via discrete variational principles exhibit strong promise for the calculation of guiding center test particle trajectories. The non-canonical Hamiltonian structure of the guiding center equations forms a novel and challenging context for geometric integration. To demonstrate the practical relevance of these methods, a prototypical variational midpoint algorithm is applied to an experimental magnetic equilibrium. The stability characteristics, conservation properties, and implementation requirements associated with the variational algorithms are addressed. Furthermore, computational run time is reduced for large numbers of particles by parallelizing the calculation on GPU hardware.
Directory of Open Access Journals (Sweden)
Alexander B. Bakulev
2012-11-01
Full Text Available This article deals with mathematical models and algorithms, providing mobility of sequential programs parallel representation on the high-level language, presents formal model of operation environment processes management, based on the proposed model of programs parallel representation, presenting computation process on the base of multi-core processors.
International Nuclear Information System (INIS)
Yang Lei; Gong Xueyu; Wang Ling
2013-01-01
Combined with standard mathematical model for evaluating quality of deploying results, a new high-performance parallel algorithm for source pencils' deployment was obtained by using parallel plant growth simulation algorithm which was completely parallelized with CUDA execute model, and the corresponding code can run on GPU. Based on such work, several instances in various scales were used to test the new version of algorithm. The results show that, based on the advantage of old versions. the performance of new one is improved more than 500 times comparing with the CPU version, and also 30 times with the CPU plus GPU hybrid version. The computation time of new version is less than ten minutes for the irradiator of which the activity is less than 111 PBq. For a single GTX275 GPU, the maximum computing power of new version is no more than 167 PBq as well as the computation time is no more than 25 minutes, and for multiple GPUs, the power can be improved more. Overall, the new version of algorithm running on GPU can satisfy the requirement of source pencils' deployment of any domestic irradiator, and it is of high competitiveness. (authors)
Directory of Open Access Journals (Sweden)
Yu Huang
Full Text Available Parameter estimation for fractional-order chaotic systems is an important issue in fractional-order chaotic control and synchronization and could be essentially formulated as a multidimensional optimization problem. A novel algorithm called quantum parallel particle swarm optimization (QPPSO is proposed to solve the parameter estimation for fractional-order chaotic systems. The parallel characteristic of quantum computing is used in QPPSO. This characteristic increases the calculation of each generation exponentially. The behavior of particles in quantum space is restrained by the quantum evolution equation, which consists of the current rotation angle, individual optimal quantum rotation angle, and global optimal quantum rotation angle. Numerical simulation based on several typical fractional-order systems and comparisons with some typical existing algorithms show the effectiveness and efficiency of the proposed algorithm.
Eddy current testing probe optimization using a parallel genetic algorithm
Directory of Open Access Journals (Sweden)
Dolapchiev Ivaylo
2008-01-01
Full Text Available This paper uses the developed parallel version of Michalewicz's Genocop III Genetic Algorithm (GA searching technique to optimize the coil geometry of an eddy current non-destructive testing probe (ECTP. The electromagnetic field is computed using FEMM 2D finite element code. The aim of this optimization was to determine coil dimensions and positions that improve ECTP sensitivity to physical properties of the tested devices.
Efficient parallel algorithms for string editing and related problems
Apostolico, Alberto; Atallah, Mikhail J.; Larmore, Lawrence; Mcfaddin, H. S.
1988-01-01
The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x or the substitution of a symbol x with another symbol. This problem has a well known O((absolute value of x)(absolute value of y)) time sequential solution (25). The efficient Program Requirements Analysis Methods (PRAM) parallel algorithms for the string editing problem are given. If m = ((absolute value of x),(absolute value of y)) and n = max((absolute value of x),(absolute value of y)), then the CREW bound is O (log m log n) time with O (mn/log m) processors. In all algorithms, space is O (mn).
Massively Parallel and Scalable Implicit Time Integration Algorithms for Structural Dynamics
Farhat, Charbel
1997-01-01
Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because of the following additional facts: (a) explicit schemes are easier to parallelize than implicit ones, and (b) explicit schemes induce short range interprocessor communications that are relatively inexpensive, while the factorization methods used in most implicit schemes induce long range interprocessor communications that often ruin the sought-after speed-up. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet be offset by the speed of the currently available parallel hardware. Therefore, it is essential to develop efficient alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating the low-frequency dynamics of aerospace structures.
Directory of Open Access Journals (Sweden)
Fang Huang
2016-06-01
Full Text Available In some digital Earth engineering applications, spatial interpolation algorithms are required to process and analyze large amounts of data. Due to its powerful computing capacity, heterogeneous computing has been used in many applications for data processing in various fields. In this study, we explore the design and implementation of a parallel universal kriging spatial interpolation algorithm using the OpenCL programming model on heterogeneous computing platforms for massive Geo-spatial data processing. This study focuses primarily on transforming the hotspots in serial algorithms, i.e., the universal kriging interpolation function, into the corresponding kernel function in OpenCL. We also employ parallelization and optimization techniques in our implementation to improve the code performance. Finally, based on the results of experiments performed on two different high performance heterogeneous platforms, i.e., an NVIDIA graphics processing unit system and an Intel Xeon Phi system (MIC, we show that the parallel universal kriging algorithm can achieve the highest speedup of up to 40× with a single computing device and the highest speedup of up to 80× with multiple devices.
Hou, Zhenlong; Huang, Danian
2017-09-01
In this paper, we make a study on the inversion of probability tomography (IPT) with gravity gradiometry data at first. The space resolution of the results is improved by multi-tensor joint inversion, depth weighting matrix and the other methods. Aiming at solving the problems brought by the big data in the exploration, we present the parallel algorithm and the performance analysis combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) based on Graphics Processing Unit (GPU) accelerating. In the test of the synthetic model and real data from Vinton Dome, we get the improved results. It is also proved that the improved inversion algorithm is effective and feasible. The performance of parallel algorithm we designed is better than the other ones with CUDA. The maximum speedup could be more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger scale of data and the new analysis method is practical.
International Nuclear Information System (INIS)
Kim, Heungseob; Kim, Pansoo
2017-01-01
To maximize the reliability of a system, the traditional reliability–redundancy allocation problem (RRAP) determines the component reliability and level of redundancy for each subsystem. This paper proposes an advanced RRAP that also considers the optimal redundancy strategy, either active or cold standby. In addition, new examples are presented for it. Furthermore, the exact reliability function for a cold standby redundant subsystem with an imperfect detector/switch is suggested, and is expected to replace the previous approximating model that has been used in most related studies. A parallel genetic algorithm for solving the RRAP as a mixed-integer nonlinear programming model is presented, and its performance is compared with those of previous studies by using numerical examples on three benchmark problems. - Highlights: • Optimal strategy is proposed to solve reliability redundancy allocation problem. • The redundancy strategy uses parallel genetic algorithm. • Improved reliability function for a cold standby subsystem is suggested. • Proposed redundancy strategy enhances the system reliability.
HPC-NMF: A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization
Energy Technology Data Exchange (ETDEWEB)
2016-08-22
NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for $\\WW$ and $\\HH$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementation, our algorithm is also flexible: It performs well for both dense and sparse matrices, and allows the user to choose any one of the multiple algorithms for solving the updates to low rank factors $\\WW$ and $\\HH$ within the alternating iterations.
Romano, Paul Kollath
Monte Carlo particle transport methods are being considered as a viable option for high-fidelity simulation of nuclear reactors. While Monte Carlo methods offer several potential advantages over deterministic methods, there are a number of algorithmic shortcomings that would prevent their immediate adoption for full-core analyses. In this thesis, algorithms are proposed both to ameliorate the degradation in parallel efficiency typically observed for large numbers of processors and to offer a means of decomposing large tally data that will be needed for reactor analysis. A nearest-neighbor fission bank algorithm was proposed and subsequently implemented in the OpenMC Monte Carlo code. A theoretical analysis of the communication pattern shows that the expected cost is O( N ) whereas traditional fission bank algorithms are O(N) at best. The algorithm was tested on two supercomputers, the Intrepid Blue Gene/P and the Titan Cray XK7, and demonstrated nearly linear parallel scaling up to 163,840 processor cores on a full-core benchmark problem. An algorithm for reducing network communication arising from tally reduction was analyzed and implemented in OpenMC. The proposed algorithm groups only particle histories on a single processor into batches for tally purposes---in doing so it prevents all network communication for tallies until the very end of the simulation. The algorithm was tested, again on a full-core benchmark, and shown to reduce network communication substantially. A model was developed to predict the impact of load imbalances on the performance of domain decomposed simulations. The analysis demonstrated that load imbalances in domain decomposed simulations arise from two distinct phenomena: non-uniform particle densities and non-uniform spatial leakage. The dominant performance penalty for domain decomposition was shown to come from these physical effects rather than insufficient network bandwidth or high latency. The model predictions were verified with
Directory of Open Access Journals (Sweden)
Dawen Xia
2018-01-01
Full Text Available Frequent pattern mining is an effective approach for spatiotemporal association analysis of mobile trajectory big data in data-driven intelligent transportation systems. While existing parallel algorithms have been successfully applied to frequent pattern mining of large-scale trajectory data, two major challenges are how to overcome the inherent defects of Hadoop to cope with taxi trajectory big data including massive small files and how to discover the implicitly spatiotemporal frequent patterns with MapReduce. To conquer these challenges, this paper presents a MapReduce-based Parallel Frequent Pattern growth (MR-PFP algorithm to analyze the spatiotemporal characteristics of taxi operating using large-scale taxi trajectories with massive small file processing strategies on a Hadoop platform. More specifically, we first implement three methods, that is, Hadoop Archives (HAR, CombineFileInputFormat (CFIF, and Sequence Files (SF, to overcome the existing defects of Hadoop and then propose two strategies based on their performance evaluations. Next, we incorporate SF into Frequent Pattern growth (FP-growth algorithm and then implement the optimized FP-growth algorithm on a MapReduce framework. Finally, we analyze the characteristics of taxi operating in both spatial and temporal dimensions by MR-PFP in parallel. The results demonstrate that MR-PFP is superior to existing Parallel FP-growth (PFP algorithm in efficiency and scalability.
Highly parallel algorithm for high pT physics at FAIR-CBM
International Nuclear Information System (INIS)
Fueloep, A; Vesztergombi, G
2010-01-01
The limitations of presently available data on p T range are discussed and planned future upgrades are outlined. Special attention is given to the FAIR-CBM experiment as a unique high luminosity facility for future continuation of the measurements at very high p T with emphasis on the so-called mosaic trigger system to use the highly parallel online algorithm.
Parallel particle swarm optimization algorithm in nuclear problems
International Nuclear Information System (INIS)
Waintraub, Marcel; Pereira, Claudio M.N.A.; Schirru, Roberto
2009-01-01
Particle Swarm Optimization (PSO) is a population-based metaheuristic (PBM), in which solution candidates evolve through simulation of a simplified social adaptation model. Putting together robustness, efficiency and simplicity, PSO has gained great popularity. Many successful applications of PSO are reported, in which PSO demonstrated to have advantages over other well-established PBM. However, computational costs are still a great constraint for PSO, as well as for all other PBMs, especially in optimization problems with time consuming objective functions. To overcome such difficulty, parallel computation has been used. The default advantage of parallel PSO (PPSO) is the reduction of computational time. Master-slave approaches, exploring this characteristic are the most investigated. However, much more should be expected. It is known that PSO may be improved by more elaborated neighborhood topologies. Hence, in this work, we develop several different PPSO algorithms exploring the advantages of enhanced neighborhood topologies implemented by communication strategies in multiprocessor architectures. The proposed PPSOs have been applied to two complex and time consuming nuclear engineering problems: reactor core design and fuel reload optimization. After exhaustive experiments, it has been concluded that: PPSO still improves solutions after many thousands of iterations, making prohibitive the efficient use of serial (non-parallel) PSO in such kind of realworld problems; and PPSO with more elaborated communication strategies demonstrated to be more efficient and robust than the master-slave model. Advantages and peculiarities of each model are carefully discussed in this work. (author)
Tilton, James C.; Plaza, Antonio J. (Editor); Chang, Chein-I. (Editor)
2008-01-01
The hierarchical image segmentation algorithm (referred to as HSEG) is a hybrid of hierarchical step-wise optimization (HSWO) and constrained spectral clustering that produces a hierarchical set of image segmentations. HSWO is an iterative approach to region grooving segmentation in which the optimal image segmentation is found at N(sub R) regions, given a segmentation at N(sub R+1) regions. HSEG's addition of constrained spectral clustering makes it a computationally intensive algorithm, for all but, the smallest of images. To counteract this, a computationally efficient recursive approximation of HSEG (called RHSEG) has been devised. Further improvements in processing speed are obtained through a parallel implementation of RHSEG. This chapter describes this parallel implementation and demonstrates its computational efficiency on a Landsat Thematic Mapper test scene.
A Parallel Decoding Algorithm for Short Polar Codes Based on Error Checking and Correcting
Pan, Xiaofei; Pan, Kegang; Ye, Zhan; Gong, Chao
2014-01-01
We propose a parallel decoding algorithm based on error checking and correcting to improve the performance of the short polar codes. In order to enhance the error-correcting capacity of the decoding algorithm, we first derive the error-checking equations generated on the basis of the frozen nodes, and then we introduce the method to check the errors in the input nodes of the decoder by the solutions of these equations. In order to further correct those checked errors, we adopt the method of modifying the probability messages of the error nodes with constant values according to the maximization principle. Due to the existence of multiple solutions of the error-checking equations, we formulate a CRC-aided optimization problem of finding the optimal solution with three different target functions, so as to improve the accuracy of error checking. Besides, in order to increase the throughput of decoding, we use a parallel method based on the decoding tree to calculate probability messages of all the nodes in the decoder. Numerical results show that the proposed decoding algorithm achieves better performance than that of some existing decoding algorithms with the same code length. PMID:25540813
GPS Signal Offset Detection and Noise Strength Estimation in a Parallel Kalman Filter Algorithm
National Research Council Canada - National Science Library
Vanek, Barry
1999-01-01
.... The variance of the noise process is estimated and provided to the second algorithm, a parallel Kalman filter structure, which then adapts to changes in the real-world measurement noise strength...
A parallel adaptive quantum genetic algorithm for the controllability of arbitrary networks.
Li, Yuhong; Gong, Guanghong; Li, Ni
2018-01-01
In this paper, we propose a novel algorithm-parallel adaptive quantum genetic algorithm-which can rapidly determine the minimum control nodes of arbitrary networks with both control nodes and state nodes. The corresponding network can be fully controlled with the obtained control scheme. We transformed the network controllability issue into a combinational optimization problem based on the Popov-Belevitch-Hautus rank condition. A set of canonical networks and a list of real-world networks were experimented. Comparison results demonstrated that the algorithm was more ideal to optimize the controllability of networks, especially those larger-size networks. We demonstrated subsequently that there were links between the optimal control nodes and some network statistical characteristics. The proposed algorithm provides an effective approach to improve the controllability optimization of large networks or even extra-large networks with hundreds of thousands nodes.
International Nuclear Information System (INIS)
Zee, S.K.
1987-01-01
A numeric algorithm and an associated computer code were developed for the rapid solution of the finite-difference method representation of the few-group neutron-diffusion equations on parallel computers. Applications of the numeric algorithm on both SIMD (vector pipeline) and MIMD/SIMD (multi-CUP/vector pipeline) architectures were explored. The algorithm was successfully implemented in the two-group, 3-D neutron diffusion computer code named DIFPAR3D (DIFfusion PARallel 3-Dimension). Numerical-solution techniques used in the code include the Chebyshev polynomial acceleration technique in conjunction with the power method of outer iteration. For inner iterations, a parallel form of red-black (cyclic) line SOR with automated determination of group dependent relaxation factors and iteration numbers required to achieve specified inner iteration error tolerance is incorporated. The code employs a macroscopic depletion model with trace capability for selected fission products' transients and critical boron. In addition to this, moderator and fuel temperature feedback models are also incorporated into the DIFPAR3D code, for realistic simulation of power reactor cores. The physics models used were proven acceptable in separate benchmarking studies
Parallel Atomistic Simulations
Energy Technology Data Exchange (ETDEWEB)
HEFFELFINGER,GRANT S.
2000-01-18
Algorithms developed to enable the use of atomistic molecular simulation methods with parallel computers are reviewed. Methods appropriate for bonded as well as non-bonded (and charged) interactions are included. While strategies for obtaining parallel molecular simulations have been developed for the full variety of atomistic simulation methods, molecular dynamics and Monte Carlo have received the most attention. Three main types of parallel molecular dynamics simulations have been developed, the replicated data decomposition, the spatial decomposition, and the force decomposition. For Monte Carlo simulations, parallel algorithms have been developed which can be divided into two categories, those which require a modified Markov chain and those which do not. Parallel algorithms developed for other simulation methods such as Gibbs ensemble Monte Carlo, grand canonical molecular dynamics, and Monte Carlo methods for protein structure determination are also reviewed and issues such as how to measure parallel efficiency, especially in the case of parallel Monte Carlo algorithms with modified Markov chains are discussed.
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD
Gropp, W. D.; Keyes, D. E.; McInnes, L. C.; Tidriri, M. D.
1998-01-01
Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, "routine" parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (Psi-NKS) algorithmic framework is presented as an answer. We show that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, Psi-NKS can simultaneously deliver: globalized, asymptotically rapid convergence through adaptive pseudo- transient continuation and Newton's method-, reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per- processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of Psi-NKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of Psi-NKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
Algorithm of parallel: hierarchical transformation and its implementation on FPGA
Timchenko, Leonid I.; Petrovskiy, Mykola S.; Kokryatskay, Natalia I.; Barylo, Alexander S.; Dembitska, Sofia V.; Stepanikuk, Dmytro S.; Suleimenov, Batyrbek; Zyska, Tomasz; Uvaysova, Svetlana; Shedreyeva, Indira
2017-08-01
In this paper considers the algorithm of laser beam spots image classification in atmospheric-optical transmission systems. It discusses the need for images filtering using adaptive methods, using, for example, parallel-hierarchical networks. The article also highlights the need to create high-speed memory devices for such networks. Implementation and simulation results of the developed method based on the PLD are demonstrated, which shows that the presented method gives 15-20% better prediction results than similar methods.
International Nuclear Information System (INIS)
Pereira, Claudio M.N.A.; Lapa, Celso M.F.
2003-01-01
This work extends the research related to generic algorithms (GA) in core design optimization problems, which basic investigations were presented in previous work. Here we explore the use of the Island Genetic Algorithm (IGA), a coarse-grained parallel GA model, comparing its performance to that obtained by the application of a traditional non-parallel GA. The optimization problem consists on adjusting several reactor cell parameters, such as dimensions, enrichment and materials, in order to minimize the average peak-factor in a 3-enrichment zone reactor, considering restrictions on the average thermal flux, criticality and sub-moderation. Our IGA implementation runs as a distributed application on a conventional local area network (LAN), avoiding the use of expensive parallel computers or architectures. After exhaustive experiments, taking more than 1500 h in 550 MHz personal computers, we have observed that the IGA provided gains not only in terms of computational time, but also in the optimization outcome. Besides, we have also realized that, for such kind of problem, which fitness evaluation is itself time consuming, the time overhead in the IGA, due to the communication in LANs, is practically imperceptible, leading to the conclusion that the use of expensive parallel computers or architecture can be avoided
Parallel CFD Algorithms for Aerodynamical Flow Solvers on Unstructured Meshes. Parts 1 and 2
Barth, Timothy J.; Kwak, Dochan (Technical Monitor)
1995-01-01
The Advisory Group for Aerospace Research and Development (AGARD) has requested my participation in the lecture series entitled Parallel Computing in Computational Fluid Dynamics to be held at the von Karman Institute in Brussels, Belgium on May 15-19, 1995. In addition, a request has been made from the US Coordinator for AGARD at the Pentagon for NASA Ames to hold a repetition of the lecture series on October 16-20, 1995. I have been asked to be a local coordinator for the Ames event. All AGARD lecture series events have attendance limited to NATO allied countries. A brief of the lecture series is provided in the attached enclosure. Specifically, I have been asked to give two lectures of approximately 75 minutes each on the subject of parallel solution techniques for the fluid flow equations on unstructured meshes. The title of my lectures is "Parallel CFD Algorithms for Aerodynamical Flow Solvers on Unstructured Meshes" (Parts I-II). The contents of these lectures will be largely review in nature and will draw upon previously published work in this area. Topics of my lectures will include: (1) Mesh partitioning algorithms. Recursive techniques based on coordinate bisection, Cuthill-McKee level structures, and spectral bisection. (2) Newton's method for large scale CFD problems. Size and complexity estimates for Newton's method, modifications for insuring global convergence. (3) Techniques for constructing the Jacobian matrix. Analytic and numerical techniques for Jacobian matrix-vector products, constructing the transposed matrix, extensions to optimization and homotopy theories. (4) Iterative solution algorithms. Practical experience with GIVIRES and BICG-STAB matrix solvers. (5) Parallel matrix preconditioning. Incomplete Lower-Upper (ILU) factorization, domain-decomposed ILU, approximate Schur complement strategies.
Optimization Design by Genetic Algorithm Controller for Trajectory Control of a 3-RRR Parallel Robot
Directory of Open Access Journals (Sweden)
Lianchao Sheng
2018-01-01
Full Text Available In order to improve the control precision and robustness of the existing proportion integration differentiation (PID controller of a 3-Revolute–Revolute–Revolute (3-RRR parallel robot, a variable PID parameter controller optimized by a genetic algorithm controller is proposed in this paper. Firstly, the inverse kinematics model of the 3-RRR parallel robot was established according to the vector method, and the motor conversion matrix was deduced. Then, the error square integral was chosen as the fitness function, and the genetic algorithm controller was designed. Finally, the control precision of the new controller was verified through the simulation model of the 3-RRR planar parallel robot—built in SimMechanics—and the robustness of the new controller was verified by adding interference. The results show that compared with the traditional PID controller, the new controller designed in this paper has better control precision and robustness, which provides the basis for practical application.
An improved non-uniformity correction algorithm and its GPU parallel implementation
Cheng, Kuanhong; Zhou, Huixin; Qin, Hanlin; Zhao, Dong; Qian, Kun; Rong, Shenghui
2018-05-01
The performance of SLP-THP based non-uniformity correction algorithm is seriously affected by the result of SLP filter, which always leads to image blurring and ghosting artifacts. To address this problem, an improved SLP-THP based non-uniformity correction method with curvature constraint was proposed. Here we put forward a new way to estimate spatial low frequency component. First, the details and contours of input image were obtained respectively by minimizing local Gaussian curvature and mean curvature of image surface. Then, the guided filter was utilized to combine these two parts together to get the estimate of spatial low frequency component. Finally, we brought this SLP component into SLP-THP method to achieve non-uniformity correction. The performance of proposed algorithm was verified by several real and simulated infrared image sequences. The experimental results indicated that the proposed algorithm can reduce the non-uniformity without detail losing. After that, a GPU based parallel implementation that runs 150 times faster than CPU was presented, which showed the proposed algorithm has great potential for real time application.
An image-space parallel convolution filtering algorithm based on shadow map
Li, Hua; Yang, Huamin; Zhao, Jianping
2017-07-01
Shadow mapping is commonly used in real-time rendering. In this paper, we presented an accurate and efficient method of soft shadows generation from planar area lights. First this method generated a depth map from light's view, and analyzed the depth-discontinuities areas as well as shadow boundaries. Then these areas were described as binary values in the texture map called binary light-visibility map, and a parallel convolution filtering algorithm based on GPU was enforced to smooth out the boundaries with a box filter. Experiments show that our algorithm is an effective shadow map based method that produces perceptually accurate soft shadows in real time with more details of shadow boundaries compared with the previous works.
Mesh Partitioning Algorithm Based on Parallel Finite Element Analysis and Its Actualization
Directory of Open Access Journals (Sweden)
Lei Zhang
2013-01-01
Full Text Available In parallel computing based on finite element analysis, domain decomposition is a key technique for its preprocessing. Generally, a domain decomposition of a mesh can be realized through partitioning of a graph which is converted from a finite element mesh. This paper discusses the method for graph partitioning and the way to actualize mesh partitioning. Relevant softwares are introduced, and the data structure and key functions of Metis and ParMetis are introduced. The writing, compiling, and testing of the mesh partitioning interface program based on these key functions are performed. The results indicate some objective law and characteristics to guide the users who use the graph partitioning algorithm and software to write PFEM program, and ideal partitioning effects can be achieved by actualizing mesh partitioning through the program. The interface program can also be used directly by the engineering researchers as a module of the PFEM software. So that it can reduce the application of the threshold of graph partitioning algorithm, improve the calculation efficiency, and promote the application of graph theory and parallel computing.
International Nuclear Information System (INIS)
Laurent, C.; Chassery, J.M.; Peyrin, F.; Girerd, C.
1996-01-01
This paper deals with the parallel implementations of reconstruction methods in 3D tomography. 3D tomography requires voluminous data and long computation times. Parallel computing, on MIMD computers, seems to be a good approach to manage this problem. In this study, we present the different steps of the parallelization on an abstract parallel computer. Depending on the method, we use two main approaches to parallelize the algorithms: the local approach and the global approach. Experimental results on MIMD computers are presented. Two 3D images reconstructed from realistic data are showed
Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena
2017-02-01
In real-life, multi-objective engineering design problems are very tough and time consuming optimization problems due to their high degree of nonlinearities, complexities and inhomogeneity. Nature-inspired based multi-objective optimization algorithms are now becoming popular for solving multi-objective engineering design problems. This paper proposes original multi-objective Bat algorithm (MOBA) and its extended form, namely, novel parallel hybrid multi-objective Bat algorithm (PHMOBA) to generate shortest length Golomb ruler called optimal Golomb ruler (OGR) sequences at a reasonable computation time. The OGRs found their application in optical wavelength division multiplexing (WDM) systems as channel-allocation algorithm to reduce the four-wave mixing (FWM) crosstalk. The performances of both the proposed algorithms to generate OGRs as optical WDM channel-allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations conclude that the proposed parallel hybrid multi-objective Bat algorithm works efficiently as compared to original multi-objective Bat algorithm and other existing algorithms to generate OGRs for optical WDM systems. The algorithm PHMOBA to generate OGRs, has higher convergence and success rate than original MOBA. The efficiency improvement of proposed PHMOBA to generate OGRs up to 20-marks, in terms of ruler length and total optical channel bandwidth (TBW) is 100 %, whereas for original MOBA is 85 %. Finally the implications for further research are also discussed.
Parallel Implicit Algorithms for CFD
Keyes, David E.
1998-01-01
The main goal of this project was efficient distributed parallel and workstation cluster implementations of Newton-Krylov-Schwarz (NKS) solvers for implicit Computational Fluid Dynamics (CFD.) "Newton" refers to a quadratically convergent nonlinear iteration using gradient information based on the true residual, "Krylov" to an inner linear iteration that accesses the Jacobian matrix only through highly parallelizable sparse matrix-vector products, and "Schwarz" to a domain decomposition form of preconditioning the inner Krylov iterations with primarily neighbor-only exchange of data between the processors. Prior experience has established that Newton-Krylov methods are competitive solvers in the CFD context and that Krylov-Schwarz methods port well to distributed memory computers. The combination of the techniques into Newton-Krylov-Schwarz was implemented on 2D and 3D unstructured Euler codes on the parallel testbeds that used to be at LaRC and on several other parallel computers operated by other agencies or made available by the vendors. Early implementations were made directly in Massively Parallel Integration (MPI) with parallel solvers we adapted from legacy NASA codes and enhanced for full NKS functionality. Later implementations were made in the framework of the PETSC library from Argonne National Laboratory, which now includes pseudo-transient continuation Newton-Krylov-Schwarz solver capability (as a result of demands we made upon PETSC during our early porting experiences). A secondary project pursued with funding from this contract was parallel implicit solvers in acoustics, specifically in the Helmholtz formulation. A 2D acoustic inverse problem has been solved in parallel within the PETSC framework.
1982-01-01
Parallel Computations focuses on parallel computation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallel computers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed.Comprised of 13 chapters, this volume begins by classifying parallel computers and describing techn
Parallel implementation of D-Phylo algorithm for maximum likelihood clusters.
Malik, Shamita; Sharma, Dolly; Khatri, Sunil Kumar
2017-03-01
This study explains a newly developed parallel algorithm for phylogenetic analysis of DNA sequences. The newly designed D-Phylo is a more advanced algorithm for phylogenetic analysis using maximum likelihood approach. The D-Phylo while misusing the seeking capacity of k -means keeps away from its real constraint of getting stuck at privately conserved motifs. The authors have tested the behaviour of D-Phylo on Amazon Linux Amazon Machine Image(Hardware Virtual Machine)i2.4xlarge, six central processing unit, 122 GiB memory, 8 × 800 Solid-state drive Elastic Block Store volume, high network performance up to 15 processors for several real-life datasets. Distributing the clusters evenly on all the processors provides us the capacity to accomplish a near direct speed if there should arise an occurrence of huge number of processors.
Experiences with serial and parallel algorithms for channel routing using simulated annealing
Brouwer, Randall Jay
1988-01-01
Two algorithms for channel routing using simulated annealing are presented. Simulated annealing is an optimization methodology which allows the solution process to back up out of local minima that may be encountered by inappropriate selections. By properly controlling the annealing process, it is very likely that the optimal solution to an NP-complete problem such as channel routing may be found. The algorithm presented proposes very relaxed restrictions on the types of allowable transformations, including overlapping nets. By freeing that restriction and controlling overlap situations with an appropriate cost function, the algorithm becomes very flexible and can be applied to many extensions of channel routing. The selection of the transformation utilizes a number of heuristics, still retaining the pseudorandom nature of simulated annealing. The algorithm was implemented as a serial program for a workstation, and a parallel program designed for a hypercube computer. The details of the serial implementation are presented, including many of the heuristics used and some of the resulting solutions.
A Parallel Algorithm for the Counting of Ellipses Present in Conglomerates Using GPU
Directory of Open Access Journals (Sweden)
Reyes Yam-Uicab
2018-01-01
Full Text Available Detecting and counting elliptical objects are an interesting problem in digital image processing. There are real-world applications of this problem in various disciplines. Solving this problem is harder when there is occlusion among the elliptical objects, since in general these objects are considered as part of the bigger object (conglomerate. The solution to this problem focusses on the detection and segmentation of the precise number of occluded elliptical objects, while omitting all noninteresting objects. There are a variety of computational approximations that focus on this problem; however, such approximations are not accurate when there is occlusion. This paper presents an algorithm designed to solve this problem, specifically, to detect, segment, and count elliptical objects of a specific size when these are in occlusion with other objects within the conglomerate. Our algorithm deals with a time-consuming combinatorial process. To optimize the execution time of our algorithm, we implemented a parallel GPU version with CUDA-C, which experimentally improved the detection of occluded objects, as well as lowering processing times compared to the sequential version of the method. Comparative test results with another method featured in literature showed improved detection of objects in occlusion when using the proposed parallel method.
Förster, Michael
2014-01-01
Numerical programs often use parallel programming techniques such as OpenMP to compute the program's output values as efficient as possible. In addition, derivative values of these output values with respect to certain input values play a crucial role. To achieve code that computes not only the output values simultaneously but also the derivative values, this work introduces several source-to-source transformation rules. These rules are based on a technique called algorithmic differentiation. The main focus of this work lies on the important reverse mode of algorithmic differentiation. The inh
Li, Kenli; Zou, Shuting; Xv, Jin
2008-01-01
Elliptic curve cryptographic algorithms convert input data to unrecognizable encryption and the unrecognizable data back again into its original decrypted form. The security of this form of encryption hinges on the enormous difficulty that is required to solve the elliptic curve discrete logarithm problem (ECDLP), especially over GF(2(n)), n in Z+. This paper describes an effective method to find solutions to the ECDLP by means of a molecular computer. We propose that this research accomplishment would represent a breakthrough for applied biological computation and this paper demonstrates that in principle this is possible. Three DNA-based algorithms: a parallel adder, a parallel multiplier, and a parallel inverse over GF(2(n)) are described. The biological operation time of all of these algorithms is polynomial with respect to n. Considering this analysis, cryptography using a public key might be less secure. In this respect, a principal contribution of this paper is to provide enhanced evidence of the potential of molecular computing to tackle such ambitious computations.
International Nuclear Information System (INIS)
Stankovski, Z.
1995-01-01
The collision probability method in neutron transport, as applied to 2D geometries, consume a great amount of computer time, for a typical 2D assembly calculation evaluations. Consequently RZ or 3D calculations became prohibitive. In this paper we present a simple but efficient parallel algorithm based on the message passing host/node programing model. Parallelization was applied to the energy group treatment. Such approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, witch is a necessary condition for a industrial code. Sequential performances are also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SP1 and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SP1. Because of heterogeneity of the workstation network, we did ask high performances for this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors. (author). 5 refs., 6 figs., 2 tabs
International Nuclear Information System (INIS)
Stankovski, Z.
1995-01-01
The collision probability method in neutron transport, as applied to 2D geometries, consume a great amount of computer time, for a typical 2D assembly calculation about 90% of the computing time is consumed in the collision probability evaluations. Consequently RZ or 3D calculations became prohibitive. In this paper the author presents a simple but efficient parallel algorithm based on the message passing host/node programmation model. Parallelization was applied to the energy group treatment. Such approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, which is a necessary condition for a industrial code. Sequential performances are also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SPI and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SPI. Because of heterogeneity of the workstation network, the author did not ask high performances for this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors
Parallel Genetic Algorithms for calibrating Cellular Automata models: Application to lava flows
International Nuclear Information System (INIS)
D'Ambrosio, D.; Spataro, W.; Di Gregorio, S.; Calabria Univ., Cosenza; Crisci, G.M.; Rongo, R.; Calabria Univ., Cosenza
2005-01-01
Cellular Automata are highly nonlinear dynamical systems which are suitable far simulating natural phenomena whose behaviour may be specified in terms of local interactions. The Cellular Automata model SCIARA, developed far the simulation of lava flows, demonstrated to be able to reproduce the behaviour of Etnean events. However, in order to apply the model far the prediction of future scenarios, a thorough calibrating phase is required. This work presents the application of Genetic Algorithms, general-purpose search algorithms inspired to natural selection and genetics, far the parameters optimisation of the model SCIARA. Difficulties due to the elevated computational time suggested the adoption a Master-Slave Parallel Genetic Algorithm far the calibration of the model with respect to the 2001 Mt. Etna eruption. Results demonstrated the usefulness of the approach, both in terms of computing time and quality of performed simulations
The parallel algorithm for the 2D discrete wavelet transform
Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel
2018-04-01
The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering a parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently overcome the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor
International Nuclear Information System (INIS)
Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.; Schaefer, H.F. III
1987-01-01
For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. The usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation
DEFF Research Database (Denmark)
Keibler, Evan; Arumugam, Manimozhiyan; Brent, Michael R
2007-01-01
MOTIVATION: Hidden Markov models (HMMs) and generalized HMMs been successfully applied to many problems, but the standard Viterbi algorithm for computing the most probable interpretation of an input sequence (known as decoding) requires memory proportional to the length of the sequence, which can...... be prohibitive. Existing approaches to reducing memory usage either sacrifice optimality or trade increased running time for reduced memory. RESULTS: We developed two novel decoding algorithms, Treeterbi and Parallel Treeterbi, and implemented them in the TWINSCAN/N-SCAN gene-prediction system. The worst case...... asymptotic space and time are the same as for standard Viterbi, but in practice, Treeterbi optimally decodes arbitrarily long sequences with generalized HMMs in bounded memory without increasing running time. Parallel Treeterbi uses the same ideas to split optimal decoding across processors, dividing latency...
A parallel neural network training algorithm for control of discrete dynamical systems.
Energy Technology Data Exchange (ETDEWEB)
Gordillo, J. L.; Hanebutte, U. R.; Vitela, J. E.
1998-01-20
In this work we present a parallel neural network controller training code, that uses MPI, a portable message passing environment. A comprehensive performance analysis is reported which compares results of a performance model with actual measurements. The analysis is made for three different load assignment schemes: block distribution, strip mining and a sliding average bin packing (best-fit) algorithm. Such analysis is crucial since optimal load balance can not be achieved because the work load information is not available a priori. The speedup results obtained with the above schemes are compared with those corresponding to the bin packing load balance scheme with perfect load prediction based on a priori knowledge of the computing effort. Two multiprocessor platforms: a SGI/Cray Origin 2000 and a IBM SP have been utilized for this study. It is shown that for the best load balance scheme a parallel efficiency of over 50% for the entire computation is achieved by 17 processors of either parallel computers.
A parallel algorithm for solving linear equations arising from one-dimensional network problems
International Nuclear Information System (INIS)
Mesina, G.L.
1991-01-01
One-dimensional (1-D) network problems, such as those arising from 1- D fluid simulations and electrical circuitry, produce systems of sparse linear equations which are nearly tridiagonal and contain a few non-zero entries outside the tridiagonal. Most direct solution techniques for such problems either do not take advantage of the special structure of the matrix or do not fully utilize parallel computer architectures. We describe a new parallel direct linear equation solution algorithm, called TRBR, which is especially designed to take advantage of this structure on MIMD shared memory machines. The new method belongs to a family of methods which split the coefficient matrix into the sum of a tridiagonal matrix T and a matrix comprised of the remaining coefficients R. Efficient tridiagonal methods are used to algebraically simplify the linear system. A smaller auxiliary subsystem is created and solved and its solution is used to calculate the solution of the original system. The newly devised BR method solves the subsystem. The serial and parallel operation counts are given for the new method and related earlier methods. TRBR is shown to have the smallest operation count in this class of direct methods. Numerical results are given. Although the algorithm is designed for one-dimensional networks, it has been applied successfully to three-dimensional problems as well. 20 refs., 2 figs., 4 tabs
A New Parallel Approach for Accelerating the GPU-Based Execution of Edge Detection Algorithms.
Emrani, Zahra; Bateni, Soroosh; Rabbani, Hossein
2017-01-01
Real-time image processing is used in a wide variety of applications like those in medical care and industrial processes. This technique in medical care has the ability to display important patient information graphi graphically, which can supplement and help the treatment process. Medical decisions made based on real-time images are more accurate and reliable. According to the recent researches, graphic processing unit (GPU) programming is a useful method for improving the speed and quality of medical image processing and is one of the ways of real-time image processing. Edge detection is an early stage in most of the image processing methods for the extraction of features and object segments from a raw image. The Canny method, Sobel and Prewitt filters, and the Roberts' Cross technique are some examples of edge detection algorithms that are widely used in image processing and machine vision. In this work, these algorithms are implemented using the Compute Unified Device Architecture (CUDA), Open Source Computer Vision (OpenCV), and Matrix Laboratory (MATLAB) platforms. An existing parallel method for Canny approach has been modified further to run in a fully parallel manner. This has been achieved by replacing the breadth- first search procedure with a parallel method. These algorithms have been compared by testing them on a database of optical coherence tomography images. The comparison of results shows that the proposed implementation of the Canny method on GPU using the CUDA platform improves the speed of execution by 2-100× compared to the central processing unit-based implementation using the OpenCV and MATLAB platforms.
Parallel field line and stream line tracing algorithms for space physics applications
Toth, G.; de Zeeuw, D.; Monostori, G.
2004-05-01
Field line and stream line tracing is required in various space physics applications, such as the coupling of the global magnetosphere and inner magnetosphere models, the coupling of the solar energetic particle and heliosphere models, or the modeling of comets, where the multispecies chemical equations are solved along stream lines of a steady state solution obtained with single fluid MHD model. Tracing a vector field is an inherently serial process, which is difficult to parallelize. This is especially true when the data corresponding to the vector field is distributed over a large number of processors. We designed algorithms for the various applications, which scale well to a large number of processors. In the first algorithm the computational domain is divided into blocks. Each block is on a single processor. The algorithm folows the vector field inside the blocks, and calculates a mapping of the block surfaces. The blocks communicate the values at the coinciding surfaces, and the results are interpolated. Finally all block surfaces are defined and values inside the blocks are obtained. In the second algorithm all processors start integrating along the vector field inside the accessible volume. When the field line leaves the local subdomain, the position and other information is stored in a buffer. Periodically the processors exchange the buffers, and continue integration of the field lines until they reach a boundary. At that point the results are sent back to the originating processor. Efficiency is achieved by a careful phasing of computation and communication. In the third algorithm the results of a steady state simulation are stored on a hard drive. The vector field is contained in blocks. All processors read in all the grid and vector field data and the stream lines are integrated in parallel. If a stream line enters a block, which has already been integrated, the results can be interpolated. By a clever ordering of the blocks the execution speed can be
An Efficient MapReduce-Based Parallel Clustering Algorithm for Distributed Traffic Subarea Division
Directory of Open Access Journals (Sweden)
Dawen Xia
2015-01-01
Full Text Available Traffic subarea division is vital for traffic system management and traffic network analysis in intelligent transportation systems (ITSs. Since existing methods may not be suitable for big traffic data processing, this paper presents a MapReduce-based Parallel Three-Phase K-Means (Par3PKM algorithm for solving traffic subarea division problem on a widely adopted Hadoop distributed computing platform. Specifically, we first modify the distance metric and initialization strategy of K-Means and then employ a MapReduce paradigm to redesign the optimized K-Means algorithm for parallel clustering of large-scale taxi trajectories. Moreover, we propose a boundary identifying method to connect the borders of clustering results for each cluster. Finally, we divide traffic subarea of Beijing based on real-world trajectory data sets generated by 12,000 taxis in a period of one month using the proposed approach. Experimental evaluation results indicate that when compared with K-Means, Par2PK-Means, and ParCLARA, Par3PKM achieves higher efficiency, more accuracy, and better scalability and can effectively divide traffic subarea with big taxi trajectory data.
A parallel adaptive mesh refinement algorithm for predicting turbulent non-premixed combusting flows
International Nuclear Information System (INIS)
Gao, X.; Groth, C.P.T.
2005-01-01
A parallel adaptive mesh refinement (AMR) algorithm is proposed for predicting turbulent non-premixed combusting flows characteristic of gas turbine engine combustors. The Favre-averaged Navier-Stokes equations governing mixture and species transport for a reactive mixture of thermally perfect gases in two dimensions, the two transport equations of the κ-ψ turbulence model, and the time-averaged species transport equations, are all solved using a fully coupled finite-volume formulation. A flexible block-based hierarchical data structure is used to maintain the connectivity of the solution blocks in the multi-block mesh and facilitate automatic solution-directed mesh adaptation according to physics-based refinement criteria. This AMR approach allows for anisotropic mesh refinement and the block-based data structure readily permits efficient and scalable implementations of the algorithm on multi-processor architectures. Numerical results for turbulent non-premixed diffusion flames, including cold- and hot-flow predictions for a bluff body burner, are described and compared to available experimental data. The numerical results demonstrate the validity and potential of the parallel AMR approach for predicting complex non-premixed turbulent combusting flows. (author)
Duan, Jizhong; Liu, Yu; Jing, Peiguang
2018-02-01
Self-consistent parallel imaging (SPIRiT) is an auto-calibrating model for the reconstruction of parallel magnetic resonance imaging, which can be formulated as a regularized SPIRiT problem. The Projection Over Convex Sets (POCS) method was used to solve the formulated regularized SPIRiT problem. However, the quality of the reconstructed image still needs to be improved. Though methods such as NonLinear Conjugate Gradients (NLCG) can achieve higher spatial resolution, these methods always demand very complex computation and converge slowly. In this paper, we propose a new algorithm to solve the formulated Cartesian SPIRiT problem with the JTV and JL1 regularization terms. The proposed algorithm uses the operator splitting (OS) technique to decompose the problem into a gradient problem and a denoising problem with two regularization terms, which is solved by our proposed split Bregman based denoising algorithm, and adopts the Barzilai and Borwein method to update step size. Simulation experiments on two in vivo data sets demonstrate that the proposed algorithm is 1.3 times faster than ADMM for datasets with 8 channels. Especially, our proposal is 2 times faster than ADMM for the dataset with 32 channels. Copyright © 2017 Elsevier Inc. All rights reserved.
'Iconic' tracking algorithms for high energy physics using the TRAX-I massively parallel processor
International Nuclear Information System (INIS)
Vesztergombi, G.
1989-01-01
TRAX-I, a cost-effective parallel microcomputer, applying associative string processor (ASP) architecture with 16 K parallel processing elements, is being built by Aspex Microsystems Ltd. (UK). When applied to the tracking problem of very complex events with several hundred tracks, the large number of processors allows one to dedicate one or more processors to each wire (in MWPC), each pixel (in digitized images from streamer chambers or other visual detectors), or each pad (in TPC) to perform very efficient pattern recognition. Some linear tracking algorithms based on this ''ionic'' representation are presented. (orig.)
'Iconic' tracking algorithms for high energy physics using the TRAX-I massively parallel processor
International Nuclear Information System (INIS)
Vestergombi, G.
1989-11-01
TRAX-I, a cost-effective parallel microcomputer, applying Associative String Processor (ASP) architecture with 16 K parallel processing elements, is being built by Aspex Microsystems Ltd. (UK). When applied to the tracking problem of very complex events with several hundred tracks, the large number of processors allows one to dedicate one or more processors to each wire (in MWPC), each pixel (in digitized images from streamer chambers or other visual detectors), or each pad (in TPC) to perform very efficient pattern recognition. Some linear tracking algorithms based on this 'iconic' representation are presented. (orig.)
International Nuclear Information System (INIS)
Chen Jian-Lin; Li Lei; Wang Lin-Yuan; Cai Ai-Long; Xi Xiao-Qi; Zhang Han-Ming; Li Jian-Xin; Yan Bin
2015-01-01
The projection matrix model is used to describe the physical relationship between reconstructed object and projection. Such a model has a strong influence on projection and backprojection, two vital operations in iterative computed tomographic reconstruction. The distance-driven model (DDM) is a state-of-the-art technology that simulates forward and back projections. This model has a low computational complexity and a relatively high spatial resolution; however, it includes only a few methods in a parallel operation with a matched model scheme. This study introduces a fast and parallelizable algorithm to improve the traditional DDM for computing the parallel projection and backprojection operations. Our proposed model has been implemented on a GPU (graphic processing unit) platform and has achieved satisfactory computational efficiency with no approximation. The runtime for the projection and backprojection operations with our model is approximately 4.5 s and 10.5 s per loop, respectively, with an image size of 256×256×256 and 360 projections with a size of 512×512. We compare several general algorithms that have been proposed for maximizing GPU efficiency by using the unmatched projection/backprojection models in a parallel computation. The imaging resolution is not sacrificed and remains accurate during computed tomographic reconstruction. (paper)
Energy Technology Data Exchange (ETDEWEB)
Moryakov, A. V., E-mail: sailor@orc.ru [National Research Centre Kurchatov Institute (Russian Federation)
2016-12-15
An algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations is presented. The algorithm for systems of first-order differential equations is implemented in the EDELWEISS code with the possibility of parallel computations on supercomputers employing the MPI (Message Passing Interface) standard for the data exchange between parallel processes. The solution is represented by a series of orthogonal polynomials on the interval [0, 1]. The algorithm is characterized by simplicity and the possibility to solve nonlinear problems with a correction of the operator in accordance with the solution obtained in the previous iterative process.
International Nuclear Information System (INIS)
Egger, M.L.; Scheurer, A.H.; Joseph, C.
1996-01-01
The issue of long reconstruction times in PET has been addressed from several points of view, resulting in an affordable dedicated system capable of handling routine 3D reconstruction in a few minutes per frame: on the hardware side using fast processors and a parallel architecture, and on the software side, using efficient implementations of computationally less intensive algorithms. Execution times obtained for the PRT-1 data set on a parallel system of five hybrid nodes, each combining an Alpha processor for computation and a transputer for communication, are the following (256 sinograms of 96 views by 128 radial samples): Ramp algorithm 56 s, Favor 81 s and reprojection algorithm of Kinahan and Rogers 187 s. The implementation of fast rebinning algorithms has shown our hardware platform to become communications-limited; they execute faster on a conventional single-processor Alpha workstation: single-slice rebinning 7 s, Fourier rebinning 22 s, 2D filtered backprojection 5 s. The scalability of the system has been demonstrated, and a saturation effect at network sizes above ten nodes has become visible; new T9000-based products lifting most of the constraints on network topology and link throughput are expected to result in improved parallel efficiency and scalability properties
A parallel adaptive finite difference algorithm for petroleum reservoir simulation
Energy Technology Data Exchange (ETDEWEB)
Hoang, Hai Minh
2005-07-01
Adaptive finite differential for problems arising in simulation of flow in porous medium applications are considered. Such methods have been proven useful for overcoming limitations of computational resources and improving the resolution of the numerical solutions to a wide range of problems. By local refinement of the computational mesh where it is needed to improve the accuracy of solutions, yields better solution resolution representing more efficient use of computational resources than is possible with traditional fixed-grid approaches. In this thesis, we propose a parallel adaptive cell-centered finite difference (PAFD) method for black-oil reservoir simulation models. This is an extension of the adaptive mesh refinement (AMR) methodology first developed by Berger and Oliger (1984) for the hyperbolic problem. Our algorithm is fully adaptive in time and space through the use of subcycling, in which finer grids are advanced at smaller time steps than the coarser ones. When coarse and fine grids reach the same advanced time level, they are synchronized to ensure that the global solution is conservative and satisfy the divergence constraint across all levels of refinement. The material in this thesis is subdivided in to three overall parts. First we explain the methodology and intricacies of AFD scheme. Then we extend a finite differential cell-centered approximation discretization to a multilevel hierarchy of refined grids, and finally we are employing the algorithm on parallel computer. The results in this work show that the approach presented is robust, and stable, thus demonstrating the increased solution accuracy due to local refinement and reduced computing resource consumption. (Author)
DEFF Research Database (Denmark)
Cao, Bin; Zhao, Jianwei; Yang, Po
2018-01-01
-objective evolutionary algorithms the Cooperative Coevolutionary Generalized Differential Evolution 3, the Cooperative Multi-objective Differential Evolution and the Nondominated Sorting Genetic Algorithm III, the proposed algorithm addresses the deployment optimization problem efficiently and effectively.......Using immune algorithms is generally a time-intensive process especially for problems with a large number of variables. In this paper, we propose a distributed parallel cooperative coevolutionary multi-objective large-scale immune algorithm that is implemented using the message passing interface...... (MPI). The proposed algorithm is composed of three layers: objective, group and individual layers. First, for each objective in the multi-objective problem to be addressed, a subpopulation is used for optimization, and an archive population is used to optimize all the objectives. Second, the large...
Parallel genetic algorithm as a tool for nuclear reactors reload
International Nuclear Information System (INIS)
Santos, Darley Roberto G.; Schirru, Roberto
1999-01-01
This work intends to present a tool which can be used by designers in order to get better solutions, in terms of computational costs, to solve problems of nuclear reactor reloads. It is known that the project of nuclear fuel reload is a complex combinatorial one. Generally, iterative processes are the most used ones because they generate answers to satisfy all restrictions. The model presented here uses Artificial Intelligence techniques, more precisely Genetic Algorithms techniques, mixed with parallelization techniques.Test of the tool presented here were highly satisfactory, due to a considerable reduction in computational time. (author)
Parallel Framework for Cooperative Processes
Directory of Open Access Journals (Sweden)
Mitică Craus
2005-01-01
Full Text Available This paper describes the work of an object oriented framework designed to be used in the parallelization of a set of related algorithms. The idea behind the system we are describing is to have a re-usable framework for running several sequential algorithms in a parallel environment. The algorithms that the framework can be used with have several things in common: they have to run in cycles and the work should be possible to be split between several "processing units". The parallel framework uses the message-passing communication paradigm and is organized as a master-slave system. Two applications are presented: an Ant Colony Optimization (ACO parallel algorithm for the Travelling Salesman Problem (TSP and an Image Processing (IP parallel algorithm for the Symmetrical Neighborhood Filter (SNF. The implementations of these applications by means of the parallel framework prove to have good performances: approximatively linear speedup and low communication cost.
Energy Technology Data Exchange (ETDEWEB)
Renaut, R.; He, Q. [Arizona State Univ., Tempe, AZ (United States)
1994-12-31
In a new parallel iterative algorithm for unconstrained optimization by multisplitting is proposed. In this algorithm the original problem is split into a set of small optimization subproblems which are solved using well known sequential algorithms. These algorithms are iterative in nature, e.g. DFP variable metric method. Here the authors use sequential algorithms based on an inexact subspace search, which is an extension to the usual idea of an inexact fine search. Essentially the idea of the inexact line search for nonlinear minimization is that at each iteration the authors only find an approximate minimum in the line search direction. Hence by inexact subspace search, they mean that, instead of finding the minimum of the subproblem at each interation, they do an incomplete down hill search to give an approximate minimum. Some convergence and numerical results for this algorithm will be presented. Further, the original theory will be generalized to the situation with a singular Hessian. Applications for nonlinear least squares problems will be presented. Experimental results will be presented for implementations on an Intel iPSC/860 Hypercube with 64 nodes as well as on the Intel Paragon.
Guermond, J. L.; Minev, P. D.
2011-01-01
The purpose of this paper is to validate a new highly parallelizable direction splitting algorithm. The parallelization capabilities of this algorithm are illustrated by providing a highly accurate solution for the start-up flow in a three
International Nuclear Information System (INIS)
Delakis, Ioannis; Hammad, Omer; Kitney, Richard I
2007-01-01
Wavelet-based de-noising has been shown to improve image signal-to-noise ratio in magnetic resonance imaging (MRI) while maintaining spatial resolution. Wavelet-based de-noising techniques typically implemented in MRI require that noise displays uniform spatial distribution. However, images acquired with parallel MRI have spatially varying noise levels. In this work, a new algorithm for filtering images with parallel MRI is presented. The proposed algorithm extracts the edges from the original image and then generates a noise map from the wavelet coefficients at finer scales. The noise map is zeroed at locations where edges have been detected and directional analysis is also used to calculate noise in regions of low-contrast edges that may not have been detected. The new methodology was applied on phantom and brain images and compared with other applicable de-noising techniques. The performance of the proposed algorithm was shown to be comparable with other techniques in central areas of the images, where noise levels are high. In addition, finer details and edges were maintained in peripheral areas, where noise levels are low. The proposed methodology is fully automated and can be applied on final reconstructed images without requiring sensitivity profiles or noise matrices of the receiver coils, therefore making it suitable for implementation in a clinical MRI setting
Evaluating parallel optimization on transputers
Directory of Open Access Journals (Sweden)
A.G. Chalmers
2003-12-01
Full Text Available The faster processing power of modern computers and the development of efficient algorithms have made it possible for operations researchers to tackle a much wider range of problems than ever before. Further improvements in processing speed can be achieved utilising relatively inexpensive transputers to process components of an algorithm in parallel. The Davidon-Fletcher-Powell method is one of the most successful and widely used optimisation algorithms for unconstrained problems. This paper examines the algorithm and identifies the components that can be processed in parallel. The results of some experiments with these components are presented which indicates under what conditions parallel processing with an inexpensive configuration is likely to be faster than the traditional sequential implementations. The performance of the whole algorithm with its parallel components is then compared with the original sequential algorithm. The implementation serves to illustrate the practicalities of speeding up typical OR algorithms in terms of difficulty, effort and cost. The results give an indication of the savings in time a given parallel implementation can be expected to yield.
Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy.
Tian, Yuling; Zhang, Hongxian
2016-01-01
For the purposes of information retrieval, users must find highly relevant documents from within a system (and often a quite large one comprised of many individual documents) based on input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor, and a hot research topic-there already exist several rank-learning methods based on machine learning techniques which can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others in respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm indeed effectively and rapidly identifies optimal ranking functions.
An efficient parallel algorithm for the calculation of unrestricted canonical MP2 energies.
Baker, Jon; Wolinski, Krzysztof
2011-11-30
We present details of our efficient implementation of full accuracy unrestricted open-shell second-order canonical Møller-Plesset (MP2) energies, both serial and parallel. The algorithm is based on our previous restricted closed-shell MP2 code using the Saebo-Almlöf direct integral transformation. Depending on system details, UMP2 energies take from less than 1.5 to about 3.0 times as long as a closed-shell RMP2 energy on a similar system using the same algorithm. Several examples are given including timings for some large stable radicals with 90+ atoms and over 3600 basis functions. Copyright © 2011 Wiley Periodicals, Inc.
Keibler, Evan; Arumugam, Manimozhiyan; Brent, Michael R
2007-03-01
Hidden Markov models (HMMs) and generalized HMMs been successfully applied to many problems, but the standard Viterbi algorithm for computing the most probable interpretation of an input sequence (known as decoding) requires memory proportional to the length of the sequence, which can be prohibitive. Existing approaches to reducing memory usage either sacrifice optimality or trade increased running time for reduced memory. We developed two novel decoding algorithms, Treeterbi and Parallel Treeterbi, and implemented them in the TWINSCAN/N-SCAN gene-prediction system. The worst case asymptotic space and time are the same as for standard Viterbi, but in practice, Treeterbi optimally decodes arbitrarily long sequences with generalized HMMs in bounded memory without increasing running time. Parallel Treeterbi uses the same ideas to split optimal decoding across processors, dividing latency to completion by approximately the number of available processors with constant average overhead per processor. Using these algorithms, we were able to optimally decode all human chromosomes with N-SCAN, which increased its accuracy relative to heuristic solutions. We also implemented Treeterbi for Pairagon, our pair HMM based cDNA-to-genome aligner. The TWINSCAN/N-SCAN/PAIRAGON open source software package is available from http://genes.cse.wustl.edu.
The STAPL Parallel Graph Library
Harshvardhan,
2013-01-01
This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges. © Springer-Verlag Berlin Heidelberg 2013.
Shrimankar, D D; Sathe, S R
2016-01-01
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures.
Shrimankar, D. D.; Sathe, S. R.
2016-01-01
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868
Pronk, Sander; Pouya, Iman; Lundborg, Magnus; Rotskoff, Grant; Wesén, Björn; Kasson, Peter M; Lindahl, Erik
2015-06-09
Computational chemistry and other simulation fields are critically dependent on computing resources, but few problems scale efficiently to the hundreds of thousands of processors available in current supercomputers-particularly for molecular dynamics. This has turned into a bottleneck as new hardware generations primarily provide more processing units rather than making individual units much faster, which simulation applications are addressing by increasingly focusing on sampling with algorithms such as free-energy perturbation, Markov state modeling, metadynamics, or milestoning. All these rely on combining results from multiple simulations into a single observation. They are potentially powerful approaches that aim to predict experimental observables directly, but this comes at the expense of added complexity in selecting sampling strategies and keeping track of dozens to thousands of simulations and their dependencies. Here, we describe how the distributed execution framework Copernicus allows the expression of such algorithms in generic workflows: dataflow programs. Because dataflow algorithms explicitly state dependencies of each constituent part, algorithms only need to be described on conceptual level, after which the execution is maximally parallel. The fully automated execution facilitates the optimization of these algorithms with adaptive sampling, where undersampled regions are automatically detected and targeted without user intervention. We show how several such algorithms can be formulated for computational chemistry problems, and how they are executed efficiently with many loosely coupled simulations using either distributed or parallel resources with Copernicus.
Directory of Open Access Journals (Sweden)
Ravil’ Kudermetov
2018-02-01
Full Text Available Nowadays multi-core processors are installed almost in each modern workstation, but the question of these computational resources effective utilization is still a topical one. In this paper the four-point block one-step integration method is considered, the parallel algorithm of this method is proposed and the Java programmatic implementation of this algorithm is discussed. The effectiveness of the proposed algorithm is demonstrated by way of spacecraft attitude motion simulation. The results of this work can be used for practical simulation of dynamic systems that are described by ordinary differential equations. The results are also applicable to the development and debugging of computer programs that integrate the dynamic and kinematic equations of the angular motion of a rigid body.
Parallel integer sorting with medium and fine-scale parallelism
Dagum, Leonardo
1993-01-01
Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.
Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.
Bhandarkar, S M; Chirravuri, S; Arnold, J
1996-01-01
Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.
Czech Academy of Sciences Publication Activity Database
Cullum, J. K.; Johnson, K.; Tůma, Miroslav
2003-01-01
Roč. 10, - (2003), s. 445-465 ISSN 1070-5325 R&D Projects: GA ČR GA201/02/0595; GA AV ČR IAA1030103 Institutional research plan: CEZ:AV0Z1030915 Keywords : parallel algorithms * graph partitioning * problem decomposition * rate of convergence Subject RIV: BA - General Mathematics Impact factor: 1.042, year: 2003
International Nuclear Information System (INIS)
Hwang, F-N; Wei, Z-H; Huang, T-M; Wang Weichung
2010-01-01
We develop a parallel Jacobi-Davidson approach for finding a partial set of eigenpairs of large sparse polynomial eigenvalue problems with application in quantum dot simulation. A Jacobi-Davidson eigenvalue solver is implemented based on the Portable, Extensible Toolkit for Scientific Computation (PETSc). The eigensolver thus inherits PETSc's efficient and various parallel operations, linear solvers, preconditioning schemes, and easy usages. The parallel eigenvalue solver is then used to solve higher degree polynomial eigenvalue problems arising in numerical simulations of three dimensional quantum dots governed by Schroedinger's equations. We find that the parallel restricted additive Schwarz preconditioner in conjunction with a parallel Krylov subspace method (e.g. GMRES) can solve the correction equations, the most costly step in the Jacobi-Davidson algorithm, very efficiently in parallel. Besides, the overall performance is quite satisfactory. We have observed near perfect superlinear speedup by using up to 320 processors. The parallel eigensolver can find all target interior eigenpairs of a quintic polynomial eigenvalue problem with more than 32 million variables within 12 minutes by using 272 Intel 3.0 GHz processors.
The parallel processing impact in the optimization of the reactors neutronic by genetic algorithms
International Nuclear Information System (INIS)
Pereira, Claudio M.N.A.; Universidade Federal, Rio de Janeiro, RJ; Lapa, Celso M.F.; Mol, Antonio C.A.
2002-01-01
Nowadays, many optimization problems found in nuclear engineering has been solved through genetic algorithms (GA). The robustness of such methods is strongly related to the nature of search process which is based on populations of solution candidates, and this fact implies high computational cost in the optimization process. The use of GA become more critical when the evaluation process of a solution candidate is highly time consuming. Problems of this nature are common in the nuclear engineering, and an example is the reactor design optimization, where neutronic codes, which consume high CPU time, must be run. Aiming to investigate the impact of the use of parallel computation in the solution, through GA, of a reactor design optimization problem, a parallel genetic algorithm (PGA), using the Island Model, was developed. Exhaustive experiments, then 1500 processing hours in 550 MHz personal computers, have been done, in order to compare the conventional GA with the PGA. Such experiments have demonstrating the superiority of the PGA not only in terms of execution time, but also, in the optimization results. (author)
Parallelism in matrix computations
Gallopoulos, Efstratios; Sameh, Ahmed H
2016-01-01
This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded ,Vandermonde ,Toeplitz ,and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-07-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.
International Nuclear Information System (INIS)
Lima, Alan M.M. de; Schirru, Roberto
2000-01-01
Genetic algorithms are biologically motivated adaptive systems which have been used, with good results, for function optimization. The purpose of this work is to introduce a new parallelization method to be applied to the Population-Based Incremental Learning (PBIL) algorithm. PBIL combines standard genetic algorithm mechanisms with simple competitive learning and has ben successfully used in combinatorial optimization problems. The development of this algorithm aims its application to the reload optimization of PWR nuclear reactors. Tests have been performed with combinatorial optimization problems similar to the reload problem. Results are compared to the serial PBIL ones, showing the new method's superiority and its viability as a tool for the nuclear core reload problem solution. (author)
Crockett, Thomas W.
1995-01-01
This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
Cloud identification using genetic algorithms and massively parallel computation
Buckles, Bill P.; Petry, Frederick E.
1996-01-01
As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkedly consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMS) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
Zheng, Yan
2015-03-01
Internet of things (IoT), focusing on providing users with information exchange and intelligent control, attracts a lot of attention of researchers from all over the world since the beginning of this century. IoT is consisted of large scale of sensor nodes and data processing units, and the most important features of IoT can be illustrated as energy confinement, efficient communication and high redundancy. With the sensor nodes increment, the communication efficiency and the available communication band width become bottle necks. Many research work is based on the instance which the number of joins is less. However, it is not proper to the increasing multi-join query in whole internet of things. To improve the communication efficiency between parallel units in the distributed sensor network, this paper proposed parallel query optimization algorithm based on distribution attributes cost graph. The storage information relations and the network communication cost are considered in this algorithm, and an optimized information changing rule is established. The experimental result shows that the algorithm has good performance, and it would effectively use the resource of each node in the distributed sensor network. Therefore, executive efficiency of multi-join query between different nodes could be improved.
Energy Technology Data Exchange (ETDEWEB)
Chang, Jonghwa [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)
2014-05-15
Parallelization of Monte Carlo simulation is widely adpoted. There are also several parallel algorithms developed for the SN transport theory using the parallel wave sweeping algorithm and for the CPM using parallel ray tracing. For practical purpose of reactor physics application, the thermal feedback and burnup effects on the multigroup cross section should be considered. In this respect, the domain decomposition method(DDM) is suitable for distributing the expensive cross section calculation work. Parallel transport code and diffusion code based on the Raviart-Thomas mixed finite element method was developed. However most of the developed methods rely on the heuristic convergence of flux and current at the domain interfaces. Convergence was not attained in some cases. Mechanical stress computation community has also work on the DDM to solve the stress-strain equation using the finite element methods. The most successful domain decomposition method in terms of robustness is FETI-DP. We have modified the original FETI-DP to solve the eigenvalue problem for the multigroup diffusion problem in this study.
Quantum and classical parallelism in parity algorithms for ensemble quantum computers
International Nuclear Information System (INIS)
Stadelhofer, Ralf; Suter, Dieter; Banzhaf, Wolfgang
2005-01-01
The determination of the parity of a string of N binary digits is a well-known problem in classical as well as quantum information processing, which can be formulated as an oracle problem. It has been established that quantum algorithms require at least N/2 oracle calls. We present an algorithm that reaches this lower bound and is also optimal in terms of additional gate operations required. We discuss its application to pure and mixed states. Since it can be applied directly to thermal states, it does not suffer from signal loss associated with pseudo-pure-state preparation. For ensemble quantum computers, the number of oracle calls can be further reduced by a factor 2 k , with k is a member of {{1,2,...,log 2 (N/2}}, provided the signal-to-noise ratio is sufficiently high. This additional speed-up is linked to (classical) parallelism of the ensemble quantum computer. Experimental realizations are demonstrated on a liquid-state NMR quantum computer
A parallel adaptive quantum genetic algorithm for the controllability of arbitrary networks
Li, Yuhong
2018-01-01
In this paper, we propose a novel algorithm—parallel adaptive quantum genetic algorithm—which can rapidly determine the minimum control nodes of arbitrary networks with both control nodes and state nodes. The corresponding network can be fully controlled with the obtained control scheme. We transformed the network controllability issue into a combinational optimization problem based on the Popov-Belevitch-Hautus rank condition. A set of canonical networks and a list of real-world networks were experimented. Comparison results demonstrated that the algorithm was more ideal to optimize the controllability of networks, especially those larger-size networks. We demonstrated subsequently that there were links between the optimal control nodes and some network statistical characteristics. The proposed algorithm provides an effective approach to improve the controllability optimization of large networks or even extra-large networks with hundreds of thousands nodes. PMID:29554140
International Nuclear Information System (INIS)
Pereira, Claudio M.N.A.; Lapa, Celso M.F.
2003-01-01
In this work, we focus the application of an Island Genetic Algorithm (IGA), a coarse-grained parallel genetic algorithm (PGA) model, to a Nuclear Power Plant (NPP) Auxiliary Feedwater System (AFWS) surveillance tests policy optimization. Here, the main objective is to outline, by means of comparisons, the advantages of the IGA over the simple (non-parallel) genetic algorithm (GA), which has been successfully applied in the solution of such kind of problem. The goal of the optimization is to maximize the system's average availability for a given period of time, considering realistic features such as: i) aging effects on standby components during the tests; ii) revealing failures in the tests implies on corrective maintenance, increasing outage times; iii) components have distinct test parameters (outage time, aging factors, etc.) and iv) tests are not necessarily periodic. In our experiments, which were made in a cluster comprised by 8 1-GHz personal computers, we could clearly observe gains not only in the computational time, which reduced linearly with the number of computers, but in the optimization outcome
Parallel S/sub n/ iteration schemes
International Nuclear Information System (INIS)
Wienke, B.R.; Hiromoto, R.E.
1986-01-01
The iterative, multigroup, discrete ordinates (S/sub n/) technique for solving the linear transport equation enjoys widespread usage and appeal. Serial iteration schemes and numerical algorithms developed over the years provide a timely framework for parallel extension. On the Denelcor HEP, the authors investigate three parallel iteration schemes for solving the one-dimensional S/sub n/ transport equation. The multigroup representation and serial iteration methods are also reviewed. This analysis represents a first attempt to extend serial S/sub n/ algorithms to parallel environments and provides good baseline estimates on ease of parallel implementation, relative algorithm efficiency, comparative speedup, and some future directions. The authors examine ordered and chaotic versions of these strategies, with and without concurrent rebalance and diffusion acceleration. Two strategies efficiently support high degrees of parallelization and appear to be robust parallel iteration techniques. The third strategy is a weaker parallel algorithm. Chaotic iteration, difficult to simulate on serial machines, holds promise and converges faster than ordered versions of the schemes. Actual parallel speedup and efficiency are high and payoff appears substantial
Guermond, J. L.
2011-05-04
The purpose of this paper is to validate a new highly parallelizable direction splitting algorithm. The parallelization capabilities of this algorithm are illustrated by providing a highly accurate solution for the start-up flow in a three-dimensional impulsively started lid-driven cavity of aspect ratio 1×1×2 at Reynolds numbers 1000 and 5000. The computations are done in parallel (up to 1024 processors) on adapted grids of up to 2 billion nodes in three space dimensions. Velocity profiles are given at dimensionless times t=4, 8, and 12; at least four digits are expected to be correct at Re=1000. © 2011 John Wiley & Sons, Ltd.
Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements
International Nuclear Information System (INIS)
Azmy, Y.Y.; Barnett, D.A.
1999-01-01
The existing parallel algorithms in the TORT discrete ordinates were updated to function in a UNI-COS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead
Multitasking TORT under UNICOS: Parallel performance models and measurements
International Nuclear Information System (INIS)
Barnett, A.; Azmy, Y.Y.
1999-01-01
The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead
Energy Technology Data Exchange (ETDEWEB)
2017-04-04
A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
An Implementation and Parallelization of the Scale Space Meshing Algorithm
Directory of Open Access Journals (Sweden)
Julie Digne
2015-11-01
Full Text Available Creating an interpolating mesh from an unorganized set of oriented points is a difficult problemwhich is often overlooked. Most methods focus indeed on building a watertight smoothed meshby defining some function whose zero level set is the surface of the object. However in some casesit is crucial to build a mesh that interpolates the points and does not fill the acquisition holes:either because the data are sparse and trying to fill the holes would create spurious artifactsor because the goal is to explore visually the data exactly as they were acquired without anysmoothing process. In this paper we detail a parallel implementation of the Scale-Space Meshingalgorithm, which builds on the scale-space framework for reconstructing a high precision meshfrom an input oriented point set. This algorithm first smoothes the point set, producing asingularity free shape. It then uses a standard mesh reconstruction technique, the Ball PivotingAlgorithm, to build a mesh from the smoothed point set. The final step consists in back-projecting the mesh built on the smoothed positions onto the original point set. The result ofthis process is an interpolating, hole-preserving surface mesh reconstruction.
Automatic Parallelization Tool: Classification of Program Code for Parallel Computing
Directory of Open Access Journals (Sweden)
Mustafa Basthikodi
2016-04-01
Full Text Available Performance growth of single-core processors has come to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks along with Graphical Processing Units empowered to enhance parallelism broadly. Couples of compilers are updated to developing challenges forsynchronization and threading issues. Appropriate program and algorithm classifications will have advantage to a great extent to the group of software engineers to get opportunities for effective parallelization. In present work we investigated current species for classification of algorithms, in that related work on classification is discussed along with the comparison of issues that challenges the classification. The set of algorithms are chosen which matches the structure with different issues and perform given task. We have tested these algorithms utilizing existing automatic species extraction toolsalong with Bones compiler. We have added functionalities to existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by original speciesof algorithms. We executed new theories into the device, empowering automatic characterization of program code.
International Nuclear Information System (INIS)
Zerr, R.J.; Azmy, Y.Y.
2010-01-01
A spatial domain decomposition with a parallel block Jacobi solution algorithm has been developed based on the integral transport matrix formulation of the discrete ordinates approximation for solving the within-group transport equation. The new methodology abandons the typical source iteration scheme and solves directly for the fully converged scalar flux. Four matrix operators are constructed based upon the integral form of the discrete ordinates equations. A single differential mesh sweep is performed to construct these operators. The method is parallelized by decomposing the problem domain into several smaller sub-domains, each treated as an independent problem. The scalar flux of each sub-domain is solved exactly given incoming angular flux boundary conditions. Sub-domain boundary conditions are updated iteratively, and convergence is achieved when the scalar flux error in all cells meets a pre-specified convergence criterion. The method has been implemented in a computer code that was then employed for strong scaling studies of the algorithm's parallel performance via a fixed-size problem in tests ranging from one domain up to one cell per sub-domain. Results indicate that the best parallel performance compared to source iterations occurs for optically thick, highly scattering problems, the variety that is most difficult for the traditional SI scheme to solve. Moreover, the minimum execution time occurs when each sub-domain contains a total of four cells. (authors)
Parallel multiphysics algorithms and software for computational nuclear engineering
International Nuclear Information System (INIS)
Gaston, D; Hansen, G; Kadioglu, S; Knoll, D A; Newman, C; Park, H; Permann, C; Taitano, W
2009-01-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.
Barth, Timothy J.; Chan, Tony F.; Tang, Wei-Pai
1998-01-01
This paper considers an algebraic preconditioning algorithm for hyperbolic-elliptic fluid flow problems. The algorithm is based on a parallel non-overlapping Schur complement domain-decomposition technique for triangulated domains. In the Schur complement technique, the triangulation is first partitioned into a number of non-overlapping subdomains and interfaces. This suggests a reordering of triangulation vertices which separates subdomain and interface solution unknowns. The reordering induces a natural 2 x 2 block partitioning of the discretization matrix. Exact LU factorization of this block system yields a Schur complement matrix which couples subdomains and the interface together. The remaining sections of this paper present a family of approximate techniques for both constructing and applying the Schur complement as a domain-decomposition preconditioner. The approximate Schur complement serves as an algebraic coarse space operator, thus avoiding the known difficulties associated with the direct formation of a coarse space discretization. In developing Schur complement approximations, particular attention has been given to improving sequential and parallel efficiency of implementations without significantly degrading the quality of the preconditioner. A computer code based on these developments has been tested on the IBM SP2 using MPI message passing protocol. A number of 2-D calculations are presented for both scalar advection-diffusion equations as well as the Euler equations governing compressible fluid flow to demonstrate performance of the preconditioning algorithm.
Energy Technology Data Exchange (ETDEWEB)
Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarria-Miranda, Daniel
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centralityof massive small-world networks. With minor changes to the data structures, ouralgorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
Energy Technology Data Exchange (ETDEWEB)
Loring, Burlen; Karimabadi, Homa; Rortershteyn, Vadim
2014-07-01
The surface line integral convolution(LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
Massively Parallel Finite Element Programming
Heister, Timo; Kronbichler, Martin; Bangerth, Wolfgang
2010-01-01
Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
Massively Parallel Finite Element Programming
Heister, Timo
2010-01-01
Today\\'s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
A parallel approach to the stable marriage problem
DEFF Research Database (Denmark)
Larsen, Jesper
1997-01-01
This paper describes two parallel algorithms for the stable marriage problem implemented on a MIMD parallel computer. The algorithms are tested against sequential algorithms on randomly generated and worst-case instances. The results clearly show that the combination fo a very simple problem...... and a commercial MIMD system results in parallel algorithms which are not competitive with sequential algorithms wrt. practical performance. 1 Introduction In 1962 the Stable Marriage Problem was....
DEFF Research Database (Denmark)
Bilardi, Gianfranco; Pietracaprina, Andrea; Pucci, Geppino
2016-01-01
A framework is proposed for the design and analysis of network-oblivious algorithms, namely algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network......-oblivious algorithm be specified on a parallel model of computation where the only parameter is the problem’s input size, and then evaluated on a model with two parameters, capturing parallelism granularity and communication latency. It is shown that for a wide class of network-oblivious algorithms, optimality...... of cache hierarchies, to the realm of parallel computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms for a number of key problems. Some limitations of the oblivious approach are also discussed....
Parallel hierarchical radiosity rendering
Energy Technology Data Exchange (ETDEWEB)
Carter, Michael [Iowa State Univ., Ames, IA (United States)
1993-07-01
In this dissertation, the step-by-step development of a scalable parallel hierarchical radiosity renderer is documented. First, a new look is taken at the traditional radiosity equation, and a new form is presented in which the matrix of linear system coefficients is transformed into a symmetric matrix, thereby simplifying the problem and enabling a new solution technique to be applied. Next, the state-of-the-art hierarchical radiosity methods are examined for their suitability to parallel implementation, and scalability. Significant enhancements are also discovered which both improve their theoretical foundations and improve the images they generate. The resultant hierarchical radiosity algorithm is then examined for sources of parallelism, and for an architectural mapping. Several architectural mappings are discussed. A few key algorithmic changes are suggested during the process of making the algorithm parallel. Next, the performance, efficiency, and scalability of the algorithm are analyzed. The dissertation closes with a discussion of several ideas which have the potential to further enhance the hierarchical radiosity method, or provide an entirely new forum for the application of hierarchical methods.
International Nuclear Information System (INIS)
McGhee, J.M.; Roberts, R.M.; Morel, J.E.
1997-01-01
A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner for scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated
Directory of Open Access Journals (Sweden)
Yu-Huei Cheng
2017-11-01
Full Text Available The control strategy is a major unit in hybrid electric vehicles (HEVs. In order to provide suitable control parameters for reducing fuel consumptions and engine emissions while maintaining vehicle performance requirements, the genetic algorithm (GA with small population size is applied to search for feasible control parameters in parallel HEVs. The electric assist control strategy (EACS is used as the fundamental control strategy of parallel HEVs. The dynamic performance requirements stipulated in the Partnership for a New Generation of Vehicles (PNGV is considered to maintain the vehicle performance. The known ADvanced VehIcle SimulatOR (ADVISOR is used to simulate a specific parallel HEV with urban dynamometer driving schedule (UDDS. Five population sets with size 5, 10, 15, 20, and 25 are used in the GA. The experimental results show that the GA with population size of 25 is the best for selecting feasible control parameters in parallel HEVs.
A Parallel Approach to Fractal Image Compression
Lubomir Dedera
2004-01-01
The paper deals with a parallel approach to coding and decoding algorithms in fractal image compressionand presents experimental results comparing sequential and parallel algorithms from the point of view of achieved bothcoding and decoding time and effectiveness of parallelization.
Ozmutlu, H. Cenk
2014-01-01
We developed mixed integer programming (MIP) models and hybrid genetic-local search algorithms for the scheduling problem of unrelated parallel machines with job sequence and machine-dependent setup times and with job splitting property. The first contribution of this paper is to introduce novel algorithms which make splitting and scheduling simultaneously with variable number of subjobs. We proposed simple chromosome structure which is constituted by random key numbers in hybrid genetic-local search algorithm (GAspLA). Random key numbers are used frequently in genetic algorithms, but it creates additional difficulty when hybrid factors in local search are implemented. We developed algorithms that satisfy the adaptation of results of local search into the genetic algorithms with minimum relocation operation of genes' random key numbers. This is the second contribution of the paper. The third contribution of this paper is three developed new MIP models which are making splitting and scheduling simultaneously. The fourth contribution of this paper is implementation of the GAspLAMIP. This implementation let us verify the optimality of GAspLA for the studied combinations. The proposed methods are tested on a set of problems taken from the literature and the results validate the effectiveness of the proposed algorithms. PMID:24977204
Eroglu, Duygu Yilmaz; Ozmutlu, H Cenk
2014-01-01
We developed mixed integer programming (MIP) models and hybrid genetic-local search algorithms for the scheduling problem of unrelated parallel machines with job sequence and machine-dependent setup times and with job splitting property. The first contribution of this paper is to introduce novel algorithms which make splitting and scheduling simultaneously with variable number of subjobs. We proposed simple chromosome structure which is constituted by random key numbers in hybrid genetic-local search algorithm (GAspLA). Random key numbers are used frequently in genetic algorithms, but it creates additional difficulty when hybrid factors in local search are implemented. We developed algorithms that satisfy the adaptation of results of local search into the genetic algorithms with minimum relocation operation of genes' random key numbers. This is the second contribution of the paper. The third contribution of this paper is three developed new MIP models which are making splitting and scheduling simultaneously. The fourth contribution of this paper is implementation of the GAspLAMIP. This implementation let us verify the optimality of GAspLA for the studied combinations. The proposed methods are tested on a set of problems taken from the literature and the results validate the effectiveness of the proposed algorithms.
Angular parallelization of a curvilinear Sn transport theory method
International Nuclear Information System (INIS)
Haghighat, A.
1991-01-01
In this paper a parallel algorithm for angular domain decomposition (or parallelization) of an r-dependent spherical S n transport theory method is derived. The parallel formulation is incorporated into TWOTRAN-II using the IBM Parallel Fortran compiler and implemented on an IBM 3090/400 (with four processors). The behavior of the parallel algorithm for different physical problems is studied, and it is concluded that the parallel algorithm behaves differently in the presence of a fission source as opposed to the absence of a fission source; this is attributed to the relative contributions of the source and the angular redistribution terms in the S s algorithm. Further, the parallel performance of the algorithm is measured for various problem sizes and different combinations of angular subdomains or processors. Poor parallel efficiencies between ∼35 and 50% are achieved in situations where the relative difference of parallel to serial iterations is ∼50%. High parallel efficiencies between ∼60% and 90% are obtained in situations where the relative difference of parallel to serial iterations is <35%
SWAMP+: multiple subsequence alignment using associative massive parallelism
Energy Technology Data Exchange (ETDEWEB)
Steinfadt, Shannon Irene [Los Alamos National Laboratory; Baker, Johnnie W [KENT STATE UNIV.
2010-10-18
A new parallel algorithm SWAMP+ incorporates the Smith-Waterman sequence alignment on an associative parallel model known as ASC. It is a highly sensitive parallel approach that expands traditional pairwise sequence alignment. This is the first parallel algorithm to provide multiple non-overlapping, non-intersecting subsequence alignments with the accuracy of Smith-Waterman. The efficient algorithm provides multiple alignments similar to BLAST while creating a better workflow for the end users. The parallel portions of the code run in O(m+n) time using m processors. When m = n, the algorithmic analysis becomes O(n) with a coefficient of two, yielding a linear speedup. Implementation of the algorithm on the SIMD ClearSpeed CSX620 confirms this theoretical linear speedup with real timings.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
Sun, Xian-He
1997-01-01
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm
Massively parallel mathematical sieves
Energy Technology Data Exchange (ETDEWEB)
Montry, G.R.
1989-01-01
The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
A Parallel Approach to Fractal Image Compression
Directory of Open Access Journals (Sweden)
Lubomir Dedera
2004-01-01
Full Text Available The paper deals with a parallel approach to coding and decoding algorithms in fractal image compressionand presents experimental results comparing sequential and parallel algorithms from the point of view of achieved bothcoding and decoding time and effectiveness of parallelization.
Parallel discrete ordinates algorithms on distributed and common memory systems
International Nuclear Information System (INIS)
Wienke, B.R.; Hiromoto, R.E.; Brickner, R.G.
1987-01-01
The S/sub n/ algorithm employs iterative techniques in solving the linear Boltzmann equation. These methods, both ordered and chaotic, were compared on both the Denelcor HEP and the Intel hypercube. Strategies are linked to the organization and accessibility of memory (common memory versus distributed memory architectures), with common concern for acquisition of global information. Apart from this, the inherent parallelism of the algorithm maps directly onto the two architectures. Results comparing execution times, speedup, and efficiency are based on a representative 16-group (full upscatter and downscatter) sample problem. Calculations were performed on both the Los Alamos National Laboratory (LANL) Denelcor HEP and the LANL Intel hypercube. The Denelcor HEP is a 64-bit multi-instruction, multidate MIMD machine consisting of up to 16 process execution modules (PEMs), each capable of executing 64 processes concurrently. Each PEM can cooperate on a job, or run several unrelated jobs, and share a common global memory through a crossbar switch. The Intel hypercube, on the other hand, is a distributed memory system composed of 128 processing elements, each with its own local memory. Processing elements are connected in a nearest-neighbor hypercube configuration and sharing of data among processors requires execution of explicit message-passing constructs
Parallel thermal radiation transport in two dimensions
International Nuclear Information System (INIS)
Smedley-Stevenson, R.P.; Ball, S.R.
2003-01-01
This paper describes the distributed memory parallel implementation of a deterministic thermal radiation transport algorithm in a 2-dimensional ALE hydrodynamics code. The parallel algorithm consists of a variety of components which are combined in order to produce a state of the art computational capability, capable of solving large thermal radiation transport problems using Blue-Oak, the 3 Tera-Flop MPP (massive parallel processors) computing facility at AWE (United Kingdom). Particular aspects of the parallel algorithm are described together with examples of the performance on some challenging applications. (author)
Parallel thermal radiation transport in two dimensions
Energy Technology Data Exchange (ETDEWEB)
Smedley-Stevenson, R.P.; Ball, S.R. [AWE Aldermaston (United Kingdom)
2003-07-01
This paper describes the distributed memory parallel implementation of a deterministic thermal radiation transport algorithm in a 2-dimensional ALE hydrodynamics code. The parallel algorithm consists of a variety of components which are combined in order to produce a state of the art computational capability, capable of solving large thermal radiation transport problems using Blue-Oak, the 3 Tera-Flop MPP (massive parallel processors) computing facility at AWE (United Kingdom). Particular aspects of the parallel algorithm are described together with examples of the performance on some challenging applications. (author)
Parallelizing the spectral transform method: A comparison of alternative parallel algorithms
International Nuclear Information System (INIS)
Foster, I.; Worley, P.H.
1993-01-01
The spectral transform method is a standard numerical technique for solving partial differential equations on the sphere and is widely used in global climate modeling. In this paper, we outline different approaches to parallelizing the method and describe experiments that we are conducting to evaluate the efficiency of these approaches on parallel computers. The experiments are conducted using a testbed code that solves the nonlinear shallow water equations on a sphere, but are designed to permit evaluation in the context of a global model. They allow us to evaluate the relative merits of the approaches as a function of problem size and number of processors. The results of this study are guiding ongoing work on PCCM2, a parallel implementation of the Community Climate Model developed at the National Center for Atmospheric Research
Redundancy allocation of series-parallel systems using a variable neighborhood search algorithm
International Nuclear Information System (INIS)
Liang, Y.-C.; Chen, Y.-C.
2007-01-01
This paper presents a meta-heuristic algorithm, variable neighborhood search (VNS), to the redundancy allocation problem (RAP). The RAP, an NP-hard problem, has attracted the attention of much prior research, generally in a restricted form where each subsystem must consist of identical components. The newer meta-heuristic methods overcome this limitation and offer a practical way to solve large instances of the relaxed RAP where different components can be used in parallel. Authors' previously published work has shown promise for the variable neighborhood descent (VND) method, the simplest version among VNS variations, on RAP. The variable neighborhood search method itself has not been used in reliability design, yet it is a method that fits those combinatorial problems with potential neighborhood structures, as in the case of the RAP. Therefore, authors further extended their work to develop a VNS algorithm for the RAP and tested a set of well-known benchmark problems from the literature. Results on 33 test instances ranging from less to severely constrained conditions show that the variable neighborhood search method improves the performance of VND and provides a competitive solution quality at economically computational expense in comparison with the best-known heuristics including ant colony optimization, genetic algorithm, and tabu search
Redundancy allocation of series-parallel systems using a variable neighborhood search algorithm
Energy Technology Data Exchange (ETDEWEB)
Liang, Y.-C. [Department of Industrial Engineering and Management, Yuan Ze University, No 135 Yuan-Tung Road, Chung-Li, Taoyuan County, Taiwan 320 (China)]. E-mail: ycliang@saturn.yzu.edu.tw; Chen, Y.-C. [Department of Industrial Engineering and Management, Yuan Ze University, No 135 Yuan-Tung Road, Chung-Li, Taoyuan County, Taiwan 320 (China)]. E-mail: s927523@mail.yzu.edu.tw
2007-03-15
This paper presents a meta-heuristic algorithm, variable neighborhood search (VNS), to the redundancy allocation problem (RAP). The RAP, an NP-hard problem, has attracted the attention of much prior research, generally in a restricted form where each subsystem must consist of identical components. The newer meta-heuristic methods overcome this limitation and offer a practical way to solve large instances of the relaxed RAP where different components can be used in parallel. Authors' previously published work has shown promise for the variable neighborhood descent (VND) method, the simplest version among VNS variations, on RAP. The variable neighborhood search method itself has not been used in reliability design, yet it is a method that fits those combinatorial problems with potential neighborhood structures, as in the case of the RAP. Therefore, authors further extended their work to develop a VNS algorithm for the RAP and tested a set of well-known benchmark problems from the literature. Results on 33 test instances ranging from less to severely constrained conditions show that the variable neighborhood search method improves the performance of VND and provides a competitive solution quality at economically computational expense in comparison with the best-known heuristics including ant colony optimization, genetic algorithm, and tabu search.
Lyster, Peter M.; Guo, J.; Clune, T.; Larson, J. W.; Atlas, Robert (Technical Monitor)
2001-01-01
The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations are assimilated with the output of a dynamical model to generate best-estimates of the states of the system. It is thus a mapping problem, whereby scattered observations are converted into regular accurate maps of wind, temperature, moisture and other variables. The DAO is developing and using 4DDA algorithms that provide these datasets, or analyses, in support of Earth System Science research. Two large-scale algorithms are discussed. The first approach, the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 20 gigaflop/s. The second approach, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS. For atmospheric assimilation, the gridded dynamical fields typically have More than 10(exp 6) variables, therefore the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem can easily scale to petaflop/s proportions. We discuss the computational complexity of GEOS DAS and our implementation of the Kalman filter. We also discuss and quantify some of the technical issues and limitations in developing efficient, in terms of wall clock time, and scalable parallel implementations of the algorithms.
International Nuclear Information System (INIS)
Woodruff, S.B.
1992-01-01
The Transient Reactor Analysis Code (TRAC), which features a two- fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers algorithms, such as a neural net representation, that do not exhibit algorithms, such as a neural net representation, that do not exhibit load-balancing problems
A Self Consistent Multiprocessor Space Charge Algorithm that is Almost Embarrassingly Parallel
International Nuclear Information System (INIS)
Nissen, Edward; Erdelyi, B.; Manikonda, S.L.
2012-01-01
We present a space charge code that is self consistent, massively parallelizeable, and requires very little communication between computer nodes; making the calculation almost embarrassingly parallel. This method is implemented in the code COSY Infinity where the differential algebras used in this code are important to the algorithm's proper functioning. The method works by calculating the self consistent space charge distribution using the statistical moments of the test particles, and converting them into polynomial series coefficients. These coefficients are combined with differential algebraic integrals to form the potential, and electric fields. The result is a map which contains the effects of space charge. This method allows for massive parallelization since its statistics based solver doesn't require any binning of particles, and only requires a vector containing the partial sums of the statistical moments for the different nodes to be passed. All other calculations are done independently. The resulting maps can be used to analyze the system using normal form analysis, as well as advance particles in numbers and at speeds that were previously impossible.
Directory of Open Access Journals (Sweden)
Zhong-Kai Feng
2017-01-01
Full Text Available With the increasingly serious energy crisis and environmental pollution, the short-term economic environmental hydrothermal scheduling (SEEHTS problem is becoming more and more important in modern electrical power systems. In order to handle the SEEHTS problem efficiently, the parallel multi-objective genetic algorithm (PMOGA is proposed in the paper. Based on the Fork/Join parallel framework, PMOGA divides the whole population of individuals into several subpopulations which will evolve in different cores simultaneously. In this way, PMOGA can avoid the wastage of computational resources and increase the population diversity. Moreover, the constraint handling technique is used to handle the complex constraints in SEEHTS, and a selection strategy based on constraint violation is also employed to ensure the convergence speed and solution feasibility. The results from a hydrothermal system in different cases indicate that PMOGA can make the utmost of system resources to significantly improve the computing efficiency and solution quality. Moreover, PMOGA has competitive performance in SEEHTS when compared with several other methods reported in the previous literature, providing a new approach for the operation of hydrothermal systems.
Ren, Zhong; Liu, Guodong; Huang, Zhen
2012-11-01
The image reconstruction is a key step in medical imaging (MI) and its algorithm's performance determinates the quality and resolution of reconstructed image. Although some algorithms have been used, filter back-projection (FBP) algorithm is still the classical and commonly-used algorithm in clinical MI. In FBP algorithm, filtering of original projection data is a key step in order to overcome artifact of the reconstructed image. Since simple using of classical filters, such as Shepp-Logan (SL), Ram-Lak (RL) filter have some drawbacks and limitations in practice, especially for the projection data polluted by non-stationary random noises. So, an improved wavelet denoising combined with parallel-beam FBP algorithm is used to enhance the quality of reconstructed image in this paper. In the experiments, the reconstructed effects were compared between the improved wavelet denoising and others (directly FBP, mean filter combined FBP and median filter combined FBP method). To determine the optimum reconstruction effect, different algorithms, and different wavelet bases combined with three filters were respectively test. Experimental results show the reconstruction effect of improved FBP algorithm is better than that of others. Comparing the results of different algorithms based on two evaluation standards i.e. mean-square error (MSE), peak-to-peak signal-noise ratio (PSNR), it was found that the reconstructed effects of the improved FBP based on db2 and Hanning filter at decomposition scale 2 was best, its MSE value was less and the PSNR value was higher than others. Therefore, this improved FBP algorithm has potential value in the medical imaging.
Parallelization of Subchannel Analysis Code MATRA
International Nuclear Information System (INIS)
Kim, Seongjin; Hwang, Daehyun; Kwon, Hyouk
2014-01-01
A stand-alone calculation of MATRA code used up pertinent computing time for the thermal margin calculations while a relatively considerable time is needed to solve the whole core pin-by-pin problems. In addition, it is strongly required to improve the computation speed of the MATRA code to satisfy the overall performance of the multi-physics coupling calculations. Therefore, a parallel approach to improve and optimize the computability of the MATRA code is proposed and verified in this study. The parallel algorithm is embodied in the MATRA code using the MPI communication method and the modification of the previous code structure was minimized. An improvement is confirmed by comparing the results between the single and multiple processor algorithms. The speedup and efficiency are also evaluated when increasing the number of processors. The parallel algorithm was implemented to the subchannel code MATRA using the MPI. The performance of the parallel algorithm was verified by comparing the results with those from the MATRA with the single processor. It is also noticed that the performance of the MATRA code was greatly improved by implementing the parallel algorithm for the 1/8 core and whole core problems
Directory of Open Access Journals (Sweden)
V. E. Podol'skii
2015-01-01
Full Text Available The paper considers the implementing Bellman-Ford and Lee algorithms to find the shortest graph path on a computer system with multiple instruction stream and single data stream (MISD. The MISD computer is a computer that executes commands of arithmetic-logical processing (on the CPU and commands of structures processing (on the structures processor in parallel on a single data stream. Transformation of sequential programs into the MISD programs is a labor intensity process because it requires a stream of the arithmetic-logical processing to be manually separated from that of the structures processing. Algorithms based on the processing of data structures (e.g., algorithms on graphs show high performance on a MISD computer. Bellman-Ford and Lee algorithms for finding the shortest path on a graph are representatives of these algorithms. They are applied to robotics for automatic planning of the robot movement in-situ. Modification of Bellman-Ford and Lee algorithms for finding the shortest graph path in coprocessor MISD mode and the parallel MISD modification of these algorithms were first obtained in this article. Thus, this article continues a series of studies on the transformation of sequential algorithms into MISD ones (Dijkstra and Ford-Fulkerson 's algorithms and has a pronouncedly applied nature. The article also presents the analysis results of Bellman-Ford and Lee algorithms in MISD mode. The paper formulates the basic trends of a technique for parallelization of algorithms into arithmetic-logical processing stream and structures processing stream. Among the key areas for future research, development of the mathematical approach to provide a subsequently formalized and automated process of parallelizing sequential algorithms between the CPU and structures processor is highlighted. Among the mathematical models that can be used in future studies there are graph models of algorithms (e.g., dependency graph of a program. Due to the high
Unterkircher, A
2005-01-01
We propose methods for parallel assembling and iterative equation solving based on graph algorithms. The assembling technique is independent of dimension, element type and model shape. As a parallel solving technique we construct a multiplicative symmetric Schwarz preconditioner for the conjugate gradient method. Both methods have been incorporated into a non-linear FE code to simulate 3D metal extrusion processes. We illustrate the efficiency of these methods on shared memory computers by realistic examples.
Directory of Open Access Journals (Sweden)
Ashkan Tousimojarad
2013-12-01
Full Text Available We present the Glasgow Parallel Reduction Machine (GPRM, a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.
Data communications in a parallel active messaging interface of a parallel computer
Davis, Kristan D; Faraj, Daniel A
2013-07-09
Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Neural Parallel Engine: A toolbox for massively parallel neural signal processing.
Tam, Wing-Kin; Yang, Zhi
2018-05-01
Large-scale neural recordings provide detailed information on neuronal activities and can help elicit the underlying neural mechanisms of the brain. However, the computational burden is also formidable when we try to process the huge data stream generated by such recordings. In this study, we report the development of Neural Parallel Engine (NPE), a toolbox for massively parallel neural signal processing on graphical processing units (GPUs). It offers a selection of the most commonly used routines in neural signal processing such as spike detection and spike sorting, including advanced algorithms such as exponential-component-power-component (EC-PC) spike detection and binary pursuit spike sorting. We also propose a new method for detecting peaks in parallel through a parallel compact operation. Our toolbox is able to offer a 5× to 110× speedup compared with its CPU counterparts depending on the algorithms. A user-friendly MATLAB interface is provided to allow easy integration of the toolbox into existing workflows. Previous efforts on GPU neural signal processing only focus on a few rudimentary algorithms, are not well-optimized and often do not provide a user-friendly programming interface to fit into existing workflows. There is a strong need for a comprehensive toolbox for massively parallel neural signal processing. A new toolbox for massively parallel neural signal processing has been created. It can offer significant speedup in processing signals from large-scale recordings up to thousands of channels. Copyright © 2018 Elsevier B.V. All rights reserved.
Zinno, Ivana; De Luca, Claudio; Elefante, Stefano; Imperatore, Pasquale; Manunta, Michele; Casu, Francesco
2014-05-01
Differential Synthetic Aperture Radar Interferometry (DInSAR) is an effective technique to estimate and monitor ground displacements with centimetre accuracy [1]. In the last decade, advanced DInSAR algorithms, such as the Small Baseline Subset (SBAS) [2] one that is aimed at following the temporal evolution of the ground deformation, showed to be significantly useful remote sensing tools for the geoscience communities as well as for those related to hazard monitoring and risk mitigation. DInSAR scenario is currently characterized by the large and steady increasing availability of huge SAR data archives that have a broad range of diversified features according to the characteristics of the employed sensor. Indeed, besides the old generation sensors, that include ERS, ENVISAT and RADARSAT systems, the new X-band generation constellations, such as COSMO-SkyMed and TerraSAR-X, have permitted an overall study of ground deformations with an unprecedented detail thanks to their improved spatial resolution and reduced revisit time. Furthermore, the incoming ESA Sentinel-1 SAR satellite is characterized by a global coverage acquisition strategy and 12-day revisit time and, therefore, will further contribute to improve deformation analyses and monitoring capabilities. However, in this context, the capability to process such huge SAR data archives is strongly limited by the existing DInSAR algorithms, which are not specifically designed to exploit modern high performance computational infrastructures (e.g. cluster, grid and cloud computing platforms). The goal of this paper is to present a Parallel version of the SBAS algorithm (P-SBAS) which is based on a dual-level parallelization approach and embraces combined parallel strategies [3], [4]. A detailed description of the P-SBAS algorithm will be provided together with a scalability analysis focused on studying its performances. In particular, a P-SBAS scalability analysis with respect to the number of exploited CPUs has
Vector Green's function algorithm for radiative transfer in plane-parallel atmosphere
International Nuclear Information System (INIS)
Qin Yi; Box, Michael A.
2006-01-01
Green's function is a widely used approach for boundary value problems. In problems related to radiative transfer, Green's function has been found to be useful in land, ocean and atmosphere remote sensing. It is also a key element in higher order perturbation theory. This paper presents an explicit expression of the Green's function, in terms of the source and radiation field variables, for a plane-parallel atmosphere with either vacuum boundaries or a reflecting (BRDF) surface. Full polarization state is considered but the algorithm has been developed in such way that it can be easily reduced to solve scalar radiative transfer problems, which makes it possible to implement a single set of code for computing both the scalar and the vector Green's function
Semi-coarsening multigrid methods for parallel computing
Energy Technology Data Exchange (ETDEWEB)
Jones, J.E.
1996-12-31
Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.
Synchronization Of Parallel Discrete Event Simulations
Steinman, Jeffrey S.
1992-01-01
Adaptive, parallel, discrete-event-simulation-synchronization algorithm, Breathing Time Buckets, developed in Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) operating system. Algorithm allows parallel simulations to process events optimistically in fluctuating time cycles that naturally adapt while simulation in progress. Combines best of optimistic and conservative synchronization strategies while avoiding major disadvantages. Algorithm processes events optimistically in time cycles adapting while simulation in progress. Well suited for modeling communication networks, for large-scale war games, for simulated flights of aircraft, for simulations of computer equipment, for mathematical modeling, for interactive engineering simulations, and for depictions of flows of information.
The STAPL Parallel Graph Library
Harshvardhan,; Fidel, Adam; Amato, Nancy M.; Rauchwerger, Lawrence
2013-01-01
This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable
Parallel processing from applications to systems
Moldovan, Dan I
1993-01-01
This text provides one of the broadest presentations of parallelprocessing available, including the structure of parallelprocessors and parallel algorithms. The emphasis is on mappingalgorithms to highly parallel computers, with extensive coverage ofarray and multiprocessor architectures. Early chapters provideinsightful coverage on the analysis of parallel algorithms andprogram transformations, effectively integrating a variety ofmaterial previously scattered throughout the literature. Theory andpractice are well balanced across diverse topics in this concisepresentation. For exceptional cla
Algorithms and data structures for massively parallel generic adaptive finite element codes
Bangerth, Wolfgang
2011-12-01
Today\\'s largest supercomputers have 100,000s of processor cores and offer the potential to solve partial differential equations discretized by billions of unknowns. However, the complexity of scaling to such large machines and problem sizes has so far prevented the emergence of generic software libraries that support such computations, although these would lower the threshold of entry and enable many more applications to benefit from large-scale computing. We are concerned with providing this functionality for mesh-adaptive finite element computations. We assume the existence of an "oracle" that implements the generation and modification of an adaptive mesh distributed across many processors, and that responds to queries about its structure. Based on querying the oracle, we develop scalable algorithms and data structures for generic finite element methods. Specifically, we consider the parallel distribution of mesh data, global enumeration of degrees of freedom, constraints, and postprocessing. Our algorithms remove the bottlenecks that typically limit large-scale adaptive finite element analyses. We demonstrate scalability of complete finite element workflows on up to 16,384 processors. An implementation of the proposed algorithms, based on the open source software p4est as mesh oracle, is provided under an open source license through the widely used deal.II finite element software library. © 2011 ACM 0098-3500/2011/12-ART10 $10.00.
The numerical parallel computing of photon transport
International Nuclear Information System (INIS)
Huang Qingnan; Liang Xiaoguang; Zhang Lifa
1998-12-01
The parallel computing of photon transport is investigated, the parallel algorithm and the parallelization of programs on parallel computers both with shared memory and with distributed memory are discussed. By analyzing the inherent law of the mathematics and physics model of photon transport according to the structure feature of parallel computers, using the strategy of 'to divide and conquer', adjusting the algorithm structure of the program, dissolving the data relationship, finding parallel liable ingredients and creating large grain parallel subtasks, the sequential computing of photon transport into is efficiently transformed into parallel and vector computing. The program was run on various HP parallel computers such as the HY-1 (PVP), the Challenge (SMP) and the YH-3 (MPP) and very good parallel speedup has been gotten
Parallel Fast Legendre Transform
Alves de Inda, M.; Bisseling, R.H.; Maslen, D.K.
1998-01-01
We discuss a parallel implementation of a fast algorithm for the discrete polynomial Legendre transform We give an introduction to the DriscollHealy algorithm using polynomial arithmetic and present experimental results on the eciency and accuracy of our implementation The algorithms were
Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
2017-05-01
Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
Directory of Open Access Journals (Sweden)
Ion LUNGU
2012-01-01
Full Text Available In this paper, we research, analyze and develop optimization solutions for the parallel reduction function using graphics processing units (GPUs that implement the Compute Unified Device Architecture (CUDA, a modern and novel approach for improving the software performance of data processing applications and algorithms. Many of these applications and algorithms make use of the reduction function in their computational steps. After having designed the function and its algorithmic steps in CUDA, we have progressively developed and implemented optimization solutions for the reduction function. In order to confirm, test and evaluate the solutions' efficiency, we have developed a custom tailored benchmark suite. We have analyzed the obtained experimental results regarding: the comparison of the execution time and bandwidth when using graphic processing units covering the main CUDA architectures (Tesla GT200, Fermi GF100, Kepler GK104 and a central processing unit; the data type influence; the binary operator's influence.
Energy Technology Data Exchange (ETDEWEB)
Azmy, Yousry
2014-06-10
We employ the Integral Transport Matrix Method (ITMM) as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells' fluxes and between the cells' and boundary surfaces' fluxes. The main goals of this work are to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and parallel performance of the developed methods with increasing number of processes, P. The fastest observed parallel solution method, Parallel Gauss-Seidel (PGS), was used in a weak scaling comparison with the PARTISN transport code, which uses the source iteration (SI) scheme parallelized with the Koch-baker-Alcouffe (KBA) method. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method- even without acceleration/preconditioning-is completitive for optically thick problems as P is increased to the tens of thousands range. For the most optically thick cells tested, PGS reduced execution time by an approximate factor of three for problems with more than 130 million computational cells on P = 32,768. Moreover, the SI-DSA execution times's trend rises generally more steeply with increasing P than the PGS trend. Furthermore, the PGS method outperforms SI for the periodic heterogeneous layers (PHL) configuration problems. The PGS method outperforms SI and SI-DSA on as few as P = 16 for PHL problems and reduces execution time by a factor of ten or more for all problems considered with more than 2 million computational cells on P = 4.096.
Directory of Open Access Journals (Sweden)
Chunfeng Liu
2013-01-01
Full Text Available The paper presents a novel hybrid genetic algorithm (HGA for a deterministic scheduling problem where multiple jobs with arbitrary precedence constraints are processed on multiple unrelated parallel machines. The objective is to minimize total tardiness, since delays of the jobs may lead to punishment cost or cancellation of orders by the clients in many situations. A priority rule-based heuristic algorithm, which schedules a prior job on a prior machine according to the priority rule at each iteration, is suggested and embedded to the HGA for initial feasible schedules that can be improved in further stages. Computational experiments are conducted to show that the proposed HGA performs well with respect to accuracy and efficiency of solution for small-sized problems and gets better results than the conventional genetic algorithm within the same runtime for large-sized problems.
Parallel optimization algorithm for drone inspection in the building industry
Walczyński, Maciej; BoŻejko, Wojciech; Skorupka, Dariusz
2017-07-01
In this paper we present an approach for Vehicle Routing Problem with Drones (VRPD) in case of building inspection from the air. In autonomic inspection process there is a need to determine of the optimal route for inspection drone. This is especially important issue because of the very limited flight time of modern multicopters. The method of determining solutions for Traveling Salesman Problem(TSP), described in this paper bases on Parallel Evolutionary Algorithm (ParEA)with cooperative and independent approach for communication between threads. This method described first by Bożejko and Wodecki [1] bases on the observation that if exists some number of elements on certain positions in a number of permutations which are local minima, then those elements will be in the same position in the optimal solution for TSP problem. Numerical experiments were made on BEM computational cluster with using MPI library.
International Nuclear Information System (INIS)
Epelboin, Y.
1996-01-01
This paper presents a new algorithm for the integration of Takagi-Taupin equations taking into account the fact that X-ray diffraction is a parallel phenomenon. The diffraction equations show that the propagation of the waves is independent in each incidence plane. It is thus possible to compute in parallel the propagation of the waves in different planes. Two algorithms are presented: the first one for multiprocessor machines where the processors share a common memory, the second one for massively parallel computers. The program is written to achieve a high vectorization ratio and to make it as efficient as possible with modern superscalar and array processors. The simulation of the image of a defect has been divided into two independent parts. In the first one, one computes the derivatives of the deformation inside the crystal; in the second one, these results are used to simulate the image. This allows one rapidly to change the model for a defect, something that was not feasible in all previously written simulation programs since the computation of the deformation was part of the simulation. The study of stroboscopic images of the propagation of acoustic waves in piezoelectric devices is given as an example of the possibilities of this new program. (orig.)
Reduced complexity and latency for a massive MIMO system using a parallel detection algorithm
Directory of Open Access Journals (Sweden)
Shoichi Higuchi
2017-09-01
Full Text Available In recent years, massive MIMO systems have been widely researched to realize high-speed data transmission. Since massive MIMO systems use a large number of antennas, these systems require huge complexity to detect the signal. In this paper, we propose a novel detection method for massive MIMO using parallel detection with maximum likelihood detection with QR decomposition and M-algorithm (QRM-MLD to reduce the complexity and latency. The proposed scheme obtains an R matrix after permutation of an H matrix and QR decomposition. The R matrix is also eliminated using a Gauss–Jordan elimination method. By using a modified R matrix, the proposed method can detect the transmitted signal using parallel detection. From the simulation results, the proposed scheme can achieve a reduced complexity and latency with a little degradation of the bit error rate (BER performance compared with the conventional method.
Dong, Yu-Shuang; Xu, Gao-Chao; Fu, Xiao-Dong
2014-01-01
The cloud platform provides various services to users. More and more cloud centers provide infrastructure as the main way of operating. To improve the utilization rate of the cloud center and to decrease the operating cost, the cloud center provides services according to requirements of users by sharding the resources with virtualization. Considering both QoS for users and cost saving for cloud computing providers, we try to maximize performance and minimize energy cost as well. In this paper, we propose a distributed parallel genetic algorithm (DPGA) of placement strategy for virtual machines deployment on cloud platform. It executes the genetic algorithm parallelly and distributedly on several selected physical hosts in the first stage. Then it continues to execute the genetic algorithm of the second stage with solutions obtained from the first stage as the initial population. The solution calculated by the genetic algorithm of the second stage is the optimal one of the proposed approach. The experimental results show that the proposed placement strategy of VM deployment can ensure QoS for users and it is more effective and more energy efficient than other placement strategies on the cloud platform.
Directory of Open Access Journals (Sweden)
Yu-Shuang Dong
2014-01-01
Full Text Available The cloud platform provides various services to users. More and more cloud centers provide infrastructure as the main way of operating. To improve the utilization rate of the cloud center and to decrease the operating cost, the cloud center provides services according to requirements of users by sharding the resources with virtualization. Considering both QoS for users and cost saving for cloud computing providers, we try to maximize performance and minimize energy cost as well. In this paper, we propose a distributed parallel genetic algorithm (DPGA of placement strategy for virtual machines deployment on cloud platform. It executes the genetic algorithm parallelly and distributedly on several selected physical hosts in the first stage. Then it continues to execute the genetic algorithm of the second stage with solutions obtained from the first stage as the initial population. The solution calculated by the genetic algorithm of the second stage is the optimal one of the proposed approach. The experimental results show that the proposed placement strategy of VM deployment can ensure QoS for users and it is more effective and more energy efficient than other placement strategies on the cloud platform.
Development of parallel GPU based algorithms for problems in nuclear area
International Nuclear Information System (INIS)
Almeida, Adino Americo Heimlich
2009-01-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in two typical problems of Nuclear area. The neutron transport simulation using Monte Carlo method and solve heat equation in a bi-dimensional domain by finite differences method. To achieve this, we develop parallel algorithms for GPU and CPU in the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad core processors, without precision loss. (author)
Muckley, Matthew J; Noll, Douglas C; Fessler, Jeffrey A
2015-02-01
Sparsity-promoting regularization is useful for combining compressed sensing assumptions with parallel MRI for reducing scan time while preserving image quality. Variable splitting algorithms are the current state-of-the-art algorithms for SENSE-type MR image reconstruction with sparsity-promoting regularization. These methods are very general and have been observed to work with almost any regularizer; however, the tuning of associated convergence parameters is a commonly-cited hindrance in their adoption. Conversely, majorize-minimize algorithms based on a single Lipschitz constant have been observed to be slow in shift-variant applications such as SENSE-type MR image reconstruction since the associated Lipschitz constants are loose bounds for the shift-variant behavior. This paper bridges the gap between the Lipschitz constant and the shift-variant aspects of SENSE-type MR imaging by introducing majorizing matrices in the range of the regularizer matrix. The proposed majorize-minimize methods (called BARISTA) converge faster than state-of-the-art variable splitting algorithms when combined with momentum acceleration and adaptive momentum restarting. Furthermore, the tuning parameters associated with the proposed methods are unitless convergence tolerances that are easier to choose than the constraint penalty parameters required by variable splitting algorithms.
Introduction of Parallel GPGPU Acceleration Algorithms for the Solution of Radiative Transfer
Godoy, William F.; Liu, Xu
2011-01-01
General-purpose computing on graphics processing units (GPGPU) is a recent technique that allows the parallel graphics processing unit (GPU) to accelerate calculations performed sequentially by the central processing unit (CPU). To introduce GPGPU to radiative transfer, the Gauss-Seidel solution of the well-known expressions for 1-D and 3-D homogeneous, isotropic media is selected as a test case. Different algorithms are introduced to balance memory and GPU-CPU communication, critical aspects of GPGPU. Results show that speed-ups of one to two orders of magnitude are obtained when compared to sequential solutions. The underlying value of GPGPU is its potential extension in radiative solvers (e.g., Monte Carlo, discrete ordinates) at a minimal learning curve.
Parallel Algorithm Solves Coupled Differential Equations
Hayashi, A.
1987-01-01
Numerical methods adapted to concurrent processing. Algorithm solves set of coupled partial differential equations by numerical integration. Adapted to run on hypercube computer, algorithm separates problem into smaller problems solved concurrently. Increase in computing speed with concurrent processing over that achievable with conventional sequential processing appreciable, especially for large problems.
Pattern-Driven Automatic Parallelization
Directory of Open Access Journals (Sweden)
Christoph W. Kessler
1996-01-01
Full Text Available This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.
Xu, Zhenzhen; Zou, Yongxing; Kong, Xiangjie
2015-01-01
To our knowledge, this paper investigates the first application of meta-heuristic algorithms to tackle the parallel machines scheduling problem with weighted late work criterion and common due date ([Formula: see text]). Late work criterion is one of the performance measures of scheduling problems which considers the length of late parts of particular jobs when evaluating the quality of scheduling. Since this problem is known to be NP-hard, three meta-heuristic algorithms, namely ant colony system, genetic algorithm, and simulated annealing are designed and implemented, respectively. We also propose a novel algorithm named LDF (largest density first) which is improved from LPT (longest processing time first). The computational experiments compared these meta-heuristic algorithms with LDF, LPT and LS (list scheduling), and the experimental results show that SA performs the best in most cases. However, LDF is better than SA in some conditions, moreover, the running time of LDF is much shorter than SA.
Kim, Seonah; Orendt, Anita M; Ferraro, Marta B; Facelli, Julio C
2009-10-01
This article describes the application of our distributed computing framework for crystal structure prediction (CSP) the modified genetic algorithms for crystal and cluster prediction (MGAC), to predict the crystal structure of flexible molecules using the general Amber force field (GAFF) and the CHARMM program. The MGAC distributed computing framework includes a series of tightly integrated computer programs for generating the molecule's force field, sampling crystal structures using a distributed parallel genetic algorithm and local energy minimization of the structures followed by the classifying, sorting, and archiving of the most relevant structures. Our results indicate that the method can consistently find the experimentally known crystal structures of flexible molecules, but the number of missing structures and poor ranking observed in some crystals show the need for further improvement of the potential. Copyright 2009 Wiley Periodicals, Inc.
Liu, Kuojuey Ray
1990-01-01
Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in details. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on triangular array and another based on rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-pase operations. Performance issues are also considered.
Parallel computing solution of Boltzmann neutron transport equation
International Nuclear Information System (INIS)
Ansah-Narh, T.
2010-01-01
The focus of the research was on developing parallel computing algorithm for solving Eigen-values of the Boltzmam Neutron Transport Equation (BNTE) in a slab geometry using multi-grid approach. In response to the problem of slow execution of serial computing when solving large problems, such as BNTE, the study was focused on the design of parallel computing systems which was an evolution of serial computing that used multiple processing elements simultaneously to solve complex physical and mathematical problems. Finite element method (FEM) was used for the spatial discretization scheme, while angular discretization was accomplished by expanding the angular dependence in terms of Legendre polynomials. The eigenvalues representing the multiplication factors in the BNTE were determined by the power method. MATLAB Compiler Version 4.1 (R2009a) was used to compile the MATLAB codes of BNTE. The implemented parallel algorithms were enabled with matlabpool, a Parallel Computing Toolbox function. The option UseParallel was set to 'always' and the default value of the option was 'never'. When those conditions held, the solvers computed estimated gradients in parallel. The parallel computing system was used to handle all the bottlenecks in the matrix generated from the finite element scheme and each domain of the power method generated. The parallel algorithm was implemented on a Symmetric Multi Processor (SMP) cluster machine, which had Intel 32 bit quad-core x 86 processors. Convergence rates and timings for the algorithm on the SMP cluster machine were obtained. Numerical experiments indicated the designed parallel algorithm could reach perfect speedup and had good stability and scalability. (au)
Design considerations for parallel graphics libraries
Crockett, Thomas W.
1994-01-01
Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.
Directory of Open Access Journals (Sweden)
A. L. Lapikov
2014-01-01
Full Text Available The article is aimed at creating techniques to study multi-sectional manipulators with parallel structure. To solve this task the analysis in the field concerned was carried out to reveal both advantages and drawbacks of such executive mechanisms and main problems to be encountered in the course of research. The work shows that it is inefficient to create complete mathematical models of multisectional manipulators, which in the context of solving a direct kinematic problem are to derive a functional dependence of location and orientation of the end effector on all the generalized coordinates of the mechanism. The structure of multisectional manipulators was considered, where the sections are platform manipulators of parallel kinematics with six degrees of freedom. The paper offers an algorithm to define location and orientation of the end effector of the manipulator by means of iterative solution of analytical equation of the moving platform plane for each section. The equation for the unknown plane is derived using three points, which are attachment points of the moving platform joints. To define the values of joint coordinates a system of nine non-linear equations is completed. It is necessary to mention that for completion of the equation system are used the equations with the same type of non-linearity. The physical sense of all nine equations of the system is Euclidean distance between the points of the manipulator. The result of algorithm execution is a matrix of homogenous transformation for each section. The correlations describing transformations between adjoining sections of the manipulator are given. An example of the mechanism consisting of three sections is examined. The comparison of theoretical calculations with results obtained on a 3D-prototype is made. The next step of the work is to conduct research activities both in the field of dynamics of platform parallel kinematics manipulators with six degrees of freedom and in the
Writing parallel programs that work
CERN. Geneva
2012-01-01
Serial algorithms typically run inefficiently on parallel machines. This may sound like an obvious statement, but it is the root cause of why parallel programming is considered to be difficult. The current state of the computer industry is still that almost all programs in existence are serial. This talk will describe the techniques used in the Intel Parallel Studio to provide a developer with the tools necessary to understand the behaviors and limitations of the existing serial programs. Once the limitations are known the developer can refactor the algorithms and reanalyze the resulting programs with the tools in the Intel Parallel Studio to create parallel programs that work. About the speaker Paul Petersen is a Sr. Principal Engineer in the Software and Solutions Group (SSG) at Intel. He received a Ph.D. degree in Computer Science from the University of Illinois in 1993. After UIUC, he was employed at Kuck and Associates, Inc. (KAI) working on auto-parallelizing compiler (KAP), and was involved in th...
Scheduling Parallel Jobs Using Migration and Consolidation in the Cloud
Directory of Open Access Journals (Sweden)
Xiaocheng Liu
2012-01-01
Full Text Available An increasing number of high performance computing parallel applications leverages the power of the cloud for parallel processing. How to schedule the parallel applications to improve the quality of service is the key to the successful host of parallel applications in the cloud. The large scale of the cloud makes the parallel job scheduling more complicated as even simple parallel job scheduling problem is NP-complete. In this paper, we propose a parallel job scheduling algorithm named MEASY. MEASY adopts migration and consolidation to enhance the most popular EASY scheduling algorithm. Our extensive experiments on well-known workloads show that our algorithm takes very good care of the quality of service. For two common parallel job scheduling objectives, our algorithm produces an up to 41.1% and an average of 23.1% improvement on the average response time; an up to 82.9% and an average of 69.3% improvement on the average slowdown. Our algorithm is robust even in terms that it allows inaccurate CPU usage estimation and high migration cost. Our approach involves trivial modification on EASY and requires no additional technique; it is practical and effective in the cloud environment.
Parallel computing: numerics, applications, and trends
National Research Council Canada - National Science Library
Trobec, Roman; Vajteršic, Marián; Zinterhof, Peter
2009-01-01
... and/or distributed systems. The contributions to this book are focused on topics most concerned in the trends of today's parallel computing. These range from parallel algorithmics, programming, tools, network computing to future parallel computing. Particular attention is paid to parallel numerics: linear algebra, differential equations, numerica...
Rerucha, Simon; Sarbort, Martin; Hola, Miroslava; Cizek, Martin; Hucl, Vaclav; Cip, Ondrej; Lazar, Josef
2016-12-01
The homodyne detection with only a single detector represents a promising approach in the interferometric application which enables a significant reduction of the optical system complexity while preserving the fundamental resolution and dynamic range of the single frequency laser interferometers. We present the design, implementation and analysis of algorithmic methods for computational processing of the single-detector interference signal based on parallel pipelined processing suitable for real time implementation on a programmable hardware platform (e.g. the FPGA - Field Programmable Gate Arrays or the SoC - System on Chip). The algorithmic methods incorporate (a) the single detector signal (sine) scaling, filtering, demodulations and mixing necessary for the second (cosine) quadrature signal reconstruction followed by a conic section projection in Cartesian plane as well as (a) the phase unwrapping together with the goniometric and linear transformations needed for the scale linearization and periodic error correction. The digital computing scheme was designed for bandwidths up to tens of megahertz which would allow to measure the displacements at the velocities around half metre per second. The algorithmic methods were tested in real-time operation with a PC-based reference implementation that employed the advantage pipelined processing by balancing the computational load among multiple processor cores. The results indicate that the algorithmic methods are suitable for a wide range of applications [3] and that they are bringing the fringe counting interferometry closer to the industrial applications due to their optical setup simplicity and robustness, computational stability, scalability and also a cost-effectiveness.
Energy Technology Data Exchange (ETDEWEB)
Schatz, Martin D. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Kolda, Tamara G. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); van de Geijn, Robert [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2015-09-01
Large-scale datasets in computational chemistry typically require distributed-memory parallel methods to perform a special operation known as tensor contraction. Tensors are multidimensional arrays, and a tensor contraction is akin to matrix multiplication with special types of permutations. Creating an efficient algorithm and optimized im- plementation in this domain is complex, tedious, and error-prone. To address this, we develop a notation to express data distributions so that we can apply use automated methods to find optimized implementations for tensor contractions. We consider the spin-adapted coupled cluster singles and doubles method from computational chemistry and use our methodology to produce an efficient implementation. Experiments per- formed on the IBM Blue Gene/Q and Cray XC30 demonstrate impact both improved performance and reduced memory consumption.
Lei, H.; Lu, Z.; Vesselinov, V. V.; Ye, M.
2017-12-01
Simultaneous identification of both the zonation structure of aquifer heterogeneity and the hydrogeological parameters associated with these zones is challenging, especially for complex subsurface heterogeneity fields. In this study, a new approach, based on the combination of the level set method and a parallel genetic algorithm is proposed. Starting with an initial guess for the zonation field (including both zonation structure and the hydraulic properties of each zone), the level set method ensures that material interfaces are evolved through the inverse process such that the total residual between the simulated and observed state variables (hydraulic head) always decreases, which means that the inversion result depends on the initial guess field and the minimization process might fail if it encounters a local minimum. To find the global minimum, the genetic algorithm (GA) is utilized to explore the parameters that define initial guess fields, and the minimal total residual corresponding to each initial guess field is considered as the fitness function value in the GA. Due to the expensive evaluation of the fitness function, a parallel GA is adapted in combination with a simulated annealing algorithm. The new approach has been applied to several synthetic cases in both steady-state and transient flow fields, including a case with real flow conditions at the chromium contaminant site at the Los Alamos National Laboratory. The results show that this approach is capable of identifying the arbitrary zonation structures of aquifer heterogeneity and the hydrogeological parameters associated with these zones effectively.
Two and Three-Dimensional Nonlocal DFT for Inhomogeneous Fluids I: Algorithms and Parallelization
Energy Technology Data Exchange (ETDEWEB)
Frink, Laura J. Douglas; Salinger, Andrew
1999-08-09
Fluids adsorbed near surfaces, macromolecules, and in porous materials are inhomogeneous, inhibiting spatially varying density distributions. This inhomogeneity in the fluid plays an important role in controlling a wide variety of complex physical phenomena including wetting, self-assembly, corrosion, and molecular recognition. One of the key methods for studying the properties of inhomogeneous fluids in simple geometries has been density functional theory (DFT). However, there has been a conspicuous lack of calculations in complex 2D and 3D geometries. The computational difficulty arises from the need to perform nested integrals that are due to nonlocal terms in the free energy functional These integral equations are expensive both in evaluation time and in memory requirements; however, the expense can be mitigated by intelligent algorithms and the use of parallel computers. This paper details our efforts to develop efficient numerical algorithms so that no local DFT calculations in complex geometries that require two or three dimensions can be performed. The success of this implementation will enable the study of solvation effects at heterogeneous surfaces, in zeolites, in solvated (bio)polymers, and in colloidal suspensions.
Advanced parallel processing with supercomputer architectures
International Nuclear Information System (INIS)
Hwang, K.
1987-01-01
This paper investigates advanced parallel processing techniques and innovative hardware/software architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. The authors cover advanced processing techniques suitable for supercomputers, high-end mainframes, minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, load balancing methods, mapping parallel algorithms, operating system functions, application library, and multidiscipline interactions are investigated to ensure high performance. At the end, they assess the potentials of optical and neural technologies for developing future supercomputers
Domain decomposition methods and parallel computing
International Nuclear Information System (INIS)
Meurant, G.
1991-01-01
In this paper, we show how to efficiently solve large linear systems on parallel computers. These linear systems arise from discretization of scientific computing problems described by systems of partial differential equations. We show how to get a discrete finite dimensional system from the continuous problem and the chosen conjugate gradient iterative algorithm is briefly described. Then, the different kinds of parallel architectures are reviewed and their advantages and deficiencies are emphasized. We sketch the problems found in programming the conjugate gradient method on parallel computers. For this algorithm to be efficient on parallel machines, domain decomposition techniques are introduced. We give results of numerical experiments showing that these techniques allow a good rate of convergence for the conjugate gradient algorithm as well as computational speeds in excess of a billion of floating point operations per second. (author). 5 refs., 11 figs., 2 tabs., 1 inset
Parallel sparse direct solver for integrated circuit simulation
Chen, Xiaoming; Yang, Huazhong
2017-01-01
This book describes algorithmic methods and parallelization techniques to design a parallel sparse direct solver which is specifically targeted at integrated circuit simulation problems. The authors describe a complete flow and detailed parallel algorithms of the sparse direct solver. They also show how to improve the performance by simple but effective numerical techniques. The sparse direct solver techniques described can be applied to any SPICE-like integrated circuit simulator and have been proven to be high-performance in actual circuit simulation. Readers will benefit from the state-of-the-art parallel integrated circuit simulation techniques described in this book, especially the latest parallel sparse matrix solution techniques. · Introduces complicated algorithms of sparse linear solvers, using concise principles and simple examples, without complex theory or lengthy derivations; · Describes a parallel sparse direct solver that can be adopted to accelerate any SPICE-like integrated circuit simulato...
Debelak, Rudolf; Tran, Ulrich S
2016-01-01
The analysis of polychoric correlations via principal component analysis and exploratory factor analysis are well-known approaches to determine the dimensionality of ordered categorical items. However, the application of these approaches has been considered as critical due to the possible indefiniteness of the polychoric correlation matrix. A possible solution to this problem is the application of smoothing algorithms. This study compared the effects of three smoothing algorithms, based on the Frobenius norm, the adaption of the eigenvalues and eigenvectors, and on minimum-trace factor analysis, on the accuracy of various variations of parallel analysis by the means of a simulation study. We simulated different datasets which varied with respect to the size of the respondent sample, the size of the item set, the underlying factor model, the skewness of the response distributions and the number of response categories in each item. We found that a parallel analysis and principal component analysis of smoothed polychoric and Pearson correlations led to the most accurate results in detecting the number of major factors in simulated datasets when compared to the other methods we investigated. Of the methods used for smoothing polychoric correlation matrices, we recommend the algorithm based on minimum trace factor analysis.
Scalable parallel prefix solvers for discrete ordinates transport
International Nuclear Information System (INIS)
Pautz, S.; Pandya, T.; Adams, M.
2009-01-01
The well-known 'sweep' algorithm for inverting the streaming-plus-collision term in first-order deterministic radiation transport calculations has some desirable numerical properties. However, it suffers from parallel scaling issues caused by a lack of concurrency. The maximum degree of concurrency, and thus the maximum parallelism, grows more slowly than the problem size for sweeps-based solvers. We investigate a new class of parallel algorithms that involves recasting the streaming-plus-collision problem in prefix form and solving via cyclic reduction. This method, although computationally more expensive at low levels of parallelism than the sweep algorithm, offers better theoretical scalability properties. Previous work has demonstrated this approach for one-dimensional calculations; we show how to extend it to multidimensional calculations. Notably, for multiple dimensions it appears that this approach is limited to long-characteristics discretizations; other discretizations cannot be cast in prefix form. We implement two variants of the algorithm within the radlib/SCEPTRE transport code library at Sandia National Laboratories and show results on two different massively parallel systems. Both the 'forward' and 'symmetric' solvers behave similarly, scaling well to larger degrees of parallelism then sweeps-based solvers. We do observe some issues at the highest levels of parallelism (relative to the system size) and discuss possible causes. We conclude that this approach shows good potential for future parallel systems, but the parallel scalability will depend heavily on the architecture of the communication networks of these systems. (authors)
Data parallel sorting for particle simulation
Dagum, Leonardo
1992-01-01
Sorting on a parallel architecture is a communications intensive event which can incur a high penalty in applications where it is required. In the case of particle simulation, only integer sorting is necessary, and sequential implementations easily attain the minimum performance bound of O (N) for N particles. Parallel implementations, however, have to cope with the parallel sorting problem which, in addition to incurring a heavy communications cost, can make the minimun performance bound difficult to attain. This paper demonstrates how the sorting problem in a particle simulation can be reduced to a merging problem, and describes an efficient data parallel algorithm to solve this merging problem in a particle simulation. The new algorithm is shown to be optimal under conditions usual for particle simulation, and its fieldwise implementation on the Connection Machine is analyzed in detail. The new algorithm is about four times faster than a fieldwise implementation of radix sort on the Connection Machine.
Scemama, Anthony; Renon, Nicolas; Rapacioli, Mathias
2014-06-10
We present an algorithm and its parallel implementation for solving a self-consistent problem as encountered in Hartree-Fock or density functional theory. The algorithm takes advantage of the sparsity of matrices through the use of local molecular orbitals. The implementation allows one to exploit efficiently modern symmetric multiprocessing (SMP) computer architectures. As a first application, the algorithm is used within the density-functional-based tight binding method, for which most of the computational time is spent in the linear algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that with this algorithm (i) single point calculations on very large systems (millions of atoms) can be performed on large SMP machines, (ii) calculations involving intermediate size systems (1000-100 000 atoms) are also strongly accelerated and can run efficiently on standard servers, and (iii) the error on the total energy due to the use of a cutoff in the molecular orbital coefficients can be controlled such that it remains smaller than the SCF convergence criterion.
Implementing O(N N-Body Algorithms Efficiently in Data-Parallel Languages
Directory of Open Access Journals (Sweden)
Yu Hu
1996-01-01
Full Text Available The optimization techniques for hierarchical O(N N-body algorithms described here focus on managing the data distribution and the data references, both between the memories of different nodes and within the memory hierarchy of each node. We show how the techniques can be expressed in data-parallel languages, such as High Performance Fortran (HPF and Connection Machine Fortran (CMF. The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N N-body method for the Connection Machine system CM-5/5E. Of the total execution time, communication accounts for about 10–20% of the total time, with the average efficiency for arithmetic operations being about 40% and the total efficiency (including communication being about 35%. For the CM-5E, a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node has been measured.
Parallel-Processing Test Bed For Simulation Software
Blech, Richard; Cole, Gary; Townsend, Scott
1996-01-01
Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
Approximation algorithms for the parallel flow shop problem
X. Zhang (Xiandong); S.L. van de Velde (Steef)
2012-01-01
textabstractWe consider the NP-hard problem of scheduling n jobs in m two-stage parallel flow shops so as to minimize the makespan. This problem decomposes into two subproblems: assigning the jobs to parallel flow shops; and scheduling the jobs assigned to the same flow shop by use of Johnson's
Global restructuring of the CPM-2 transport algorithm for vector and parallel processing
International Nuclear Information System (INIS)
Vujic, J.L.; Martin, W.R.
1989-01-01
The CPM-2 code is an assembly transport code based on the collision probability (CP) method. It can in principle be applied to global reactor problems, but its excessive computational demands prevent this application. Therefore, a new transport algorithm for CPM-2 has been developed for vector-parallel architectures, which has resulted in an overall factor of 20 speedup (wall clock) on the IBM 3090-600E. This paper presents the detailed results of this effort as well as a brief description of ongoing effort to remove some of the modeling limitations in CPM-2 that inhibit its use for global applications, such as the use of the pure CP treatment and the assumption of isotropic scattering
Aspects of computation on asynchronous parallel processors
International Nuclear Information System (INIS)
Wright, M.
1989-01-01
The increasing availability of asynchronous parallel processors has provided opportunities for original and useful work in scientific computing. However, the field of parallel computing is still in a highly volatile state, and researchers display a wide range of opinion about many fundamental questions such as models of parallelism, approaches for detecting and analyzing parallelism of algorithms, and tools that allow software developers and users to make effective use of diverse forms of complex hardware. This volume collects the work of researchers specializing in different aspects of parallel computing, who met to discuss the framework and the mechanics of numerical computing. The far-reaching impact of high-performance asynchronous systems is reflected in the wide variety of topics, which include scientific applications (e.g. linear algebra, lattice gauge simulation, ordinary and partial differential equations), models of parallelism, parallel language features, task scheduling, automatic parallelization techniques, tools for algorithm development in parallel environments, and system design issues
Algorithm and Implementation of Distributed ESN Using Spark Framework and Parallel PSO
Directory of Open Access Journals (Sweden)
Kehe Wu
2017-04-01
Full Text Available The echo state network (ESN employs a huge reservoir with sparsely and randomly connected internal nodes and only trains the output weights, which avoids the suboptimal problem, exploding and vanishing gradients, high complexity and other disadvantages faced by traditional recurrent neural network (RNN training. In light of the outstanding adaption to nonlinear dynamical systems, ESN has been applied into a wide range of applications. However, in the era of Big Data, with an enormous amount of data being generated continuously every day, the data are often distributed and stored in real applications, and thus the centralized ESN training process is prone to being technologically unsuitable. In order to achieve the requirement of Big Data applications in the real world, in this study we propose an algorithm and its implementation for distributed ESN training. The mentioned algorithm is based on the parallel particle swarm optimization (P-PSO technique and the implementation uses Spark, a famous large-scale data processing framework. Four extremely large-scale datasets, including artificial benchmarks, real-world data and image data, are adopted to verify our framework on a stretchable platform. Experimental results indicate that the proposed work is accurate in the era of Big Data, regarding speed, accuracy and generalization capabilities.
Narayanan, Kiran; Mora Cordova, Angel; Allsopp, Nicholas; El Sayed, Tamer S.
2012-01-01
A hybrid parallelization method composed of a coarse-grained genetic algorithm (GA) and fine-grained objective function evaluations is implemented on a heterogeneous computational resource consisting of 16 IBM Blue Gene/P racks, a single x86 cluster
A parallel stereo reconstruction algorithm with applications in entomology (APSRA)
Bhasin, Rajesh; Jang, Won Jun; Hart, John C.
2012-03-01
We propose a fast parallel algorithm for the reconstruction of 3-Dimensional point clouds of insects from binocular stereo image pairs using a hierarchical approach for disparity estimation. Entomologists study various features of insects to classify them, build their distribution maps, and discover genetic links between specimens among various other essential tasks. This information is important to the pesticide and the pharmaceutical industries among others. When considering the large collections of insects entomologists analyze, it becomes difficult to physically handle the entire collection and share the data with researchers across the world. With the method presented in our work, Entomologists can create an image database for their collections and use the 3D models for studying the shape and structure of the insects thus making it easier to maintain and share. Initial feedback shows that the reconstructed 3D models preserve the shape and size of the specimen. We further optimize our results to incorporate multiview stereo which produces better overall structure of the insects. Our main contribution is applying stereoscopic vision techniques to entomology to solve the problems faced by entomologists.
Sequential and Parallel Algorithms for Finding a Maximum Convex Polygon
DEFF Research Database (Denmark)
Fischer, Paul
1997-01-01
This paper investigates the problem where one is given a finite set of n points in the plane each of which is labeled either ?positive? or ?negative?. We consider bounded convex polygons, the vertices of which are positive points and which do not contain any negative point. It is shown how...... such a polygon which is maximal with respect to area can be found in time O(n³ log n). With the same running time one can also find such a polygon which contains a maximum number of positive points. If, in addition, the number of vertices of the polygon is restricted to be at most M, then the running time...... becomes O(M n³ log n). It is also shown how to find a maximum convex polygon which contains a given point in time O(n³ log n). Two parallel algorithms for the basic problem are also presented. The first one runs in time O(n log n) using O(n²) processors, the second one has polylogarithmic time but needs O...
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations
Energy Technology Data Exchange (ETDEWEB)
Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu [Department of Mechanical Engineering (United States); University of California Santa Barbara, Santa Barbara, CA, 93106 (United States); Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu [Department of Mechanical Engineering (United States); University of California Santa Barbara, Santa Barbara, CA, 93106 (United States); Department of Computer Science (United States); Department of Mathematics (United States)
2016-10-01
The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations
International Nuclear Information System (INIS)
Detrixhe, Miles; Gibou, Frédéric
2016-01-01
The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
Fast parallel ring recognition algorithm in the RICH detector of the CBM experiment at FAIR
International Nuclear Information System (INIS)
Lebedev, S.
2011-01-01
The Compressed Baryonic Matter (CBM)experiment at the future FAIR facility at Darmstadt will measure dileptons emitted from the hot and dense phase in heavy ion collisions. In case of an electron measurement, a high purity of identified electrons is required in order to suppress the background. Electron identification in CBM will be performed by a Ring Imaging Cherenkov (RICH) detector and Transition Radiation Detector (TRD). Very fast data reconstruction is extremely important for CBM because of the huge amount of data which has to be handled. In this contribution, a parallelized ring recognition algorithm is presented. Modern CPUs have two features, which enable parallel programming. First, the SSE technology allows using the SIMD execution model. Second, multicore CPUs enable the use of multithreading. Both features have been implemented in the ring reconstruction of the RICH detector. A considerable speedup factor from 357 to 2.5 ms/event has been achieved including preceding code optimization for Intel Xeon X5550 processors at 2.67 GHz
International Nuclear Information System (INIS)
Lin Lin; Chao Yang; Jiangfeng Lu; Lexing Ying; Weinan, E.
2009-01-01
We present an efficient parallel algorithm and its implementation for computing the diagonal of H -1 where H is a 2D Kohn-Sham Hamiltonian discretized on a rectangular domain using a standard second order finite difference scheme. This type of calculation can be used to obtain an accurate approximation to the diagonal of a Fermi-Dirac function of H through a recently developed pole-expansion technique LinLuYingE2009. The diagonal elements are needed in electronic structure calculations for quantum mechanical systems HohenbergKohn1964, KohnSham 1965,DreizlerGross1990. We show how elimination tree is used to organize the parallel computation and how synchronization overhead is reduced by passing data level by level along this tree using the technique of local buffers and relative indices. We analyze the performance of our implementation by examining its load balance and communication overhead. We show that our implementation exhibits an excellent weak scaling on a large-scale high performance distributed parallel machine. When compared with standard approach for evaluating the diagonal a Fermi-Dirac function of a Kohn-Sham Hamiltonian associated a 2D electron quantum dot, the new pole-expansion technique that uses our algorithm to compute the diagonal of (H-z i I) -1 for a small number of poles z i is much faster, especially when the quantum dot contains many electrons.
Energy Technology Data Exchange (ETDEWEB)
Lin, Lin; Yang, Chao; Lu, Jiangfeng; Ying, Lexing; E, Weinan
2009-09-25
We present an efficient parallel algorithm and its implementation for computing the diagonal of $H^-1$ where $H$ is a 2D Kohn-Sham Hamiltonian discretized on a rectangular domain using a standard second order finite difference scheme. This type of calculation can be used to obtain an accurate approximation to the diagonal of a Fermi-Dirac function of $H$ through a recently developed pole-expansion technique \\cite{LinLuYingE2009}. The diagonal elements are needed in electronic structure calculations for quantum mechanical systems \\citeHohenbergKohn1964, KohnSham 1965,DreizlerGross1990. We show how elimination tree is used to organize the parallel computation and how synchronization overhead is reduced by passing data level by level along this tree using the technique of local buffers and relative indices. We analyze the performance of our implementation by examining its load balance and communication overhead. We show that our implementation exhibits an excellent weak scaling on a large-scale high performance distributed parallel machine. When compared with standard approach for evaluating the diagonal a Fermi-Dirac function of a Kohn-Sham Hamiltonian associated a 2D electron quantum dot, the new pole-expansion technique that uses our algorithm to compute the diagonal of $(H-z_i I)^-1$ for a small number of poles $z_i$ is much faster, especially when the quantum dot contains many electrons.
Towards a streaming model for nested data parallelism
DEFF Research Database (Denmark)
Madsen, Frederik Meisner; Filinski, Andrzej
2013-01-01
The language-integrated cost semantics for nested data parallelism pioneered by NESL provides an intuitive, high-level model for predicting performance and scalability of parallel algorithms with reasonable accuracy. However, this predictability, obtained through a uniform, parallelism-flattening......The language-integrated cost semantics for nested data parallelism pioneered by NESL provides an intuitive, high-level model for predicting performance and scalability of parallel algorithms with reasonable accuracy. However, this predictability, obtained through a uniform, parallelism......-processable in a streaming fashion. This semantics is directly compatible with previously proposed piecewise execution models for nested data parallelism, but allows the expected space usage to be reasoned about directly at the source-language level. The language definition and implementation are still very much work...
International Nuclear Information System (INIS)
Yamamoto, Akio; Hashimoto, Hiroshi
2002-01-01
The distributed genetic algorithm (DGA) is applied for loading pattern optimization problems of the pressurized water reactors. A basic concept of DGA follows that of the conventional genetic algorithm (GA). However, DGA equally distributes candidates of solutions (i.e. loading patterns) to several independent ''islands'' and evolves them in each island. Communications between islands, i.e. migrations of some candidates between islands are performed with a certain period. Since candidates of solutions independently evolve in each island while accepting different genes of migrants, premature convergence in the conventional GA can be prevented. Because many candidate loading patterns should be evaluated in GA or DGA, the parallelization is efficient to reduce turn around time. Parallel efficiency of DGA was measured using our optimization code and good efficiency was attained even in a heterogeneous cluster environment due to dynamic distribution of the calculation load. The optimization code is based on the client/server architecture with the TCP/IP native socket and a client (optimization) module and calculation server modules communicate the objects of loading patterns each other. Throughout the sensitivity study on optimization parameters of DGA, a suitable set of the parameters for a test problem was identified. Finally, optimization capability of DGA and the conventional GA was compared in the test problem and DGA provided better optimization results than the conventional GA. (author)
Directory of Open Access Journals (Sweden)
Marco Scutari
2017-03-01
Full Text Available It is well known in the literature that the problem of learning the structure of Bayesian networks is very hard to tackle: Its computational complexity is super-exponential in the number of nodes in the worst case and polynomial in most real-world scenarios. Efficient implementations of score-based structure learning benefit from past and current research in optimization theory, which can be adapted to the task by using the network score as the objective function to maximize. This is not true for approaches based on conditional independence tests, called constraint-based learning algorithms. The only optimization in widespread use, backtracking, leverages the symmetries implied by the definitions of neighborhood and Markov blanket. In this paper we illustrate how backtracking is implemented in recent versions of the bnlearn R package, and how it degrades the stability of Bayesian network structure learning for little gain in terms of speed. As an alternative, we describe a software architecture and framework that can be used to parallelize constraint-based structure learning algorithms (also implemented in bnlearn and we demonstrate its performance using four reference networks and two real-world data sets from genetics and systems biology. We show that on modern multi-core or multiprocessor hardware parallel implementations are preferable over backtracking, which was developed when single-processor machines were the norm.
A highly parallel algorithm for track finding
International Nuclear Information System (INIS)
Dell'Orso, M.
1990-01-01
We describe a very fast algorithm for track finding, which is applicable to a whole class of detectors like drift chambers, silicon microstrip detectors, etc. The algorithm uses a pattern bank stored in a large memory and organized into a tree structure. (orig.)
Parallel Algorithms for Model Checking
van de Pol, Jaco; Mousavi, Mohammad Reza; Sgall, Jiri
2017-01-01
Model checking is an automated verification procedure, which checks that a model of a system satisfies certain properties. These properties are typically expressed in some temporal logic, like LTL and CTL. Algorithms for LTL model checking (linear time logic) are based on automata theory and graph
Faraj, Daniel A
2013-07-16
Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Parallel ray tracing for one-dimensional discrete ordinate computations
International Nuclear Information System (INIS)
Jarvis, R.D.; Nelson, P.
1996-01-01
The ray-tracing sweep in discrete-ordinates, spatially discrete numerical approximation methods applied to the linear, steady-state, plane-parallel, mono-energetic, azimuthally symmetric, neutral-particle transport equation can be reduced to a parallel prefix computation. In so doing, the often severe penalty in convergence rate of the source iteration, suffered by most current parallel algorithms using spatial domain decomposition, can be avoided while attaining parallelism in the spatial domain to whatever extent desired. In addition, the reduction implies parallel algorithm complexity limits for the ray-tracing sweep. The reduction applies to all closed, linear, one-cell functional (CLOF) spatial approximation methods, which encompasses most in current popular use. Scalability test results of an implementation of the algorithm on a 64-node nCube-2S hypercube-connected, message-passing, multi-computer are described. (author)
PLAST: parallel local alignment search tool for database comparison
Directory of Open Access Journals (Sweden)
Lavenier Dominique
2009-10-01
Full Text Available Abstract Background Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same time, the internal architecture of current microprocessors is tending towards more parallelism, leading to the use of chips with two, four and more cores integrated on the same die. The main purpose of this work was to design an effective algorithm to fit with the parallel capabilities of modern microprocessors. Results A parallel algorithm for comparing large genomic banks and targeting middle-range computers has been developed and implemented in PLAST software. The algorithm exploits two key parallel features of existing and future microprocessors: the SIMD programming model (SSE instruction set and the multithreading concept (multicore. Compared to multithreaded BLAST software, tests performed on an 8-processor server have shown speedup ranging from 3 to 6 with a similar level of accuracy. Conclusion A parallel algorithmic approach driven by the knowledge of the internal microprocessor architecture allows significant speedup to be obtained while preserving standard sensitivity for similarity search problems.
Energy Technology Data Exchange (ETDEWEB)
Pinchedez, K
1999-06-01
Parallel computing meets the ever-increasing requirements for neutronic computer code speed and accuracy. In this work, two different approaches have been considered. We first parallelized the sequential algorithm used by the neutronics code CRONOS developed at the French Atomic Energy Commission. The algorithm computes the dominant eigenvalue associated with PN simplified transport equations by a mixed finite element method. Several parallel algorithms have been developed on distributed memory machines. The performances of the parallel algorithms have been studied experimentally by implementation on a T3D Cray and theoretically by complexity models. A comparison of various parallel algorithms has confirmed the chosen implementations. We next applied a domain sub-division technique to the two-group diffusion Eigen problem. In the modal synthesis-based method, the global spectrum is determined from the partial spectra associated with sub-domains. Then the Eigen problem is expanded on a family composed, on the one hand, from eigenfunctions associated with the sub-domains and, on the other hand, from functions corresponding to the contribution from the interface between the sub-domains. For a 2-D homogeneous core, this modal method has been validated and its accuracy has been measured. (author)
Parallel Implementation of the Terrain Masking Algorithm
1994-03-01
contains behavior rules which can define a computation or an algorithm. It can communicate with other process nodes, it can contain local data, and it can...terrain maskirg calculation is being performed. It is this algorithm that comsumes about seventy percent of the total terrain masking calculation time
Advances in randomized parallel computing
Rajasekaran, Sanguthevar
1999-01-01
The technique of randomization has been employed to solve numerous prob lems of computing both sequentially and in parallel. Examples of randomized algorithms that are asymptotically better than their deterministic counterparts in solving various fundamental problems abound. Randomized algorithms have the advantages of simplicity and better performance both in theory and often in practice. This book is a collection of articles written by renowned experts in the area of randomized parallel computing. A brief introduction to randomized algorithms In the aflalysis of algorithms, at least three different measures of performance can be used: the best case, the worst case, and the average case. Often, the average case run time of an algorithm is much smaller than the worst case. 2 For instance, the worst case run time of Hoare's quicksort is O(n ), whereas its average case run time is only O( n log n). The average case analysis is conducted with an assumption on the input space. The assumption made to arrive at t...
Progress in parallel implementation of the multilevel plane wave time domain algorithm
Liu, Yang
2013-07-01
The computational complexity and memory requirements of classical schemes for evaluating transient electromagnetic fields produced by Ns dipoles active for Nt time steps scale as O(NtN s 2) and O(Ns 2), respectively. The multilevel plane wave time domain (PWTD) algorithm [A.A. Ergin et al., Antennas and Propagation Magazine, IEEE, vol. 41, pp. 39-52, 1999], viz. the extension of the frequency domain fast multipole method (FMM) to the time domain, reduces the above costs to O(NtNslog2Ns) and O(Ns α) with α = 1.5 for surface current distributions and α = 4/3 for volumetric ones. Its favorable computational and memory costs notwithstanding, serial implementations of the PWTD scheme unfortunately remain somewhat limited in scope and ill-suited to tackle complex real-world scattering problems, and parallel implementations are called for. © 2013 IEEE.
Low-communication parallel quantum multi-target preimage search
Banegas, G.S.; Bernstein, D.J.; Adams, Carlisle; Camenisch, Jan
2017-01-01
The most important pre-quantum threat to AES-128 is the 1994 van Oorschot–Wiener “parallel rho method”, a low-communication parallel pre-quantum multi-target preimage-search algorithm. This algorithm uses a mesh of p small processors, each running for approximately 2 128 /pt 2128/pt fast steps, to
He, Xiaojun; Ma, Haotong; Luo, Chuanxin
2016-10-01
The optical multi-aperture imaging system is an effective way to magnify the aperture and increase the resolution of telescope optical system, the difficulty of which lies in detecting and correcting of co-phase error. This paper presents a method based on stochastic parallel gradient decent algorithm (SPGD) to correct the co-phase error. Compared with the current method, SPGD method can avoid detecting the co-phase error. This paper analyzed the influence of piston error and tilt error on image quality based on double-aperture imaging system, introduced the basic principle of SPGD algorithm, and discuss the influence of SPGD algorithm's key parameters (the gain coefficient and the disturbance amplitude) on error control performance. The results show that SPGD can efficiently correct the co-phase error. The convergence speed of the SPGD algorithm is improved with the increase of gain coefficient and disturbance amplitude, but the stability of the algorithm reduced. The adaptive gain coefficient can solve this problem appropriately. This paper's results can provide the theoretical reference for the co-phase error correction of the multi-aperture imaging system.
Program Transformation to Identify List-Based Parallel Skeletons
Directory of Open Access Journals (Sweden)
Venkatesh Kannan
2016-07-01
Full Text Available Algorithmic skeletons are used as building-blocks to ease the task of parallel programming by abstracting the details of parallel implementation from the developer. Most existing libraries provide implementations of skeletons that are defined over flat data types such as lists or arrays. However, skeleton-based parallel programming is still very challenging as it requires intricate analysis of the underlying algorithm and often uses inefficient intermediate data structures. Further, the algorithmic structure of a given program may not match those of list-based skeletons. In this paper, we present a method to automatically transform any given program to one that is defined over a list and is more likely to contain instances of list-based skeletons. This facilitates the parallel execution of a transformed program using existing implementations of list-based parallel skeletons. Further, by using an existing transformation called distillation in conjunction with our method, we produce transformed programs that contain fewer inefficient intermediate data structures.
Quantum learning algorithms for quantum measurements
Energy Technology Data Exchange (ETDEWEB)
Bisio, Alessandro, E-mail: alessandro.bisio@unipv.it [QUIT Group, Dipartimento di Fisica ' A. Volta' and INFN, via Bassi 6, 27100 Pavia (Italy); D' Ariano, Giacomo Mauro, E-mail: dariano@unipv.it [QUIT Group, Dipartimento di Fisica ' A. Volta' and INFN, via Bassi 6, 27100 Pavia (Italy); Perinotti, Paolo, E-mail: paolo.perinotti@unipv.it [QUIT Group, Dipartimento di Fisica ' A. Volta' and INFN, via Bassi 6, 27100 Pavia (Italy); Sedlak, Michal, E-mail: michal.sedlak@unipv.it [QUIT Group, Dipartimento di Fisica ' A. Volta' and INFN, via Bassi 6, 27100 Pavia (Italy); Institute of Physics, Slovak Academy of Sciences, Dubravska cesta 9, 845 11 Bratislava (Slovakia)
2011-09-12
We study quantum learning algorithms for quantum measurements. The optimal learning algorithm is derived for arbitrary von Neumann measurements in the case of training with one or two examples. The analysis of the case of three examples reveals that, differently from the learning of unitary gates, the optimal algorithm for learning of quantum measurements cannot be parallelized, and requires quantum memories for the storage of information. -- Highlights: → Optimal learning algorithm for von Neumann measurements. → From 2 copies to 1 copy: the optimal strategy is parallel. → From 3 copies to 1 copy: the optimal strategy must be non-parallel.
Quantum learning algorithms for quantum measurements
International Nuclear Information System (INIS)
Bisio, Alessandro; D'Ariano, Giacomo Mauro; Perinotti, Paolo; Sedlak, Michal
2011-01-01
We study quantum learning algorithms for quantum measurements. The optimal learning algorithm is derived for arbitrary von Neumann measurements in the case of training with one or two examples. The analysis of the case of three examples reveals that, differently from the learning of unitary gates, the optimal algorithm for learning of quantum measurements cannot be parallelized, and requires quantum memories for the storage of information. -- Highlights: → Optimal learning algorithm for von Neumann measurements. → From 2 copies to 1 copy: the optimal strategy is parallel. → From 3 copies to 1 copy: the optimal strategy must be non-parallel.
Wang, Lui; Bayer, Steven E.
1991-01-01
Genetic algorithms are mathematical, highly parallel, adaptive search procedures (i.e., problem solving methods) based loosely on the processes of natural genetics and Darwinian survival of the fittest. Basic genetic algorithms concepts are introduced, genetic algorithm applications are introduced, and results are presented from a project to develop a software tool that will enable the widespread use of genetic algorithm technology.
Fast image processing on parallel hardware
International Nuclear Information System (INIS)
Bittner, U.
1988-01-01
Current digital imaging modalities in the medical field incorporate parallel hardware which is heavily used in the stage of image formation like the CT/MR image reconstruction or in the DSA real time subtraction. In order to image post-processing as efficient as image acquisition, new software approaches have to be found which take full advantage of the parallel hardware architecture. This paper describes the implementation of two-dimensional median filter which can serve as an example for the development of such an algorithm. The algorithm is analyzed by viewing it as a complete parallel sort of the k pixel values in the chosen window which leads to a generalization to rank order operators and other closely related filters reported in literature. A section about the theoretical base of the algorithm gives hints for how to characterize operations suitable for implementations on pipeline processors and the way to find the appropriate algorithms. Finally some results that computation time and usefulness of medial filtering in radiographic imaging are given
Where genetic algorithms excel.
Baum, E B; Boneh, D; Garrett, C
2001-01-01
We analyze the performance of a genetic algorithm (GA) we call Culling, and a variety of other algorithms, on a problem we refer to as the Additive Search Problem (ASP). We show that the problem of learning the Ising perceptron is reducible to a noisy version of ASP. Noisy ASP is the first problem we are aware of where a genetic-type algorithm bests all known competitors. We generalize ASP to k-ASP to study whether GAs will achieve "implicit parallelism" in a problem with many more schemata. GAs fail to achieve this implicit parallelism, but we describe an algorithm we call Explicitly Parallel Search that succeeds. We also compute the optimal culling point for selective breeding, which turns out to be independent of the fitness function or the population distribution. We also analyze a mean field theoretic algorithm performing similarly to Culling on many problems. These results provide insight into when and how GAs can beat competing methods.
Algorithms for parallel and vector computations
Ortega, James M.
1995-01-01
This is a final report on work performed under NASA grant NAG-1-1112-FOP during the period March, 1990 through February 1995. Four major topics are covered: (1) solution of nonlinear poisson-type equations; (2) parallel reduced system conjugate gradient method; (3) orderings for conjugate gradient preconditioners, and (4) SOR as a preconditioner.
Parallel Monte Carlo simulation of aerosol dynamics
Zhou, K.
2014-01-01
A highly efficient Monte Carlo (MC) algorithm is developed for the numerical simulation of aerosol dynamics, that is, nucleation, surface growth, and coagulation. Nucleation and surface growth are handled with deterministic means, while coagulation is simulated with a stochastic method (Marcus-Lushnikov stochastic process). Operator splitting techniques are used to synthesize the deterministic and stochastic parts in the algorithm. The algorithm is parallelized using the Message Passing Interface (MPI). The parallel computing efficiency is investigated through numerical examples. Near 60% parallel efficiency is achieved for the maximum testing case with 3.7 million MC particles running on 93 parallel computing nodes. The algorithm is verified through simulating various testing cases and comparing the simulation results with available analytical and/or other numerical solutions. Generally, it is found that only small number (hundreds or thousands) of MC particles is necessary to accurately predict the aerosol particle number density, volume fraction, and so forth, that is, low order moments of the Particle Size Distribution (PSD) function. Accurately predicting the high order moments of the PSD needs to dramatically increase the number of MC particles. 2014 Kun Zhou et al.
Ultrascalable petaflop parallel supercomputer
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY
2010-07-20
A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.
A Coupling Tool for Parallel Molecular Dynamics-Continuum Simulations
Neumann, Philipp
2012-06-01
We present a tool for coupling Molecular Dynamics and continuum solvers. It is written in C++ and is meant to support the developers of hybrid molecular - continuum simulations in terms of both realisation of the respective coupling algorithm as well as parallel execution of the hybrid simulation. We describe the implementational concept of the tool and its parallel extensions. We particularly focus on the parallel execution of particle insertions into dense molecular systems and propose a respective parallel algorithm. Our implementations are validated for serial and parallel setups in two and three dimensions. © 2012 IEEE.
Energy Technology Data Exchange (ETDEWEB)
Volkov, M V; Garanin, S G; Dolgopolov, Yu V; Kopalkin, A V; Kulikov, S M; Sinyavin, D N; Starikov, F A; Sukharev, S A; Tyutin, S V; Khokhlov, S V; Chaparin, D A [Russian Federal Nuclear Center ' All-Russian Research Institute of Experimental Physics' , Sarov, Nizhnii Novgorod region (Russian Federation)
2014-11-30
A seven-channel fibre laser system operated by the master oscillator – multichannel power amplifier scheme is the phase locked using a stochastic parallel gradient algorithm. The phase modulators on lithium niobate crystals are controlled by a multichannel electronic unit with the microcontroller processing signals in real time. The dynamic phase locking of the laser system with the bandwidth of 14 kHz is demonstrated, the time of phasing is 3 – 4 ms. (fibre and integrated-optical structures)
Parallel fuzzy connected image segmentation on GPU
Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.
2011-01-01
Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm impleme...
Parallel quantum computing in a single ensemble quantum computer
International Nuclear Information System (INIS)
Long Guilu; Xiao, L.
2004-01-01
We propose a parallel quantum computing mode for ensemble quantum computer. In this mode, some qubits are in pure states while other qubits are in mixed states. It enables a single ensemble quantum computer to perform 'single-instruction-multidata' type of parallel computation. Parallel quantum computing can provide additional speedup in Grover's algorithm and Shor's algorithm. In addition, it also makes a fuller use of qubit resources in an ensemble quantum computer. As a result, some qubits discarded in the preparation of an effective pure state in the Schulman-Varizani and the Cleve-DiVincenzo algorithms can be reutilized
International Nuclear Information System (INIS)
Baron, E.; Hauschildt, Peter H.
1998-01-01
We describe an important addition to the parallel implementation of our generalized nonlocal thermodynamic equilibrium (NLTE) stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data and task parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divided the work spatially or by spectral lines, that is, distributing the radial zones, individual spectral lines, or characteristic rays among different processors and employ, in addition, task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000 - 300,000) and hence parallelization over wavelength can lead both to considerable speedup in calculation time and the ability to make use of the aggregate memory available on massively parallel supercomputers. Here, we describe an implementation of a pipelined design for the wavelength parallelization of PHOENIX, where the necessary data from the processor working on a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known. Our implementation uses a MIMD design based on a relatively small number of standard message passing interface (MPI) library calls and is fully portable between serial and parallel computers. copyright 1998 The American Astronomical Society
Regional-scale calculation of the LS factor using parallel processing
Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong
2015-05-01
With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategy are designed according to the algorithm characters including the decomposition method for maintaining the integrity of the results, optimized workflow for reducing the time taken for exporting the unnecessary intermediate data and a buffer-communication-computation strategy for improving the communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.
Finite element electromagnetic field computation on the Sequent Symmetry 81 parallel computer
International Nuclear Information System (INIS)
Ratnajeevan, S.; Hoole, H.
1990-01-01
Finite element field analysis algorithms lend themselves to parallelization and this fact is exploited in this paper to implement a finite element analysis program for electromagnetic field computation on the Sequent Symmetry 81 parallel computer with three processors. In terms of waiting time, the maximum gains are to be made in matrix solution and therefore this paper concentrates on the gains in parallelizing the solution part of finite element analysis. An outline of how parallelization could be exploited in most finite element operations is given in this paper although the actual implemention of parallelism on the Sequent Symmetry 81 parallel computer was in sparsity computation, matrix assembly and the matrix solution areas. In all cases, the algorithms were modified suit the parallel programming application rather than allowing the compiler to parallelize on existing algorithms
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-10-23
The unbalanced assignment problem (UAP) is to optimally resolve the problem of assigning n jobs to m individuals (m applied mathematics, having numerous real life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We reasonably design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and get the solutions of the UAP in the proper length range and O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation.
Vector Green's function algorithm for radiative transfer in plane-parallel atmosphere
Energy Technology Data Exchange (ETDEWEB)
Qin Yi [School of Physics, University of New South Wales (Australia)]. E-mail: yi.qin@csiro.au; Box, Michael A. [School of Physics, University of New South Wales (Australia)
2006-01-15
Green's function is a widely used approach for boundary value problems. In problems related to radiative transfer, Green's function has been found to be useful in land, ocean and atmosphere remote sensing. It is also a key element in higher order perturbation theory. This paper presents an explicit expression of the Green's function, in terms of the source and radiation field variables, for a plane-parallel atmosphere with either vacuum boundaries or a reflecting (BRDF) surface. Full polarization state is considered but the algorithm has been developed in such way that it can be easily reduced to solve scalar radiative transfer problems, which makes it possible to implement a single set of code for computing both the scalar and the vector Green's function.