Sublinear Time Motif Discovery from Multiple Sequences
Directory of Open Access Journals (Sweden)
Yunhui Fu
2013-10-01
In this paper, a natural probabilistic model for motif discovery has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet, Σ. A motif G = g1g2 ... gm is a string of m characters. In each background sequence is implanted a probabilistically-generated approximate copy of G. For a probabilistically-generated approximate copy b1b2 ... bm of G, every character, bi, is probabilistically generated, such that the probability for bi ≠ gi is at most α. We develop two new randomized algorithms and one new deterministic algorithm. They make advancements in the following aspects: (1) The algorithms are much faster than those before. Our algorithms can even run in sublinear time. (2) They can handle any motif pattern. (3) The restriction for the alphabet size is a lower bound of four. This gives them potential applications in practical problems, since gene sequences have an alphabet size of four. (4) All algorithms have rigorous proofs about their performances. The methods developed in this paper have been used in the software implementation. We observed some encouraging results that show improved performance for motif detection compared with other software.
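The probabilistic model described in this abstract can be sketched in a few lines of Python. This is only an illustration of the input model, not the paper's discovery algorithms. The function names, the sequence length n, the seed, and the choice to substitute each character with probability exactly α (the abstract only says "at most α") are assumptions made for the example.

```python
import random

def approximate_copy(motif, alphabet, alpha, rng):
    """Return a copy of motif where each character is replaced
    by a different character with probability alpha (assumed exact)."""
    copy = []
    for g in motif:
        if rng.random() < alpha:
            copy.append(rng.choice([c for c in alphabet if c != g]))
        else:
            copy.append(g)
    return "".join(copy)

def planted_sequences(motif, k, n, alphabet="ACGT", alpha=0.1, seed=0):
    """Generate k random background sequences of length n, each with one
    approximate copy of motif implanted at a random position.
    Returns a list of (sequence, implant_position) pairs."""
    rng = random.Random(seed)
    result = []
    for _ in range(k):
        background = [rng.choice(alphabet) for _ in range(n)]
        pos = rng.randrange(n - len(motif) + 1)
        background[pos:pos + len(motif)] = approximate_copy(motif, alphabet, alpha, rng)
        result.append(("".join(background), pos))
    return result

if __name__ == "__main__":
    for seq, pos in planted_sequences("ACGTAC", k=3, n=30):
        print(pos, seq)
```

Returning the implant position alongside each sequence makes it easy to check a motif finder's output against the planted ground truth.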
Casanova, Henri; Robert, Yves
2008-01-01
"…The authors of the present book, who have extensive credentials in both research and instruction in the area of parallelism, present a sound, principled treatment of parallel algorithms. … This book is very well written and extremely well designed from an instructional point of view. … The authors have created an instructive and fascinating text. The book will serve researchers as well as instructors who need a solid, readable text for a course on parallelism in computing. Indeed, for anyone who wants an understandable text from which to acquire a current, rigorous, and broad vi
Akl, Selim G
1985-01-01
Parallel Sorting Algorithms explains how to use parallel algorithms to sort a sequence of items on a variety of parallel computers. The book reviews the sorting problem, the parallel models of computation, parallel algorithms, and the lower bounds on the parallel sorting problems. The text also presents twenty different algorithms for architectures such as linear arrays, mesh-connected computers, and cube-connected computers. Another example where the algorithms can be applied is the shared-memory SIMD (single instruction stream, multiple data stream) computers in which the whole sequence to be sorted can fit in the
Algorithmically specialized parallel computers
Snyder, Lawrence; Gannon, Dennis B
1985-01-01
Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer. This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster
Parallel Algorithms and Patterns
Energy Technology Data Exchange (ETDEWEB)
Robey, Robert W. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2016-06-16
This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.
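Two of the patterns named in this abstract, reductions and prefix scans, can be sketched sequentially in Python. The functions below simulate the parallel structure only: every update inside one round is independent and could run concurrently. The function names and the choice of the Hillis-Steele formulation for the scan are assumptions for illustration, not taken from the presentation.

```python
def tree_reduce(values):
    """Pairwise (tree) reduction: each round halves the number of
    partial sums, so n values need about log2(n) rounds."""
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    level = list(values)
    while len(level) > 1:
        # All pair sums in this round are independent of each other.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
    return level[0]

def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum: in the round with offset
    `step`, element i adds the value that sat `step` places to its left."""
    result = list(values)
    step = 1
    while step < len(result):
        prev = list(result)  # snapshot, as if all reads happen before writes
        for i in range(step, len(result)):
            result[i] = prev[i] + prev[i - step]
        step *= 2
    return result

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(tree_reduce(data))     # total of all elements
    print(inclusive_scan(data))  # running totals
```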
Parallel Genetic Algorithm System
Nagaraju Sangepu; Vikram, K.
2010-01-01
Genetic Algorithm (GA) is a popular technique to find the optimum of transformation, because of its simple implementation procedure. In image processing, GAs are used as a parameter-search procedure; this processing requires very high computer performance. Recently, parallel processing has been used to reduce the time by distributing the appropriate amount of work to each computer in the clustering system. The processing time decreases with the number of dedicated computers. Parallel implement...
A Parallel Butterfly Algorithm
Poulson, Jack
2014-02-04
The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform (Equation Presented.) at large numbers of target points when the kernel, K(x, y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(N^d) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r^2 N^d log N). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of α and per-process inverse bandwidth of β, executes in at most (Equation Presented.) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x, y) = exp(iΦ(x, y)), where Φ(x, y) is a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms and an analogue of a three-dimensional generalized Radon transform were observed to strong-scale from 1-node/16-cores up to 1024-nodes/16,384-cores with greater than 90% and 82% efficiency, respectively. © 2014 Society for Industrial and Applied Mathematics.
Parallel Architectures and Bioinspired Algorithms
Pérez, José; Lanchares, Juan
2012-01-01
This monograph presents examples of best practices when combining bioinspired algorithms with parallel architectures. The book includes recent work by leading researchers in the field and offers a map with the main paths already explored and new ways towards the future. Parallel Architectures and Bioinspired Algorithms will be of value to both specialists in Bioinspired Algorithms, Parallel and Distributed Computing, as well as computer science students trying to understand the present and the future of Parallel Architectures and Bioinspired Algorithms.
Numerical Algorithms for Parallel Computers
1989-08-31
Numerical Algorithms for Parallel Computers. Personal author: Loyce M. Adams, Department of...Conference on Applied Linear Algebra, Loyce Adams presented minisymposium talk "Preconditioners on Parallel Computers", Madison, WI, May 1988. Third
PARALLEL ALGORITHM FOR BAYESIAN NETWORK STRUCTURE LEARNING
Directory of Open Access Journals (Sweden)
S. A. Arustamov
2013-03-01
The article deals with the implementation of a scalable parallel algorithm for structure learning of Bayesian networks. A comparative analysis of the sequential and parallel algorithms is performed.
Parallel External Memory Graph Algorithms
DEFF Research Database (Denmark)
Arge, Lars Allan; Goodrich, Michael T.; Sitchinava, Nodari
2010-01-01
In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest...... common ancestors, tree contraction and expression tree evaluation. We also study the problems of computing the connected and biconnected components of a graph, minimum spanning tree of a connected graph and ear decomposition of a biconnected graph. All our solutions on a P-processor PEM model provide...
Parallel algorithms and cluster computing
Hoffmann, Karl Heinz
2007-01-01
This book presents major advances in high performance computing as well as major advances due to high performance computing. It contains a collection of papers in which results achieved in the collaboration of scientists from computer science, mathematics, physics, and mechanical engineering are presented. From the science problems to the mathematical algorithms and on to the effective implementation of these algorithms on massively parallel and cluster computers we present state-of-the-art methods and technology as well as exemplary results in these fields. This book shows that problems which seem superficially distinct become intimately connected on a computational level.
Experiments with parallel algorithms for combinatorial problems
G.A.P. Kindervater (Gerard); H.W.J.M. Trienekens
1985-01-01
In the last decade many models for parallel computation have been proposed and many parallel algorithms have been developed. However, few of these models have been realized and most of these algorithms are supposed to run on idealized, unrealistic parallel machines. The parallel machines
Parallel computers and parallel algorithms for CFD: An introduction
Roose, Dirk; Vandriessche, Rafael
1995-10-01
This text presents a tutorial on those aspects of parallel computing that are important for the development of efficient parallel algorithms and software for computational fluid dynamics. We first review the main architectural features of parallel computers and we briefly describe some parallel systems on the market today. We introduce some important concepts concerning the development and the performance evaluation of parallel algorithms. We discuss how work load imbalance and communication costs on distributed memory parallel computers can be minimized. We present performance results for some CFD test cases. We focus on applications using structured and block structured grids, but the concepts and techniques are also valid for unstructured grids.
Parallel algorithms for unconstrained optimizations by multisplitting
Energy Technology Data Exchange (ETDEWEB)
He, Qing [Arizona State Univ., Tempe, AZ (United States)
1994-12-31
In this paper a new parallel iterative algorithm for unconstrained optimization using the idea of multisplitting is proposed. This algorithm uses the existing sequential algorithms without any parallelization. Some convergence and numerical results for this algorithm are presented. The experiments are performed on an Intel iPSC/860 Hyper Cube with 64 nodes. It is interesting that the sequential implementation on one node shows that if the problem is split properly, the algorithm converges much faster than one without splitting.
Parallelization of TMVA Machine Learning Algorithms
Hajili, Mammad
2017-01-01
This report reflects my work on the parallelization of TMVA machine learning algorithms integrated into the ROOT Data Analysis Framework during a summer internship at CERN. The report consists of four important parts: the data set used in training and validation, the algorithms to which multiprocessing was applied, the parallelization techniques, and the results showing how execution time changes with the number of workers.
Parallel Algorithms for the Exascale Era
Energy Technology Data Exchange (ETDEWEB)
Robey, Robert W. [Los Alamos National Laboratory
2016-10-19
New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this work has been done by undergraduates and published in leading scientific journals.
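The reproducibility issue for global sums mentioned in this abstract comes from floating-point addition not being associative: a parallel reduction that combines values in a different order can return a different result. The sketch below shows the effect in Python and one possible remedy, the standard library's exactly rounded `math.fsum`. It does not reproduce the specific techniques from the LANL work; the input values are chosen purely for illustration.

```python
import math

def naive_sum(values):
    """Left-to-right floating-point accumulation, as a simple loop
    or one particular reduction order would do it."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same four numbers in two different orders.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1.0, 1.0, 1e16, -1e16]

if __name__ == "__main__":
    # The naive results differ because 1e16 + 1.0 is rounded back to 1e16.
    print(naive_sum(order_a), naive_sum(order_b))
    # math.fsum tracks the lost low-order bits, so the order does not matter.
    print(math.fsum(order_a), math.fsum(order_b))
```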
On mesh rezoning algorithms for parallel platforms
Energy Technology Data Exchange (ETDEWEB)
Plaskacz, E.J.
1995-07-01
A mesh rezoning algorithm for finite element simulations in a parallel-distributed environment is described. The cornerstones of the algorithm are: the parallel computation of distortion norms on the element and subdomain level, the exchange of the individual subdomain norms to form a subdomain distortion vector, the classification of subdomains and the rezoning behavior prescribed within each subdomain as a response to its own classification and the classification of neighboring subdomains.
Parallel algorithms for numerical linear algebra
van der Vorst, H
1990-01-01
This is the first in a new series of books presenting research results and developments concerning the theory and applications of parallel computers, including vector, pipeline, array, fifth/future generation computers, and neural computers. All aspects of high-speed computing fall within the scope of the series, e.g. algorithm design, applications, software engineering, networking, taxonomy, models and architectural trends, performance, peripheral devices. Papers in Volume One cover the main streams of parallel linear algebra: systolic array algorithms, message-passing systems, algorithms for p
Parallel Computing Strategies for Irregular Algorithms
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
FRPA: A Framework for Recursive Parallel Algorithms
2015-05-01
over naïve scheduling. Our Framework for Recursive Parallel Algorithms (FRPA) allows for the separation of an algorithm's implementation from its...parallelism, and Cilk Plus handles task scheduling and load balancing. Additionally, FRPA uses OpenTuner [7] to tune the BFS/DFS interleaving and all...system for dataflow task programming on heterogeneous architectures, supports multi-CPU and multi-GPU architectures. This framework consists of a
Improvement of Parallel Algorithm for MATRA Code
Energy Technology Data Exchange (ETDEWEB)
Kim, Seong-Jin; Seo, Kyong-Won; Kwon, Hyouk; Hwang, Dae-Hyun [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)
2014-10-15
A feasibility study on parallelizing the MATRA code was conducted at KAERI early this year. As a result, a parallel algorithm for the MATRA code has been developed to considerably reduce the computing time required to solve a large problem, such as a whole-core pin-by-pin problem of a general PWR reactor, and to improve the overall performance of multi-physics coupling calculations. It was shown that the performance of the MATRA code was greatly improved by implementing the parallel algorithm using MPI communication. For 1/8-core and whole-core problems for the SMART reactor, a speedup of about 10 was obtained when 25 processors were used. However, it was also shown that the performance deteriorated as the number of axial nodes increased. In this paper, the communication procedure between processors is optimized to improve the previous parallel algorithm. To address the performance deterioration of the parallelized MATRA code, a new communication algorithm between processors is presented. It was shown that the speedup was improved and stable regardless of the number of axial nodes.
Parallel Clustering Algorithms for Structured AMR
Energy Technology Data Exchange (ETDEWEB)
Gunney, B T; Wissink, A M; Hysom, D A
2005-10-26
We compare several different parallel implementation approaches for the clustering operations performed during adaptive gridding operations in patch-based structured adaptive mesh refinement (SAMR) applications. Specifically, we target the clustering algorithm of Berger and Rigoutsos (BR91), which is commonly used in many SAMR applications. The baseline for comparison is a simplistic parallel extension of the original algorithm that works well for up to O(10^2) processors. Our goal is a clustering algorithm for machines of up to O(10^5) processors, such as the 64K-processor IBM BlueGene/Light system. We first present an algorithm that avoids the unneeded communications of the simplistic approach to improve the clustering speed by up to an order of magnitude. We then present a new task-parallel implementation to further reduce communication wait time, adding another order of magnitude of improvement. The new algorithms also exhibit more favorable scaling behavior for our test problems. Performance is evaluated on a number of large scale parallel computer systems, including a 16K-processor BlueGene/Light system.
Efficient Parallel Algorithms for Unsteady Incompressible Flows
Guermond, Jean-Luc
2013-01-01
The objective of this paper is to give an overview of recent developments on splitting schemes for solving the time-dependent incompressible Navier–Stokes equations and to discuss possible extensions to the variable density/viscosity case. Particular attention is given to algorithms that can be implemented efficiently on large parallel clusters.
Parallel Tree Projection Algorithm for Sequence Mining
2001-03-29
[HPMA+00] was developed by extending the tree-projection algorithm [AAP00]. Even though sequential association rule discovery algorithms based on tree...Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Eng. (accepted for publication), 1999. [HPMA+00] J
Parallel Implementation of the Katsevich's FBP Algorithm
Directory of Open Access Journals (Sweden)
2006-01-01
For spiral cone-beam CT, parallel computing is an effective approach to resolving the problem of heavy computation burden. It is well known that the major computation time is spent in the backprojection step for either filtered-backprojection (FBP) or backprojected-filtration (BPF) algorithms. By the cone-beam cover method [1], the backprojection procedure is driven by cone-beam projections, and every cone-beam projection can be backprojected independently. Based on this fact, we develop a parallel implementation of Katsevich's FBP algorithm. We do all the numerical experiments on a Linux cluster. In one typical experiment, the sequential reconstruction time is 781.3 seconds, while the parallel reconstruction time is 25.7 seconds with 32 processors.
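The timings reported in this abstract can be turned into the usual speedup and parallel-efficiency figures. The short Python calculation below uses only the three numbers quoted above; the function names are illustrative.

```python
def speedup(t_sequential, t_parallel):
    """Speedup S = T_sequential / T_parallel."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, processors):
    """Parallel efficiency E = S / p."""
    return speedup(t_sequential, t_parallel) / processors

if __name__ == "__main__":
    # Values quoted in the abstract: 781.3 s sequential, 25.7 s on 32 processors.
    s = speedup(781.3, 25.7)
    e = efficiency(781.3, 25.7, 32)
    print(f"speedup = {s:.1f}, efficiency = {e:.0%}")
```

With these inputs the speedup is roughly 30 on 32 processors, which corresponds to a parallel efficiency of about 95%.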
Adapting algorithms to massively parallel hardware
Sioulas, Panagiotis
2016-01-01
In recent years, the trend in computing has shifted from delivering processors with faster clock speeds to increasing the number of cores per processor. This marks a paradigm shift towards parallel programming in which applications are programmed to exploit the power provided by multi-cores. Usually there is a gain in terms of the time-to-solution and the memory footprint. Specifically, this trend has sparked an interest towards massively parallel systems that can provide a large number of processors, and possibly computing nodes, as in the GPUs and MPPAs (Massively Parallel Processor Arrays). In this project, the focus was on two distinct computing problems: k-d tree searches and track seeding cellular automata. The goal was to adapt the algorithms to parallel systems and evaluate their performance in different cases.
Optical flow optimization using parallel genetic algorithm
Zavala-Romero, Olmo; Botella, Guillermo; Meyer-Bäse, Anke; Meyer Base, Uwe
2011-06-01
A new approach to optimize the parameters of a gradient-based optical flow model using a parallel genetic algorithm (GA) is proposed. The main characteristics of the optical flow algorithm are its bio-inspiration and robustness against contrast, static patterns and noise, besides working consistently with several optical illusions where other algorithms fail. This model depends on many parameters, including the number of channels, the orientations required, and the length and shape of the kernel functions used in the convolution stage, among many more. The GA is used to find a set of parameters which improve the accuracy of the optical flow on inputs where the ground-truth data is available. This set of parameters helps to understand which of them are better suited for each type of input and can be used to estimate the parameters of the optical flow algorithm when used with videos that share similar characteristics. The proposed implementation takes into account the embarrassingly parallel nature of the GA and uses the OpenMP Application Programming Interface (API) to speed up the process of estimating an optimal set of parameters. The information obtained in this work can be used to dynamically reconfigure systems, with potential applications in robotics, medical imaging and tracking.
A parallel algorithm for random searches
Wosniack, M. E.; Raposo, E. P.; Viswanathan, G. M.; da Luz, M. G. E.
2015-11-01
We discuss a parallelization procedure for a two-dimensional random search of a single individual, a typical sequential process. To assure the same features of the sequential random search in the parallel version, we analyze the former spatial patterns of the encountered targets for different search strategies and densities of homogeneously distributed targets. We identify a lognormal tendency for the distribution of distances between consecutively detected targets. Then, by assigning the distinct mean and standard deviation of this distribution for each corresponding configuration in the parallel simulations (constituted by parallel random walkers), we are able to recover important statistical properties, e.g., the target detection efficiency, of the original problem. The proposed parallel approach presents a speedup of nearly one order of magnitude compared with the sequential implementation. This algorithm can be easily adapted to different instances, such as searches in three dimensions. Its possible range of applicability covers problems in areas as diverse as automated computer searchers in high-capacity databases and animal foraging.
Predicting mining activity with parallel genetic algorithms
Talaie, S.; Leigh, R.; Louis, S.J.; Raines, G.L.; Beyer, H.G.; O'Reilly, U.M.; Banzhaf, Arnold D.; Blum, W.; Bonabeau, C.; Cantu-Paz, E.W.
2005-01-01
We explore several different techniques in our quest to improve the overall model performance of a genetic algorithm calibrated probabilistic cellular automata. We use the Kappa statistic to measure correlation between ground truth data and data predicted by the model. Within the genetic algorithm, we introduce a new evaluation function sensitive to spatial correctness and we explore the idea of evolving different rule parameters for different subregions of the land. We reduce the time required to run a simulation from 6 hours to 10 minutes by parallelizing the code and employing a 10-node cluster. Our empirical results suggest that using the spatially sensitive evaluation function does indeed improve the performance of the model and our preliminary results also show that evolving different rule parameters for different regions tends to improve overall model performance. Copyright 2005 ACM.
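The abstract does not say which variant of the Kappa statistic is used. Assuming it is Cohen's kappa, the agreement between ground-truth labels and model predictions can be computed as in the Python sketch below. The function name and the toy label vectors are hypothetical.

```python
from collections import Counter

def cohen_kappa(truth, predicted):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from the
    marginal label frequencies."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    p_o = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    p_e = sum(truth_counts[label] * pred_counts[label]
              for label in truth_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both inputs use a single identical label
    return (p_o - p_e) / (1.0 - p_e)

if __name__ == "__main__":
    # Toy example: 1 = mined cell, 0 = unmined cell (hypothetical data).
    truth = [1, 1, 0, 0]
    predicted = [1, 0, 0, 0]
    print(cohen_kappa(truth, predicted))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance.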
Arkin, Ethem; Tekinerdogan, Bedir
2016-01-01
Mapping parallel algorithms to parallel computing platforms requires several activities such as the analysis of the parallel algorithm, the definition of the logical configuration of the platform, the mapping of the algorithm to the logical configuration platform and the implementation of the
Introduction to parallel algorithms and architectures arrays, trees, hypercubes
Leighton, F Thomson
1991-01-01
Introduction to Parallel Algorithms and Architectures: Arrays Trees Hypercubes provides an introduction to the expanding field of parallel algorithms and architectures. This book focuses on parallel computation involving the most popular network architectures, namely, arrays, trees, hypercubes, and some closely related networks. Organized into three chapters, this book begins with an overview of the simplest architectures of arrays and trees. This text then presents the structures and relationships between the dominant network architectures, as well as the most efficient parallel algorithms for
Online Algorithms for Parallel Job Scheduling and Strip Packing
Hurink, Johann L.; Paulus, J.J.
We consider the online scheduling problem of parallel jobs on parallel machines, $P|online{−}list,m_j |C_{max}$. For this problem we present a 6.6623-competitive algorithm. This improves the best known 7-competitive algorithm for this problem. The presented algorithm also applies to the problem
Fast parallel algorithm for slicing STL based on pipeline
Ma, Xulong; Lin, Feng; Yao, Bo
2016-05-01
In the additive manufacturing field, current research on data processing mainly focuses on the slicing of large STL files or complicated CAD models. A parallel algorithm has great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of thread count and layer count are investigated in a series of experiments. The experimental results show that the number of threads and the number of layers are two significant factors in the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. Another parallel algorithm based on data parallelism is used in the experiments to show that the pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of multi-core CPU hardware and accelerates the slicing process; compared with the data-parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
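For readers unfamiliar with the two laws this abstract refers to, the following Python sketch evaluates both formulas. The parallel fraction of 0.9 is a made-up value for illustration and is not taken from the paper.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law (fixed problem size):
    S = 1 / ((1 - f) + f / p)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's law (problem size grows with the worker count):
    S = (1 - f) + f * p."""
    return (1.0 - parallel_fraction) + parallel_fraction * workers

if __name__ == "__main__":
    f = 0.9  # hypothetical parallel fraction, chosen only for illustration
    for p in (1, 2, 4, 8, 16):
        print(p, round(amdahl_speedup(f, p), 2), round(gustafson_speedup(f, p), 2))
```

Amdahl's curve flattens toward 1 / (1 - f) as workers are added, while Gustafson's grows linearly. This matches the abstract pairing the thread-count trend with Amdahl's law and the layer-count (problem size) trend with Gustafson's law.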
Arkin, Ethem; Tekinerdogan, Bedir; Imre, Kayhan M.
2017-01-01
The need for high-performance computing together with the increasing trend from single processor to parallel computer architectures has leveraged the adoption of parallel computing. To benefit from parallel computing power, usually parallel algorithms are defined that can be mapped and executed
Comparative efficiencies of three parallel algorithms for nonlinear ...
Indian Academy of Sciences (India)
The work reported in this paper is motivated by the need to develop portable parallel processing algorithms and codes which can run on a variety of hardware platforms without any modifications. The prime aim of the research work reported here is to test the portability of the parallel algorithms and also to study and ...
Parallel optimization algorithms and their implementation in VLSI design
Lee, G.; Feeley, J. J.
1991-01-01
Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.
Optimum Quantization and Parallel Algorithms for Nonlinear State Estimation,
estimation problem can be reformulated so as to use parallel computers effectively to approximate the optimal state estimate. The problem of...simple hardware. The parallel algorithms described in the paper are suitable for a large class of parallel computers. (Author)
A Parallel Algorithm for the Vehicle Routing Problem
Energy Technology Data Exchange (ETDEWEB)
Groer, Christopher S [ORNL; Golden, Bruce [University of Maryland; Edward, Wasil [American University
2011-01-01
The vehicle routing problem (VRP) is a difficult and well-studied combinatorial optimization problem. We develop a parallel algorithm for the VRP that combines a heuristic local search improvement procedure with integer programming. We run our parallel algorithm with as many as 129 processors and are able to quickly find high-quality solutions to standard benchmark problems. We assess the impact of parallelism by analyzing our procedure's performance under a number of different scenarios.
An efficient parallel algorithm for matrix-vector multiplication
Energy Technology Data Exchange (ETDEWEB)
Hendrickson, B.; Leland, R.; Plimpton, S.
1993-03-01
The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is $O(n/\sqrt{p} + \log p)$, independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
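The abstract does not describe the data layout used. As a hedged illustration only, the Python sketch below simulates one common layout that leads to an n/√p term: the matrix is split into a √p × √p grid of blocks, so each (hypothetical) processor needs just n/√p entries of x and contributes n/√p partial sums. The function name and the sequential simulation are assumptions; this is not the authors' hypercube algorithm.

```python
import math

def block_matvec(A, x, p):
    """Compute y = A x by visiting a sqrt(p) x sqrt(p) grid of blocks,
    one block per simulated processor."""
    n = len(A)
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("p must be a perfect square whose root divides n")
    b = n // q  # block edge length, n / sqrt(p)
    y = [0.0] * n
    for bi in range(q):        # block row
        for bj in range(q):    # block column
            # Simulated processor (bi, bj) only reads this slice of x.
            x_local = x[bj * b:(bj + 1) * b]
            for r in range(b):
                row = A[bi * b + r][bj * b:(bj + 1) * b]
                partial = sum(a * v for a, v in zip(row, x_local))
                # Stands in for the sum across the processor row.
                y[bi * b + r] += partial
    return y

if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(block_matvec(A, [1, 1, 1, 1], p=4))
```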
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations
Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw
2005-01-01
A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel efficiency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
Fast Parallel Algorithms for Graphs and Networks
1987-12-01
losing the nth game of badminton to him. Valerie King and Joel Friedman showed me the wonders of cross-country skiing in Yosemite. Steven Rudich was...2), both W(u) and L(v) have no more than 7s/8 vertices. Let x be some vertex. We can describe the history of x throughout the algorithm by a zero
A Globally Convergent Parallel SSLE Algorithm for Inequality Constrained Optimization
Directory of Open Access Journals (Sweden)
Zhijun Luo
2014-01-01
Full Text Available A new parallel variable distribution algorithm based on the interior point SSLE algorithm is proposed for solving inequality constrained optimization problems, under the condition that the constraints are block-separable, using the sequential systems of linear equations (SSLE) technique. Each iteration of this algorithm only needs to solve three systems of linear equations with the same coefficient matrix to obtain the descent direction. Furthermore, under certain conditions, global convergence is achieved.
Execution of VHDL Models Using Parallel Discrete Event Simulation Algorithms
Ashenden, Peter J.; Henry Detmold; McKeen, Wayne S.
1994-01-01
In this paper, we discuss the use of parallel discrete event simulation (PDES) algorithms for execution of hardware models written in VHDL. We survey central event queue, conservative distributed and optimistic distributed PDES algorithms, and discuss aspects of the semantics of VHDL and VHDL-92 that affect the use of these algorithms in a VHDL simulator. Next, we describe an experiment performed as part of the Vsim Project at the University of Adelaide, in which a simulation kernel using the...
Totally parallel multilevel algorithms for sparse elliptic systems
Frederickson, Paul O.
1989-01-01
The fastest known algorithms for the solution of a large elliptic boundary value problem on a massively parallel hypercube all require O(log(n)) floating point operations and O(log(n)) distance-1 communications, if massively parallel is defined to mean a number of processors proportional to the size n of the problem. The Totally Parallel Multilevel Algorithm (TPMA), which has four of these fast algorithms as special cases, is described. These four algorithms are Parallel Superconvergent Multigrid (PSMG), Robust Multigrid, the Fast Fourier Transformation (FFT) based Spectral Algorithm, and Parallel Cyclic Reduction. The algorithm TPMA, when described recursively, has four steps: (1) project to a collection of interlaced, coarser problems at the next lower level; (2) apply TPMA, recursively, to each of these lower level problems, solving directly at the lowest level; (3) interpolate these approximate solutions to the finer grid, and average them to form an approximate solution on this grid; and (4) refine this approximate solution with a defect-correction step, using a local approximate inverse. Choice of the projection operator (P), the interpolation operator (Q), and the smoother (S) determines the class of problems on which TPMA is most effective. There are special cases in which the first three steps produce an exact solution, and the smoother is not needed (e.g., constant coefficient operators).
AN ALGORITHM FOR PARALLEL SN SWEEPS ON UNSTRUCTURED MESHES
Energy Technology Data Exchange (ETDEWEB)
S. D. PAUTZ
2000-12-01
We develop a new algorithm for performing parallel S_n sweeps on unstructured meshes. The algorithm uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned mesh. For typical problems and with "normal" mesh partitionings we have observed nearly linear speedups on up to 126 processors. This is an important and desirable result, since although analyses of structured meshes indicate that parallel sweeps will not scale with normal partitioning approaches, we do not observe any severe asymptotic degradation in the parallel efficiency with modest (≤100) levels of parallelism. This work is a fundamental step in the development of parallel S_n methods.
Fundamental Parallel Algorithms for Private-Cache Chip Multiprocessors
DEFF Research Database (Denmark)
Arge, Lars Allan; Goodrich, Michael T.; Nelson, Michael
2008-01-01
about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present...... two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks....... In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs....
Parallel-vector algorithms for analysis of large structures
Soegiarso, R.; Adeli, H.
1995-01-01
In analysis of large space structures, the major computational steps are evaluation and assembly of the structure stiffness matrix and solution of the resulting simultaneous linear equations. In this paper we present efficient parallel-vector algorithms for these steps of structural analysis. The goal is to optimize the performance of the algorithms through judicious combination of parallel processing and vectorization. Parallel-vector algorithms are presented for solution of linear simultaneous equations using Cholesky decomposition and preconditioned conjugate gradient approaches. The algorithms are applied to three large space structures modeling the exterior envelope of high-rise and super high-rise building structures in the range of 50-162 stories with up to 6,136 members. Performance results are presented in terms of central-processing-unit time on a Cray Y-MP8/864 supercomputer, MFLOPS (millions of floating point operations per second), and speedup.
Parallel conjugate gradient algorithms for manipulator dynamic simulation
Fijany, Amir; Scheld, Robert E.
1989-01-01
Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n^2) on a serial processor. Conjugate gradient algorithms are presented that provide greater efficiency by using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log_2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log_2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).
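The diagonally preconditioned variant mentioned in this abstract can be illustrated with a small serial sketch. It assumes a dense symmetric positive-definite matrix stored as nested lists; the function name `pcg_diagonal` and the 3 x 3 test system are made up for illustration and are not the manipulator mass matrix or the authors' parallel formulation.

```python
# Preconditioned conjugate gradient with a diagonal (Jacobi) preconditioner.
# Serial sketch; the test matrix below is invented for illustration.

def pcg_diagonal(A, b, tol=1e-10, max_iter=None):
    n = len(b)
    max_iter = max_iter or n  # at most n steps in exact arithmetic

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]  # M^{-1} for M = diag(A)

    x = [0.0] * n
    r = b[:]                                      # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = z[:]
    rz = dot(r, z)
    iterations = 0
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
        iterations += 1
    return x, iterations


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iterations = pcg_diagonal(A, b)
residual = [bi - sum(A[i][j] * x[j] for j in range(3)) for i, bi in enumerate(b)]
```

Applying the diagonal preconditioner is an elementwise scaling, which is one reason it maps easily onto n processors; the inner products and the matrix-vector product are the steps that need reductions.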
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Energy Technology Data Exchange (ETDEWEB)
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving's meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia's "tiling" dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
A Task-parallel Clustering Algorithm for Structured AMR
Energy Technology Data Exchange (ETDEWEB)
Gunney, B N; Wissink, A M
2004-11-02
A new parallel algorithm, based on the Berger-Rigoutsos algorithm for clustering grid points into logically rectangular regions, is presented. The clustering operation is frequently performed in the dynamic gridding steps of structured adaptive mesh refinement (SAMR) calculations. A previous study revealed that although the cost of clustering is generally insignificant for smaller problems run on relatively few processors, the algorithm scaled inefficiently in parallel and its cost grows with problem size. Hence, it can become significant for large scale problems run on very large parallel machines, such as the new BlueGene system (which has O(10^4) processors). We propose a new task-parallel algorithm designed to reduce communication wait times. Performance was assessed using dynamic SAMR re-gridding operations on up to 16K processors of currently available computers at Lawrence Livermore National Laboratory. The new algorithm was shown to be up to an order of magnitude faster than the baseline algorithm and had better scaling trends.
Interactive animation of fault-tolerant parallel algorithms
Energy Technology Data Exchange (ETDEWEB)
Apgar, S.W.
1992-02-01
Animation of algorithms makes understanding them intuitively easier. This paper describes the software tool Raft (Robust Animator of Fault Tolerant Algorithms). The Raft system allows the user to animate a number of parallel algorithms which achieve fault tolerant execution. In particular, we use it to illustrate the key Write-All problem. It has an extensive user-interface which allows a choice of the number of processors, the number of elements in the Write-All array, and the adversary to control the processor failures. The novelty of the system is that the interface allows the user to create new on-line adversaries as the algorithm executes.
Parallel Alpha-Beta Algorithm on the GPU
Strnad, Damjan; Guid, Nikola
2011-01-01
In the paper we present the parallel implementation of the alpha-beta algorithm running on the graphics processing unit (GPU). We compare the speed of the parallel player with the standard serial one using the game of reversi with boards of different sizes. We show that for small boards the level of available parallelism is insufficient for efficient GPU utilization, but for larger boards substantial speed-ups can be achieved on the GPU. The results indicate that the GPU-based alpha-beta impl...
Parallel AFSA algorithm accelerating based on MIC architecture
Zhou, Junhao; Xiao, Hong; Huang, Yifan; Li, Yongzhao; Xu, Yuanrui
2017-05-01
An analysis of earlier uses of AFSA for solving the traveling salesman problem (TSP) shows that algorithm efficiency is often a major problem and that the processing method does not fully exploit the characteristics of the TSP; an improved parallel AFSA process is therefore proposed. Simulation results were analyzed against the currently known optimal TSP solutions. They show that the improved AFSA needs fewer iterations and that operating efficiency doubled on the MIC cards, a significant improvement.
Applications of Parallelism to Current Algorithms for Intelligence Analysis
1987-07-10
ALGORITHMS FOR INTELLIGENCE ANALYSIS Martha Ann Griesel * J. Steven Hughes Beth R. Moore 10 July 1987 National Aeronautics and Space Administration...INTELLIGENCE CENTER AND SCHOOL Software Analysis and Management System Applications of Parallelism to Current Algorithms for Intelligence Analysis 10 July 1987...rates and a growing diversity of IEW information sources create a void in current intelligence analysis methods. New ways are needed to quickly
Mighell, Kenneth John
2011-11-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called CRBLASTER, which does cosmic-ray rejection of CCD (charge-coupled device) images using the embarrassingly-parallel L.A.COSMIC algorithm. CRBLASTER is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly-parallel algorithms. The CRBLASTER source code is freely available at the official application website at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800x800 pixel Hubble Space Telescope WFPC2 image takes 44 seconds with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors. CRBLASTER is 7.4 times faster processing the same image on a single core on the same machine. Processing the same image with CRBLASTER simultaneously on all 8 cores of the same machine takes 0.875 seconds - which is a speedup factor of 50.3 times faster than the IRAF script. A detailed analysis is presented of the performance of CRBLASTER using between 1 and 57 processors on a low-power Tilera 700-MHz 64-core TILE64 processor.
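The timing figures quoted in this abstract can be checked, and a parallel efficiency derived from them, with a few lines of arithmetic. Only the three input numbers come from the abstract; the single-core CRBLASTER time, the 8-core speedup relative to it, and the efficiency are derived here and are not stated in the source.

```python
# Figures quoted in the abstract above.
iraf_seconds = 44.0          # IRAF script lacos_im.cl, one core
single_core_factor = 7.4     # CRBLASTER vs. the IRAF script on one core
eight_core_seconds = 0.875   # CRBLASTER on all 8 cores

# Reproduces the quoted factor of about 50.3.
speedup_vs_iraf = iraf_seconds / eight_core_seconds

# Derived quantities (not stated in the abstract).
crblaster_one_core = iraf_seconds / single_core_factor       # about 5.95 s
parallel_speedup = crblaster_one_core / eight_core_seconds   # about 6.8
parallel_efficiency = parallel_speedup / 8                   # about 0.85
```

Separating the two factors shows that most of the 50-fold gain over the IRAF script comes from the faster single-core code multiplied by a roughly 6.8-fold gain from using eight cores.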
A survey of checkpointing algorithms for parallel and distributed ...
Indian Academy of Sciences (India)
A survey of checkpointing algorithms for parallel and distributed computers. S KALAISELVI and V RAJARAMANa. Supercomputer Education and Research Centre (SERC), Indian Institute of. Science, Bangalore 560 012, India. aAlso at Jawaharlal Nehru Centre for Advanced Scientific Research, Indian. Institute of Science ...
Multiscale Architectures and Parallel Algorithms for Video Object Tracking
2011-10-01
Black River Systems. This may have inadvertently introduced bugs that were later discovered by AFRL during testing (of the June 22, 2011 version of...Parallelism in Algorithms and Architectures, pages 289–298, 2007. [3] S. Ali and M. Shah. COCOA - Tracking in aerial imagery. In Daniel J. Henry
A survey of checkpointing algorithms for parallel and distributed ...
Indian Academy of Sciences (India)
They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this ...
A VLSI design concept for parallel iterative algorithms
Directory of Open Access Journals (Sweden)
C. C. Sun
2009-05-01
Full Text Available Modern VLSI manufacturing technology has kept shrinking down to the nanoscale level at a very fast pace. Integration with advanced nanotechnology now makes it possible to directly realize advanced parallel iterative algorithms, which was almost impossible 10 years ago. In this paper, we discuss the influence of evolving VLSI technologies on iterative algorithms and present design strategies from an algorithmic and architectural point of view. When implementing an iterative algorithm on a multiprocessor array, there is a trade-off between the performance/complexity of processors and the load/throughput of interconnects. This is due to the behavior of iterative algorithms. For example, we could simplify the parallel implementation of the iterative algorithm (i.e., the processor elements of the multiprocessor array) in any way as long as convergence is guaranteed. However, the modification of the algorithm (processors) usually increases the number of required iterations, which also means that the switching activity of the interconnects increases. As an example we show that a 25×25 full Jacobi EVD array can be realized in one single FPGA device with the simplified μ-rotation CORDIC architecture.
Technical Report: Scalable Parallel Algorithms for High Dimensional Numerical Integration
Energy Technology Data Exchange (ETDEWEB)
Masalma, Yahya [Universidad del Turabo; Jiao, Yu [ORNL
2010-10-01
We implemented a scalable parallel quasi-Monte Carlo numerical high-dimensional integration for tera-scale data points. The implemented algorithm uses Sobol's quasi-sequences to generate random samples. Sobol's sequence was used to avoid clustering effects in the generated random samples and to produce low-discrepancy random samples which cover the entire integration domain. The performance of the algorithm was tested. The results obtained demonstrate the scalability and accuracy of the implemented algorithms. The implemented algorithm could be used in different applications where a huge data volume is generated and numerical integration is required. We suggest using the hybrid MPI and OpenMP programming model to improve the performance of the algorithms. If the mixed model is used, attention should be paid to the scalability and accuracy.
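The idea of integrating with a low-discrepancy sequence can be shown in a dependency-free sketch. A Halton sequence is swapped in here for the Sobol sequence used in the report, purely to avoid an external library; the function names and the test integrand are assumptions.

```python
# Quasi-Monte Carlo estimate of an integral over the unit square using a
# Halton sequence. Halton replaces Sobol here only to keep the sketch free of
# external dependencies; the report itself uses Sobol sequences.

def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_point(index, bases=(2, 3)):
    """The index-th point of the Halton sequence, one coprime base per dimension."""
    return tuple(radical_inverse(index, b) for b in bases)


def qmc_integrate(f, n_points, bases=(2, 3)):
    """Average f over the first n_points Halton points (indices 1..n_points)."""
    total = sum(f(*halton_point(i, bases)) for i in range(1, n_points + 1))
    return total / n_points


# Integral of x*y over the unit square; the exact value is 1/4.
estimate = qmc_integrate(lambda x, y: x * y, 4096)
```

Because each point depends only on its index, disjoint index ranges could be handed to different workers and the partial sums combined at the end, which is the property that makes this kind of integration straightforward to distribute.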
A parallel genetic algorithm for the set partitioning problem
Energy Technology Data Exchange (ETDEWEB)
Levine, D. [Argonne National Lab., IL (United States). Mathematics and Computer Science Division.
1994-05-01
In this dissertation the author reports on his efforts to develop a parallel genetic algorithm and apply it to the solution of the set partitioning problem, a difficult combinatorial optimization problem used by many airlines as a mathematical model for flight crew scheduling. He developed a distributed steady-state genetic algorithm in conjunction with a specialized local search heuristic for solving the set partitioning problem. The genetic algorithm is based on an island model where multiple independent subpopulations each run a steady-state genetic algorithm on their subpopulation and occasionally fit strings migrate between the subpopulations. Tests on forty real-world set partitioning problems were carried out on up to 128 nodes of an IBM SP1 parallel computer. The authors found that performance, as measured by the quality of the solution found and the iteration on which it was found, improved as additional subpopulations were added to the computation. With larger numbers of subpopulations the genetic algorithm was regularly able to find the optimal solution to problems having up to a few thousand integer variables. In two cases, high-quality integer feasible solutions were found for problems with 36,699 and 43,749 integer variables, respectively. A notable limitation they found was the difficulty solving problems with many constraints.
Mighell, Kenneth John
2010-10-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called crblaster, which does cosmic-ray rejection of CCD images using the embarrassingly parallel l.a.cosmic algorithm. crblaster is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. crblaster uses a two-dimensional image partitioning algorithm that partitions an input image into N rectangular subimages of nearly equal area; the subimages include sufficient additional pixels along common image partition edges such that the need for communication between computer processes is eliminated. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly parallel algorithms. The crblaster source code is freely available at the official application Web site at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800 × 800 pixel Hubble Space Telescope WFPC2 image takes 44 s with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8 GHz quad-core Intel Xeon processors. crblaster is 7.4 times faster when processing the same image on a single core on the same machine. Processing the same image with crblaster simultaneously on all eight cores of the same machine takes 0.875 s—which is a speedup factor of 50.3 times faster than the
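The overlapping two-dimensional partitioning described in this abstract could be sketched roughly as follows. The rule for choosing the tile grid, the halo width, and all names are assumptions for illustration; this is not the crblaster partitioning code.

```python
# Split a height x width image into an r x c grid of rectangular tiles of
# nearly equal area, then pad each tile with a halo of `overlap` pixels
# (clipped at the image border) so each tile can be processed without
# talking to its neighbours. Grid rule and names are hypothetical.

def choose_grid(n_tiles):
    """Return (rows, cols) with rows * cols == n_tiles, as close to square as possible."""
    rows = int(n_tiles ** 0.5)
    while n_tiles % rows != 0:
        rows -= 1
    return rows, n_tiles // rows


def split_points(length, parts):
    """Boundaries that divide `length` pixels into `parts` nearly equal runs."""
    return [(length * k) // parts for k in range(parts + 1)]


def partition(height, width, n_tiles, overlap):
    rows, cols = choose_grid(n_tiles)
    ys, xs = split_points(height, rows), split_points(width, cols)
    tiles = []
    for i in range(rows):
        for j in range(cols):
            # Pixels this tile is responsible for: (y0, y1, x0, x1), half-open.
            core = (ys[i], ys[i + 1], xs[j], xs[j + 1])
            # Core plus halo, clipped to the image.
            padded = (max(ys[i] - overlap, 0), min(ys[i + 1] + overlap, height),
                      max(xs[j] - overlap, 0), min(xs[j + 1] + overlap, width))
            tiles.append({"core": core, "padded": padded})
    return tiles


tiles = partition(800, 800, n_tiles=8, overlap=5)
```

Each worker would read its `padded` region and write back only its `core` region. For a prime number of tiles this simple grid rule degenerates to strips, which a production partitioner would likely handle differently.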
Parallel Algorithms for Graph Optimization using Tree Decompositions
Energy Technology Data Exchange (ETDEWEB)
Sullivan, Blair D [ORNL; Weerapurage, Dinesh P [ORNL; Groer, Christopher S [ORNL
2012-06-01
Although many NP-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
Eddy current testing probe optimization using a parallel genetic algorithm
Directory of Open Access Journals (Sweden)
Dolapchiev Ivaylo
2008-01-01
Full Text Available This paper uses the developed parallel version of Michalewicz's Genocop III Genetic Algorithm (GA) searching technique to optimize the coil geometry of an eddy current non-destructive testing probe (ECTP). The electromagnetic field is computed using FEMM 2D finite element code. The aim of this optimization was to determine coil dimensions and positions that improve ECTP sensitivity to physical properties of the tested devices.
Algorithm of parallel-hierarchical transformation and its implementation on FPGA
Timchenko, Leonid I.; Petrovskiy, Mykola S.; Kokryatskay, Natalia I.; Barylo, Alexander S.; Dembitska, Sofia V.; Stepanikuk, Dmytro S.; Suleimenov, Batyrbek; Zyska, Tomasz; Uvaysova, Svetlana; Shedreyeva, Indira
2017-08-01
This paper considers an algorithm for classifying laser beam spot images in atmospheric-optical transmission systems. It discusses the need for image filtering using adaptive methods, for example parallel-hierarchical networks. The article also highlights the need to create high-speed memory devices for such networks. Implementation and simulation results of the developed method based on the PLD are demonstrated, which show that the presented method gives 15-20% better prediction results than similar methods.
A hybrid algorithm for parallel molecular dynamics simulations
Mangiardi, Chris M.; Meyer, R.
2017-10-01
This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
A parallel genetic algorithm for the set partitioning problem
Energy Technology Data Exchange (ETDEWEB)
Levine, D.
1996-12-31
This paper describes a parallel genetic algorithm developed for the solution of the set partitioning problem, a difficult combinatorial optimization problem used by many airlines as a mathematical model for flight crew scheduling. The genetic algorithm is based on an island model where multiple independent subpopulations each run a steady-state genetic algorithm on their own subpopulation and occasionally fit strings migrate between the subpopulations. Tests on forty real-world set partitioning problems were carried out on up to 128 nodes of an IBM SP1 parallel computer. We found that performance, as measured by the quality of the solution found and the iteration on which it was found, improved as additional subpopulations were added to the computation. With larger numbers of subpopulations the genetic algorithm was regularly able to find the optimal solution to problems having up to a few thousand integer variables. In two cases, high-quality integer feasible solutions were found for problems with 36,699 and 43,749 integer variables, respectively. A notable limitation we found was the difficulty solving problems with many constraints.
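The island model with steady-state replacement and occasional migration can be sketched on a toy problem. Everything specific below is an assumption for illustration: the "OneMax" fitness (count of 1 bits), the operators, the ring migration pattern, and the parameter values. The set partitioning encoding and the local search heuristic of the paper are not shown.

```python
import random

# Island-model steady-state GA on a toy "OneMax" problem (maximise the number
# of 1 bits). Fitness, operators, and parameters are hypothetical.


def fitness(bits):
    return sum(bits)


def steady_state_step(pop, rng):
    """Breed one child from two tournament winners; replace the worst member."""
    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    p1, p2 = tournament(), tournament()
    cut = rng.randrange(1, len(p1))
    child = p1[:cut] + p2[cut:]                               # one-point crossover
    child = [bit ^ (rng.random() < 0.02) for bit in child]    # bit-flip mutation
    worst = min(range(len(pop)), key=lambda i: fitness(pop[i]))
    if fitness(child) >= fitness(pop[worst]):
        pop[worst] = child


def migrate(islands):
    """Copy each island's best string over the worst string of the next island (ring)."""
    bests = [max(pop, key=fitness)[:] for pop in islands]
    for k, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda i: fitness(pop[i]))
        pop[worst] = bests[(k - 1) % len(islands)]


def run(n_islands=4, pop_size=20, n_bits=32, steps=2000, migrate_every=100, seed=0):
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
               for _ in range(n_islands)]
    for step in range(1, steps + 1):
        for pop in islands:   # on a parallel machine each island would run on its own node
            steady_state_step(pop, rng)
        if step % migrate_every == 0:
            migrate(islands)
    return islands


islands = run()
best = max(fitness(ind) for pop in islands for ind in pop)
```

Between migrations the islands never exchange data, so the only communication in a distributed version would be the occasional transfer of a few strings.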
Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model
Hamam, Alwaleed A.
2017-03-13
Deep learning is based on a set of algorithms that attempt to model high-level abstractions in data. Specifically, the Restricted Boltzmann Machine (RBM) is the deep learning algorithm used in this project; the goal is to improve its run-time performance through an efficient parallel implementation with OpenACC, applying the best possible optimizations to RBM to harness the massively parallel power of NVIDIA GPUs. GPU development in the last few years has contributed to the growth of deep learning. OpenACC is a directive-based approach to computing in which directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Machine is a stochastic neural network that essentially performs a binary version of factor analysis. RBM is a useful neural network basis for larger modern deep learning models, such as Deep Belief Networks. RBM parameters are estimated using an efficient training method called Contrastive Divergence. Parallel implementations of RBM are available using different models such as OpenMP and CUDA, but this project is the first attempt to apply the OpenACC model to RBM.
Parallel Algorithm for Incremental Betweenness Centrality on Large Graphs
Jamour, Fuad Tarek
2017-10-17
Betweenness centrality quantifies the importance of nodes in a graph in many applications, including network analysis, community detection and identification of influential users. Typically, graphs in such applications evolve over time. Thus, the computation of betweenness centrality should be performed incrementally. This is challenging because updating even a single edge may trigger the computation of all-pairs shortest paths in the entire graph. Existing approaches cannot scale to large graphs: they either require excessive memory (i.e., quadratic in the size of the input graph) or perform unnecessary computations, rendering them prohibitively slow. We propose iCentral, a novel incremental algorithm for computing betweenness centrality in evolving graphs. We decompose the graph into biconnected components and prove that processing can be localized within the affected components. iCentral is the first algorithm to support incremental betweenness centrality computation within a graph component. This is done efficiently, in linear space; consequently, iCentral scales to large graphs. We demonstrate with real datasets that the serial implementation of iCentral is up to 3.7 times faster than existing serial methods. Our parallel implementation, which scales to large graphs, is an order of magnitude faster than the state-of-the-art parallel algorithm, while using an order of magnitude less computational resources.
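For context on why a single edge update is expensive, the standard static computation (Brandes' algorithm) runs one shortest-path search per source node. The sketch below is that static baseline for unweighted, undirected graphs, not iCentral; the adjacency-dictionary representation and the four-node path graph are assumptions for illustration.

```python
from collections import deque

# Brandes' algorithm for betweenness centrality on an unweighted, undirected
# graph. This is the static (recompute-everything) baseline, not iCentral.


def betweenness(adj):
    """adj maps each node to an iterable of neighbours; returns node -> centrality."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in bc.items()}


# Path graph a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness(path)
```

The outer loop over every source is the work an incremental method tries to avoid repeating; the searches for different sources are independent, which is also what makes the static algorithm easy to parallelize.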
Parallel pipeline algorithm of real time star map preprocessing
Wang, Hai-yong; Qin, Tian-mu; Liu, Jia-qi; Li, Zhi-feng; Li, Jian-hua
2016-03-01
To improve the preprocessing speed of star maps and reduce the resource consumption of the embedded system of a star tracker, a parallel pipeline real-time preprocessing algorithm is presented. Two characteristics of the background gray level of a star map, the mean and the noise standard deviation, are first obtained dynamically, with the interference of the star images themselves on the background removed in advance. A criterion for whether subsequent noise filtering is needed is established, and the extraction threshold is then assigned according to the level of background noise, so that the centroiding accuracy is guaranteed. In the processing algorithm, as few as two lines of pixel data are buffered, and only 100 shift registers are used to record the connected domain labels, which solves the problems of wasted resources and connected domain overflow. The simulation results show that the necessary data of the selected bright stars can be accessed within a delay as short as 10 μs after the pipeline processing of a 496×496 star map at 50 Mb/s is finished, and the memory and register resources needed total less than 80 kb. To verify the accuracy of the proposed algorithm, different levels of background noise are added to an ideal star map, and the statistical centroiding error is smaller than 1/23 pixel when the signal-to-noise ratio is greater than 1. The parallel pipeline algorithm for real-time star map preprocessing helps to increase the data output speed and the anti-dynamic performance of the star tracker.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris
2000-01-01
Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office Ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and 0s, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency
Research on Parallelization of GPU-based K-Nearest Neighbor Algorithm
Jiang, Hao; Wu, Yulin
2017-10-01
Based on the analysis of the K-Nearest Neighbor Algorithm, the feasibility of parallelization is studied from the steps of the algorithm, the operation efficiency and the data structure of each step, and the part of parallel execution is determined. A K-Nearest Neighbor Algorithm parallelization scheme is designed and the parallel G-KNN algorithm is implemented in the CUDA environment. The experimental results show that the K-Nearest Neighbor Algorithm has a significant improvement in efficiency after parallelization, especially on large-scale data.
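As a rough illustration of which step of K-Nearest Neighbor lends itself to parallel execution, the sketch below computes all query-to-point distances independently and then selects the k nearest. It is not the G-KNN CUDA code from the paper: the function names are hypothetical, and a Python thread pool stands in for GPU threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train_points, train_labels, query, k, workers=4):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1 (parallel): every distance is independent of all the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        distances = list(pool.map(lambda p: squared_distance(p, query),
                                  train_points))
    # Step 2: pick the k smallest distances (a plain sort here; a GPU
    # version would typically use a parallel sort or selection instead).
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    # Step 3: majority vote among the k neighbours.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

The distance step dominates the cost for large training sets, which is consistent with the abstract's remark that the gain is largest on large-scale data.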
Parallel particle swarm optimization algorithm in nuclear problems
Energy Technology Data Exchange (ETDEWEB)
Waintraub, Marcel; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil)], e-mail: marcel@ien.gov.br, e-mail: cmnap@ien.gov.br; Schirru, Roberto [Coordenacao dos Programas de Pos-graduacao de Engenharia (COPPE/UFRJ), Rio de Janeiro, RJ (Brazil). Lab. de Monitoracao de Processos], e-mail: schirru@lmp.ufrj.br
2009-07-01
Particle Swarm Optimization (PSO) is a population-based metaheuristic (PBM) in which solution candidates evolve through simulation of a simplified social adaptation model. Combining robustness, efficiency and simplicity, PSO has gained great popularity. Many successful applications of PSO have been reported in which PSO showed advantages over other well-established PBMs. However, computational cost is still a major constraint for PSO, as for all other PBMs, especially in optimization problems with time-consuming objective functions. To overcome this difficulty, parallel computation has been used. The most immediate advantage of parallel PSO (PPSO) is the reduction of computational time, and master-slave approaches, which exploit this characteristic, are the most investigated. However, much more should be expected. It is known that PSO may be improved by more elaborate neighborhood topologies. Hence, in this work, we develop several different PPSO algorithms that exploit the advantages of enhanced neighborhood topologies implemented through communication strategies in multiprocessor architectures. The proposed PPSOs have been applied to two complex and time-consuming nuclear engineering problems: reactor core design and fuel reload optimization. After exhaustive experiments, it has been concluded that PPSO still improves solutions after many thousands of iterations, which makes the use of serial (non-parallel) PSO prohibitive in such real-world problems, and that PPSO with more elaborate communication strategies is more efficient and robust than the master-slave model. Advantages and peculiarities of each model are carefully discussed in this work. (author)
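To make the idea of a neighborhood topology concrete, here is a minimal serial sketch of PSO with a ring topology, in which each particle is attracted to the best position found by itself and its two ring neighbours rather than by the whole swarm. It is an illustration only and not one of the PPSO variants from the paper; the sphere objective, the coefficients `w`, `c1`, `c2`, the search range and the function names are assumptions.

```python
import random


def sphere(x):
    """Toy objective (assumption): sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def ring_pso(objective, dim, n_particles=20, iterations=100, seed=0):
    """Return (best position, best value, best value before any update)."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # assumed inertia and acceleration coefficients
    pos = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    initial_best = min(pbest_val)
    for _ in range(iterations):
        for i in range(n_particles):
            # Ring neighbourhood: the particle and its two adjacent particles.
            neighbours = [(i - 1) % n_particles, i, (i + 1) % n_particles]
            lbest = pbest[min(neighbours, key=pbest_val.__getitem__)]
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (lbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            value = objective(pos[i])
            if value < pbest_val[i]:  # personal bests only ever improve
                pbest_val[i] = value
                pbest[i] = pos[i][:]
    best = min(range(n_particles), key=pbest_val.__getitem__)
    return pbest[best], pbest_val[best], initial_best
```

In a parallel setting, a ring like this means each processor only needs to exchange personal bests with its two neighbours, whereas a master-slave scheme funnels every evaluation through one node.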
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm
Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian
2017-03-01
DNA is one of the carriers of the genetic information of living organisms. Encoding, sequencing, and clustering DNA sequences have become key and routine tasks in molecular biology, in particular in bioinformatics applications. There are two types of clustering: hierarchical clustering and partitioning clustering. In this paper, we combine the two types, K-Means (partitioning clustering) and DIANA (hierarchical clustering), and therefore call the approach hybrid clustering. Hybrid clustering using the Parallel K-Means algorithm and the DIANA algorithm is applied to cluster DNA sequences of Human Papillomavirus (HPV). The clustering process starts with collecting HPV DNA sequences from NCBI (National Center for Biotechnology Information) and then extracting characteristics of the DNA sequences. The extracted characteristics are stored in matrix form, the matrix is normalized using Min-Max normalization, and genetic distance is calculated using Euclidean distance. The hybrid clustering is then applied using implementations of the Parallel K-Means algorithm and the DIANA algorithm. The aim of using hybrid clustering is to obtain better clusters. To validate the resulting clusters and to find the optimum number of clusters, we use the Davies-Bouldin Index (DBI). In this study, Parallel K-Means clustering groups the data into 5 clusters with a minimal DBI value of 0.8741, and hybrid clustering groups the data into 13 sub-clusters with minimal DBI values of 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The DBI values of hybrid clustering are lower than the DBI value of Parallel K-Means clustering, which is performed only at the first stage. This means that hybrid clustering gives a better result for clustering HPV DNA sequences than parallel K-Means clustering alone.
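The preprocessing and validation steps named in this abstract (Min-Max normalization, Euclidean distance, Davies-Bouldin Index) can be sketched as follows. The code uses the standard textbook definitions and hypothetical function names; it is not taken from the paper, and it assumes at least two clusters with distinct centroids.

```python
import math


def min_max_normalize(rows):
    """Rescale every column of a feature matrix to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def davies_bouldin(points, labels):
    """Davies-Bouldin Index: lower values mean tighter, better separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centroids = {lab: [sum(c) / len(members) for c in zip(*members)]
                 for lab, members in clusters.items()}
    # Scatter: mean distance of the members of a cluster to its centroid.
    scatter = {lab: sum(euclidean(p, centroids[lab]) for p in members) / len(members)
               for lab, members in clusters.items()}
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster j.
        total += max((scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)
```

For example, the four points `[[0, 0], [0, 2], [10, 0], [10, 2]]` normalize to the corners of the unit square, and grouping them by the first coordinate gives two clusters with scatter 0.5 each and centroids one unit apart, so the index is 1.0.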
Step-parallel algorithms for stiff initial value problems
W.A. van der Veen
1995-01-01
For the parallel integration of stiff initial value problems, three types of parallelism can be employed: 'parallelism across the problem', 'parallelism across the method' and 'parallelism across the steps'. Recently, methods based on Runge-Kutta schemes that use parallelism across the
Graph Transformation and Designing Parallel Sparse Matrix Algorithms beyond Data Dependence Analysis
Directory of Open Access Journals (Sweden)
H.X. Lin
2004-01-01
Algorithms are often parallelized based on data dependence analysis, either manually or by means of parallel compilers. Some vector/matrix computations with simple data dependence structures (data parallelism), such as matrix-vector products, can be easily parallelized. For problems with more complicated data dependence structures, parallelization is less straightforward. The data dependence graph is a powerful means for designing and analyzing parallel algorithms. However, for sparse matrix computations, parallelization based solely on exploiting the existing parallelism in an algorithm does not always give satisfactory results. For example, the conventional Gaussian elimination algorithm for the solution of a tri-diagonal system is inherently sequential, so algorithms have to be designed specially for parallel computation. After briefly reviewing different parallelization approaches, a powerful graph formalism for designing parallel algorithms is introduced. This formalism will be discussed using a tri-diagonal system as an example. Its application to general matrix computations is also discussed. Its power in designing parallel algorithms beyond the ability of data dependence analysis is shown by means of a new algorithm called ACER (Alternating Cyclic Elimination and Reduction algorithm).
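The claim that conventional elimination on a tri-diagonal system is inherently sequential is easiest to see in code. The sketch below is the standard Thomas algorithm, shown only as the sequential baseline; it is not ACER, and the function name and argument layout are assumptions.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tri-diagonal system; lower[0] and upper[-1] are unused."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: row i needs c[i - 1] and d[i - 1], so the
    # iterations cannot run independently of one another.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution: x[i] needs x[i + 1], again a sequential chain.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Both loops carry a dependence from one iteration to the next, so data dependence analysis alone finds no parallelism here; a parallel method has to restructure the elimination order rather than reuse this one.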
Parallel optimization algorithm for drone inspection in the building industry
Walczyński, Maciej; Bożejko, Wojciech; Skorupka, Dariusz
2017-07-01
In this paper we present an approach to the Vehicle Routing Problem with Drones (VRPD) for the case of building inspection from the air. In an autonomous inspection process, the optimal route for the inspection drone must be determined. This is especially important because of the very limited flight time of modern multicopters. The method of determining solutions for the Traveling Salesman Problem (TSP) described in this paper is based on a Parallel Evolutionary Algorithm (ParEA) with cooperative and independent approaches to communication between threads. This method, first described by Bożejko and Wodecki [1], is based on the observation that if a number of elements occupy the same positions in a number of permutations that are local minima, then those elements will be in the same positions in the optimal solution of the TSP. Numerical experiments were carried out on the BEM computational cluster using the MPI library.
A Parallel Distributed Data CPHF Algorithm for Analytic Hessians
Energy Technology Data Exchange (ETDEWEB)
Alexeev, Yuri; Schmidt, Michael W.; Windus, Theresa L.; Gordon, Mark S.
2007-07-30
One of the most commonly used means to characterize potential energy surfaces of reactions and chemical systems is the Hessian calculation, whose analytic evaluation is computationally and memory demanding. A new scalable distributed data analytic Hessian algorithm is presented. Features of the distributed data parallel coupled perturbed Hartree-Fock (CPHF) are (a) columns of density-like and Fock-like matrices are distributed among processors, (b) an efficient static load balancing scheme achieves good work load distribution among the processors, (c) network communication time is minimized, and (d) numerous performance improvements in analytic Hessian steps are made. As a result, the new code has good performance, which is demonstrated via calculations on large biological systems.
A parallel distributed data CPHF algorithm for analytic Hessians.
Alexeev, Yuri; Schmidt, Michael W; Windus, Theresa L; Gordon, Mark S
2007-07-30
One of the most commonly used means to characterize potential energy surfaces of reactions and chemical systems is the Hessian calculation, whose analytic evaluation is computationally and memory demanding. A new scalable distributed data analytic Hessian algorithm is presented. Features of the distributed data parallel coupled perturbed Hartree-Fock (CPHF) are (a) columns of density-like and Fock-like matrices are distributed among processors, (b) an efficient static load balancing scheme achieves good work load distribution among the processors, (c) network communication time is minimized, and (d) numerous performance improvements in analytic Hessian steps are made. As a result, the new code has good performance which is demonstrated on large biological systems. (c) 2007 Wiley Periodicals, Inc.
A parallel adaptive finite difference algorithm for petroleum reservoir simulation
Energy Technology Data Exchange (ETDEWEB)
Hoang, Hai Minh
2005-07-01
Adaptive finite difference methods for problems arising in the simulation of flow in porous media are considered. Such methods have proven useful for overcoming limitations of computational resources and improving the resolution of numerical solutions to a wide range of problems. Local refinement of the computational mesh where it is needed to improve the accuracy of solutions yields better solution resolution and more efficient use of computational resources than is possible with traditional fixed-grid approaches. In this thesis, we propose a parallel adaptive cell-centered finite difference (PAFD) method for black-oil reservoir simulation models. This is an extension of the adaptive mesh refinement (AMR) methodology first developed by Berger and Oliger (1984) for hyperbolic problems. Our algorithm is fully adaptive in time and space through the use of subcycling, in which finer grids are advanced at smaller time steps than the coarser ones. When coarse and fine grids reach the same advanced time level, they are synchronized to ensure that the global solution is conservative and satisfies the divergence constraint across all levels of refinement. The material in this thesis is subdivided into three overall parts. First we explain the methodology and intricacies of the AFD scheme. Then we extend a cell-centered finite difference discretization to a multilevel hierarchy of refined grids, and finally we employ the algorithm on a parallel computer. The results in this work show that the approach presented is robust and stable, and demonstrate the increased solution accuracy due to local refinement and the reduced consumption of computing resources. (Author)
An Implementation and Parallelization of the Scale Space Meshing Algorithm
Directory of Open Access Journals (Sweden)
Julie Digne
2015-11-01
Creating an interpolating mesh from an unorganized set of oriented points is a difficult problem which is often overlooked. Most methods focus indeed on building a watertight smoothed mesh by defining some function whose zero level set is the surface of the object. However in some cases it is crucial to build a mesh that interpolates the points and does not fill the acquisition holes: either because the data are sparse and trying to fill the holes would create spurious artifacts or because the goal is to explore visually the data exactly as they were acquired without any smoothing process. In this paper we detail a parallel implementation of the Scale-Space Meshing algorithm, which builds on the scale-space framework for reconstructing a high precision mesh from an input oriented point set. This algorithm first smoothes the point set, producing a singularity free shape. It then uses a standard mesh reconstruction technique, the Ball Pivoting Algorithm, to build a mesh from the smoothed point set. The final step consists in back-projecting the mesh built on the smoothed positions onto the original point set. The result of this process is an interpolating, hole-preserving surface mesh reconstruction.
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis
Brouwer, Randall Jay
1991-01-01
The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
Near-Optimal Speedup of Graphics Algorithms Using Multigauge Parallel Computers.
1986-12-01
Technical Report. Keywords: multigauge parallel computers, Quarter Horse microprocessor, graphics algorithm. Abstract (fragment): ... structures it is a very low cost way of exploiting parallelism. 1 Introduction: It is not sufficient for parallel computers to speed
Optimization Algorithms for Calculation of the Joint Design Point in Parallel Systems
DEFF Research Database (Denmark)
Enevoldsen, I.; Sørensen, John Dalsgaard
1992-01-01
In large structures it is often necessary to estimate the reliability of the system by use of parallel systems. Optimality criteria-based algorithms for calculation of the joint design point in a parallel system are described and efficient active set strategies are developed. Three possible...... algorithms are tested in two examples against well-known general non-linear optimization algorithms. Especially one of the suggested algorithms seems to be stable and fast....
Optimal Design of Passive Power Filters Based on Pseudo-parallel Genetic Algorithm
Li, Pei; Li, Hongbo; Gao, Nannan; Niu, Lin; Guo, Liangfeng; Pei, Ying; Zhang, Yanyan; Xu, Minmin; Chen, Kerui
2017-05-01
Economic cost and filter efficiency are taken as the targets for optimizing the parameters of a passive power filter. A method combining a pseudo-parallel genetic algorithm with an adaptive genetic algorithm is adopted in this paper. In the early stages, the pseudo-parallel genetic algorithm is used to increase population diversity, and in the late stages the adaptive genetic algorithm is used to reduce the workload. In addition, the migration rate of the pseudo-parallel genetic algorithm is modified so that it changes adaptively with population diversity. Simulation results show that the filter designed by the proposed method has a better filtering effect at lower economic cost and can be used in engineering practice.
A parallel stereo reconstruction algorithm with applications in entomology (APSRA)
Bhasin, Rajesh; Jang, Won Jun; Hart, John C.
2012-03-01
We propose a fast parallel algorithm for the reconstruction of 3-dimensional point clouds of insects from binocular stereo image pairs using a hierarchical approach for disparity estimation. Entomologists study various features of insects to classify them, build their distribution maps, and discover genetic links between specimens, among various other essential tasks. This information is important to the pesticide and pharmaceutical industries, among others. Given the large collections of insects that entomologists analyze, it becomes difficult to physically handle the entire collection and share the data with researchers across the world. With the method presented in our work, entomologists can create an image database for their collections and use the 3D models for studying the shape and structure of the insects, making the collections easier to maintain and share. Initial feedback shows that the reconstructed 3D models preserve the shape and size of the specimen. We further optimize our results to incorporate multiview stereo, which produces a better overall structure of the insects. Our main contribution is applying stereoscopic vision techniques to entomology to solve the problems faced by entomologists.
The merits of a parallel genetic algorithm in solving hard optimization problems
van Soest, A.J.; Casius, L.J.R.
2003-01-01
A parallel genetic algorithm for optimization is outlined, and its performance on both mathematical and biomechanical optimization problems is compared to a sequential quadratic programming algorithm, a downhill simplex algorithm and a simulated annealing algorithm. When high-dimensional non-smooth
An Alternative Algorithm for Computing Watersheds on Shared Memory Parallel Computers
Meijster, A.; Roerdink, J.B.T.M.
1995-01-01
In this paper a parallel implementation of a watershed algorithm is proposed. The algorithm can easily be implemented on shared memory parallel computers. The watershed transform is generally considered to be inherently sequential since the discrete watershed of an image is defined using recursion.
Accelerated parallel algorithm for gene network reverse engineering
National Research Council Canada - National Science Library
He, Jing; Zhou, Zhou; Reed, Michael; Califano, Andrea
2017-01-01
... used. These issues can be addressed elegantly in a GPU computing framework, where repeated mathematical computation can be done efficiently, but requires extensive redesign to apply parallel computing...
Comparison Of Hybrid Sorting Algorithms Implemented On Different Parallel Hardware Platforms
Directory of Open Access Journals (Sweden)
Dominik Zurek
2013-01-01
Sorting is a common problem in computer science. There are many well-known sorting algorithms created for sequential execution on a single processor. Recent hardware platforms make it possible to create widely parallel algorithms: standard processors consist of multiple cores, and hardware accelerators such as GPUs are available. Graphics cards, with their parallel architecture, offer a new possibility to speed up many algorithms. In this paper we describe the results of implementing a few different sorting algorithms on GPU cards and multicore processors. A hybrid algorithm is then presented which consists of parts executed on both platforms, a standard CPU and a GPU.
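The abstract does not say how the hybrid algorithm divides its work, so the following is only one plausible shape for it: split the input, sort the two parts concurrently on different devices, then merge. Here both parts run on CPU threads as a stand-in, and the name `hybrid_sort` and the `split_ratio` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def hybrid_sort(values, split_ratio=0.5):
    """Sort two slices of the input concurrently, then merge the results."""
    cut = int(len(values) * split_ratio)
    # In a real CPU/GPU hybrid, one slice would be handed to each device and
    # split_ratio would be tuned to their relative throughput (assumption).
    parts = [values[:cut], values[cut:]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        sorted_parts = list(pool.map(sorted, parts))
    # A single linear merge combines the two sorted runs.
    return list(merge(*sorted_parts))
```

The merge costs only linear time, so the benefit of such a split depends mainly on balancing the two slices so that both devices finish at about the same moment.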
Mattei, D.; Smith, I.; Ferrari, A.; Carbillet, M.
2010-10-01
Post-processing for exoplanet detection using direct imaging requires large data cubes and/or sophisticated signal processing techniques. For alt-azimuthal mounts, a projection effect called field rotation makes the potential planet rotate in a known manner on the set of images. For ground-based telescopes that use extreme adaptive optics and advanced coronagraphy, techniques based on field rotation are already broadly used and still under development. In most such techniques, for a given initial position of the planet, the planet intensity estimate is a linear function of the set of images. However, due to field rotation the modified instrumental response applied is not shift invariant like usual linear filters. Testing all possible initial positions is therefore very time-consuming. To reduce the processing time, we propose to handle each subset of initial positions on a different machine using parallel programming. In particular, the MOODS algorithm dedicated to the VLT-SPHERE instrument, which jointly estimates the light contributions of the star and the potential exoplanet, is parallelized on the Observatoire de la Cote d'Azur cluster. Different parallelization methods (OpenMP, MPI, Jobs Array) have been developed for the initial MOODS code and compared to each other. The one finally chosen splits the initial positions over the available processors while accounting as well as possible for the different constraints of the cluster structure: memory, job submission queues, number of available CPUs, and average cluster load. In the end, a standard set of images is satisfactorily processed in a few hours instead of a few days.
Scalable Parallel Algorithms for Formal Verification of Software Project
National Aeronautics and Space Administration — We will develop an efficient Graphics Processing Unit (GPU) based parallel Binary Decision Diagram (BDD) software package, and will also combine it with our...
Scalable Parallel Algorithms for Formal Verification of Software Project
National Aeronautics and Space Administration — We will develop a prototype of a GPU-based parallel Binary Decision Diagram (BDD) software package. BDDs are a data structure that satisfies some simple...
Pernpointner, M.; Visscher, L.
2003-01-01
Given the importance of the Coupled-cluster (CC) method as an efficient and accurate way to take electron correlation into account, we extend the parallelization technique in the second part of this series also to the 4-Spinor CCSD algorithm implemented in the Dirac-Fock packages DIRAC and MOLFDIR.
A formal reduction for lock-free parallel algorithms
Gao, H.; Hesselink, W.H.; Alur, R; Peled, DA
2004-01-01
On shared memory multiprocessors, synchronization often turns out to be a performance bottleneck and the source of poor fault-tolerance. Lock-free algorithms can do without locking mechanisms, and are therefore desirable. Lock-free algorithms are hard to design correctly, however, even when
Parallel Algorithms for Online Track Finding for the P̄ANDA Experiment at FAIR
Bianchi, L.; Herten, A.; Ritman, J.; Stockmanns, T.; PANDA Collaboration
2017-10-01
P̄ANDA is a future hadron and nuclear physics experiment at the FAIR facility under construction in Darmstadt, Germany. Unlike the majority of current experiments, P̄ANDA's strategy for data acquisition is based on online event reconstruction from free-streaming data, performed in real time entirely by software algorithms using global detector information. This paper reports on the status of the development of algorithms for the reconstruction of charged particle tracks, targeted towards online data processing applications, designed for execution on data-parallel processors such as GPUs (Graphics Processing Units). Two parallel algorithms for track finding, derived from the Circle Hough algorithm, are being developed to extend the parallelism to all stages of the algorithm. The concepts of the algorithms are described, along with preliminary results and considerations about their implementations and performance.
2014-01-01
Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs
Directory of Open Access Journals (Sweden)
Vaughn Matthew
2010-11-01
Background Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories, based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). Results In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(n log(n/B) / (B log(M/B))) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster, both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. Conclusions The bi
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.
Kundeti, Vamsi K; Rajasekaran, Sanguthevar; Dinh, Hieu; Vaughn, Matthew; Thapar, Vishal
2010-11-15
Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories, based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(n log(n/B) / (B log(M/B))) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster, both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. The bi-directed de Bruijn graph is a fundamental data structure for
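One way to see why building a bi-directed de Bruijn graph can be reduced to sorting is the small sketch below. It is an illustration under simplifying assumptions, not the algorithm of the paper: every (k+1)-mer is put into canonical form (the smaller of itself and its reverse complement), the canonical forms are sorted, and duplicates, which are then adjacent, are dropped. Function names are hypothetical and reads are assumed to contain only A, C, G and T.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical(kmer):
    """Strand-independent form: the smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def bidirected_edges(reads, k):
    """Return sorted, unique (node, node, edge) triples of canonical k-mers."""
    # Every (k+1)-mer in a read joins its length-k prefix to its length-k suffix.
    records = [canonical(read[i:i + k + 1])
               for read in reads
               for i in range(len(read) - k)]
    # After sorting, copies of the same edge (from either strand) are adjacent.
    records.sort()
    edges = []
    for edge in records:
        if not edges or edges[-1][2] != edge:
            edges.append((canonical(edge[:k]), canonical(edge[1:]), edge))
    return edges
```

Nothing in this construction enumerates candidate neighbours over the alphabet, so the work does not grow with Σ; the dominant step is the sort, which is the operation whose cost the abstract uses as its yardstick. A read and its reverse complement produce the same edge set, for example `"ACGTT"` and `"AACGT"` with `k = 3`.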
Data-Parallel Algorithm for Contour Tree Construction
Energy Technology Data Exchange (ETDEWEB)
Sewell, Christopher Meyer [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Ahrens, James Paul [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Carr, Hamish [Univ. of Leeds (United Kingdom); Weber, Gunther [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
2017-01-19
The goal of this project is to develop algorithms for additional visualization and analysis filters in order to expand the functionality of the VTK-m toolkit to support less critical but commonly used operators.
Efficient Parallel Algorithm for Statistical Ion Track Simulations in Crystalline Materials
Jeon, Byoungseon
2008-01-01
We present an efficient parallel algorithm for statistical Molecular Dynamics simulations of ion tracks in solids. The method is based on the Rare Event Enhanced Domain following Molecular Dynamics (REED-MD) algorithm, which has been successfully applied to studies of, e.g., ion implantation into crystalline semiconductor wafers. We discuss the strategies for parallelizing the method, and we settle on a host-client type polling scheme in which a multiple of asynchronous processors are continuously fed to the host, which, in turn, distributes the resulting feed-back information to the clients. This real-time feed-back consists of, e.g., cumulative damage information or statistics updates necessary for the cloning in the rare event algorithm. We finally demonstrate the algorithm for radiation effects in a nuclear oxide fuel, and we show the balanced parallel approach with high parallel efficiency in multiple processor configurations.
Jacobian free monotonic descent algorithm for forward kinematics of spatial parallel manipulator
Gang Shen; Yu Tang; Jinsong Zhao; Hao Lu; Xiang Li; Ge Li
2016-01-01
In order to efficiently solve a forward kinematics of parallel manipulators for real-time applications, a Jacobian free monotonic descent algorithm is proposed in this article. The system of nonlinear equations of a specified 6-degree-of-freedom parallel manipulator is established using a geometric analysis method. The proposed Jacobian free monotonic descent algorithm is modified using a traditional Newton–Raphson method by employing a first-order Taylor series expansion to numerically appro...
ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches
Rognes, Torbjørn
2001-01-01
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of th...
Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
2014-01-01
It is very time consuming to solve fractional differential equations. The computational complexity of the two-dimensional fractional differential equation (2D-TFDE) with an iterative implicit finite difference method is O(M_x M_y N^2). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion of this algorithm. A task distribution model and a data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on a single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on a single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that parallel computing technology will become a very basic method for computationally intensive fractional applications in the near future.
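The relative efficiency quoted above (81 processes measured against a 9-process baseline) can be made concrete with a small calculation. The sketch below is only illustrative: the function names and the timings in the test are hypothetical and not taken from the paper; it assumes the usual definition of relative efficiency as the ratio of processor-time products.

```python
def relative_efficiency(p_base, t_base, p_large, t_large):
    """Efficiency of a run on p_large processes relative to a run on p_base.

    Defined as (p_base * t_base) / (p_large * t_large); 1.0 means perfect scaling.
    """
    return (p_base * t_base) / (p_large * t_large)


def implied_speedup(efficiency, p_base, p_large):
    """Speedup over the baseline run implied by a given relative efficiency."""
    return efficiency * p_large / p_base


# Reading the abstract's figure this way: 88.24% efficiency going from
# 9 to 81 processes corresponds to roughly a 7.94x speedup over the
# 9-process run (ideal would be 9x).
print(implied_speedup(0.8824, 9, 81))
```

Under this reading, the reported efficiency says how close the 81-process run comes to the ideal ninefold gain over the baseline, not how it compares with a single core.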
A Parallel Processing Algorithm for Remote Sensing Classification
Gualtieri, J. Anthony
2005-01-01
A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level computers using the Linux operating system. For example, on the Medusa cluster at NASA/GSFC, this provides supercomputing performance, 130 Gflops (Linpack Benchmark), at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
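The idea of decomposing a classification job into independent tasks can be sketched as follows. This is not the paper's pseudocode: it assumes, for illustration only, that the tasks are pairwise (one-vs-one) binary classifiers, and `train_pair` is a hypothetical toy stand-in (a midpoint threshold between two class means on 1-D data) rather than an SVM solver.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def train_pair(task):
    """Toy stand-in for training one binary classifier (hypothetical).

    Returns the class pair and the midpoint between the two class means.
    """
    (label_a, xs_a), (label_b, xs_b) = task
    mean_a = sum(xs_a) / len(xs_a)
    mean_b = sum(xs_b) / len(xs_b)
    return (label_a, label_b), (mean_a + mean_b) / 2.0


def train_all_pairs(samples_by_class, workers=4):
    """Train every class pair as an independent task.

    No task reads another task's result, so they can run in any order
    or at the same time.
    """
    tasks = list(combinations(sorted(samples_by_class.items()), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(train_pair, tasks))


samples = {"water": [0.0, 2.0], "soil": [4.0, 6.0], "grass": [10.0, 10.0]}
print(train_all_pairs(samples))
```

Threads keep the sketch simple; for CPU-bound training in CPython one would more likely swap in `ProcessPoolExecutor` or, on a cluster, a message-passing layer, since the task list itself does not change.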
Evaluating Efficiency of Parallel Algorithms of Transformation Operations with Graph Model
Directory of Open Access Journals (Sweden)
G. S. Ivanova
2014-01-01
Full Text Available The usage of graphs in the analysis and design of complex large-scale system structures, which require significant computing capacity, has led to the need to seek new methods both of graph model representation and of graph operation implementation. To reduce the execution time of algorithms, parallel computing systems are appropriate. In this case, to achieve the maximum acceleration, graph processing is implemented in the hardware and software parts of the system. In the article, graph model operations are analyzed in the context of their implementation in a parallel computing system, based on an abstract description of the graph by sets, which allows the use of various parallel processing systems regardless of their architectural features. The graph transformation operations most common in algorithms were considered. As a result of the analysis, a set of elementary operations on graph structures, which constitute graph operations, was revealed, and parallel algorithms for graph operations were realized. Efficiency evaluation of the parallel algorithms, expressed as speedup for each operation, showed a high degree of graph processing acceleration compared with sequential algorithms. The proposed realization can be used to solve time-consuming large-scale tasks on parallel computing systems. At the same time, in a particular parallel processing system it is also possible to parallelize elementary operations, thereby greatly reducing the execution time of an operation in general. Further research is focused on describing and complementing the current implementation with complex parallel graph transformation operations such as intersection, union, composition of graphs, etc., as well as analysis operations for various graph characteristics. That will expand the set of elementary operations and will provide an opportunity to evaluate more accurately the efficiency of parallel computing systems for processing graph models.
Comparative efficiencies of three parallel algorithms for nonlinear ...
Indian Academy of Sciences (India)
1977) and element-by-element methods (Ortiz et al 1983), and virtual pulse techniques (Chen et al. 1995) etc. However, in recent years the most exciting possibility in the algorithm development area for nonlinear dynamic analysis has been ...
ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.
Cai, Yunpeng; Zheng, Wei; Yao, Jin; Yang, Yujie; Mai, Volker; Mao, Qi; Sun, Yijun
2017-04-01
The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
Study on performance improvement of oil paint image filter algorithm using parallel pattern library
Mukherjee, Siddhartha
2014-01-01
This paper gives a detailed study of the performance of the oil paint image filter algorithm with various parameters applied to an image in the RGB model. Because oil paint image processing is very performance hungry, the current research tries to find improvements using the parallel pattern library. With increasing kernel size, the processing time of the oil paint image filter algorithm increases exponentially.
Directory of Open Access Journals (Sweden)
Vasiliy Yu. Meltsov
2012-05-01
Full Text Available This paper presents the results of developing one of the modules of a verification system for parallel algorithms that are used to verify the inference engine. This module is designed to build the specification of requirements whose feasibility on the algorithm must be proved (tested).
Parallel algorithm of trigonometric collocation method in nonlinear dynamics of rotors
Directory of Open Access Journals (Sweden)
Musil T.
2007-11-01
Full Text Available A parallel algorithm of a numeric procedure based on a method of trigonometric collocation is presented for investigating the unbalance response of a rotor supported by journal bearings. After a condensation process the trigonometric collocation method results in a set of nonlinear algebraic equations which is solved by the Newton-Raphson method. The order of the set is proportional to the number of nonlinear bearing coordinates and terms of the finite Fourier series. The algorithm, realized in the MATLAB parallel computing environment (DCT/DCE), uses a message passing technique for interaction among processes on nodes of a parallel computer. This technique makes the source code portable across parallel computers with both distributed and shared memory. Tests, made on a Beowulf cluster and a symmetric multiprocessor, have revealed very good speed-up and scalability of this algorithm.
A New Approach of Parallelism and Load Balance for the Apriori Algorithm
Directory of Open Access Journals (Sweden)
BOLINA, A. C.
2013-06-01
Full Text Available The main goal of data mining is to discover relevant information in digital content. The Apriori algorithm is widely used for this objective, but its sequential version performs poorly when executed over large volumes of data. Among the solutions to this problem is the parallel implementation of the algorithm, and among the parallel implementations based on Apriori presented in the literature, DPA (Distributed Parallel Apriori) [10] stands out. This paper presents the DMTA (Distributed Multithread Apriori) algorithm, which is based on DPA and exploits thread-level parallelism in order to increase performance. Besides, DMTA can be executed over heterogeneous hardware platforms, using different numbers of cores. The results showed that DMTA outperforms DPA, presents load balance among processes and threads, and is effective on current multicore architectures.
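To illustrate how Apriori's support counting can be spread over threads, here is a minimal sketch. It is not DPA or DMTA: it assumes the generic "count distribution" scheme, in which each worker counts every candidate over its own slice of the transactions and the partial counts are then summed. The function names and the default `workers` value are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_chunk(args):
    """Count how many transactions in one chunk contain each candidate."""
    transactions, candidates = args
    return [sum(1 for t in transactions if c <= t) for c in candidates]


def frequent_itemsets(transactions, min_support, workers=2):
    """Level-wise Apriori with per-chunk counting done by a thread pool."""
    transactions = [frozenset(t) for t in transactions]
    chunks = [transactions[i::workers] for i in range(workers)]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    result = {}
    k = 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while candidates:
            # Each worker scans only its own chunk; partial counts are summed.
            partial = list(pool.map(count_chunk, [(ch, candidates) for ch in chunks]))
            counts = [sum(col) for col in zip(*partial)]
            frequent = [c for c, n in zip(candidates, counts) if n >= min_support]
            result.update({c: n for c, n in zip(candidates, counts) if n >= min_support})
            k += 1
            # Next level: k-itemsets whose (k-1)-subsets are all frequent.
            pool_items = sorted({i for c in frequent for i in c})
            candidates = [
                frozenset(s)
                for s in combinations(pool_items, k)
                if all(frozenset(sub) in result for sub in combinations(s, k - 1))
            ]
    return result


data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(frequent_itemsets(data, min_support=3))
```

The round-robin split `transactions[i::workers]` is a static partition; the load balancing across heterogeneous cores that the abstract describes would need a dynamic assignment, which this sketch does not attempt.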
High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm
Directory of Open Access Journals (Sweden)
Dieter Hendricks
2016-02-01
Full Text Available We implement a master-slave parallel genetic algorithm with a bespoke log-likelihood fitness function to identify emergent clusters within price evolutions. We use graphics processing units (GPUs) to implement a parallel genetic algorithm and visualise the results using disjoint minimal spanning trees. We demonstrate that our GPU parallel genetic algorithm, implemented on a commercially available general purpose GPU, is able to recover stock clusters in sub-second speed, based on a subset of stocks in the South African market. This approach represents a pragmatic choice for low-cost, scalable parallel computing and is significantly faster than a prototype serial implementation in an optimised C-based fourth-generation programming language, although the results are not directly comparable because of compiler differences. Combined with fast online intraday correlation matrix estimation from high frequency data for cluster identification, the proposed implementation offers cost-effective, near-real-time risk assessment for financial practitioners.
A dataflow analysis tool for parallel processing of algorithms
Jones, Robert L., III
1993-01-01
A graph-theoretic design process and software tool is presented for selecting a multiprocessing scheduling solution for a class of computational problems. The problems of interest are those that can be described using a dataflow graph and are intended to be executed repetitively on a set of identical parallel processors. Typical applications include signal processing and control law problems. Graph analysis techniques are introduced and shown to effectively determine performance bounds, scheduling constraints, and resource requirements. The software tool is shown to facilitate the application of the design process to a given problem.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
A hierarchical approach to reducing communication in parallel graph algorithms
Harshvardhan
2015-01-01
Large-scale graph computing has become critical due to the ever-increasing size of data. However, distributed graph computations are limited in their scalability and performance due to the heavy communication inherent in such computations. This is exacerbated in scale-free networks, such as social and web graphs, which contain hub vertices that have large degrees and therefore send a large number of messages over the network. Furthermore, many graph algorithms and computations send the same data to each of the neighbors of a vertex. Our proposed approach recognizes this, and reduces communication performed by the algorithm without change to user-code, through a hierarchical machine model imposed upon the input graph. The hierarchical model takes advantage of locale information of the neighboring vertices to reduce communication, both in message volume and total number of bytes sent. It is also able to better exploit the machine hierarchy to further reduce the communication costs, by aggregating traffic between different levels of the machine hierarchy. Results of an implementation in the STAPL GL show improved scalability and performance over the traditional level-synchronous approach, with 2.5× to 8× improvement for a variety of graph algorithms at 12,000+ cores.
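The saving from sending one copy of a hub vertex's data per remote locale, instead of one copy per remote neighbor, can be illustrated with a simple count. This is a deliberately simplified model and not the STAPL implementation: `edges`, `owner`, and `count_messages` are hypothetical names, and the model assumes the receiving locale fans the value out to its local neighbors.

```python
def count_messages(edges, owner, source):
    """Compare message counts for one vertex broadcasting to its neighbors.

    edges:  dict mapping a vertex to the list of its neighbors
    owner:  dict mapping a vertex to the locale (machine) that stores it
    Returns (flat, aggregated):
      flat       - one message per neighbor stored on another locale
      aggregated - one message per distinct remote locale
    """
    home = owner[source]
    flat = sum(1 for v in edges[source] if owner[v] != home)
    remote_locales = {owner[v] for v in edges[source]} - {home}
    return flat, len(remote_locales)


# A hub vertex 0 on locale 0 with six neighbors spread over three locales.
edges = {0: [1, 2, 3, 4, 5, 6]}
owner = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2}
print(count_messages(edges, owner, 0))
```

In this toy layout the hub sends five messages without aggregation and two with it; the gap widens as vertex degree grows relative to the number of locales, which is the situation the abstract describes for scale-free graphs.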
Gong, Yiyuan; Guan, Senlin; Nakamura, Morikazu
This paper investigates migration effects of parallel genetic algorithms (GAs) on the line topology of heterogeneous computing resources. The evolution process of parallel GAs is evaluated experimentally on two types of arrangements of heterogeneous computing resources: the ascending and descending order arrangements. Migration effects are evaluated from the viewpoints of scalability, chromosome diversity, migration frequency and solution quality. The results reveal that the performance of parallel GAs strongly depends on the design of the chromosome migration, in which the arrangement of heterogeneous computing resources, the migration frequency and so on must be considered. The results provide a reference scheme for implementing parallel GAs on heterogeneous computing resources.
A Runtime Analysis of Parallel Evolutionary Algorithms in Dynamic Optimization
DEFF Research Database (Denmark)
Lissovoi, Andrei; Witt, Carsten
2017-01-01
A simple island model with (Formula presented.) islands and migration occurring after every (Formula presented.) iterations is studied on the dynamic fitness function Maze. This model is equivalent to a (Formula presented.) EA if (Formula presented.), i.e., migration occurs during every iteration. It is proved that even for an increased offspring population size up to (Formula presented.), the (Formula presented.) EA is still not able to track the optimum of Maze. If the migration interval is chosen carefully, the algorithm is able to track the optimum even for logarithmic (Formula presented...
Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication
2013-05-17
machines. The results in this chapter are joint work with Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. The algorithm, analysis, and...LINPACK score of 1.05 Tflop/s on a matrix of dimension about 4.5 million. ...IBM’s ESSL version 4.4.1-0. As of November 2011, it was ranked number 23 on the TOP500 list [53], with a LINPACK score of 459 Tflop/s. Intrepid allows
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.
Chen, Chunlei; He, Li; Zhang, Huixiang; Zheng, Hao; Wang, Lei
2017-01-01
Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering place high demands on the computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, the General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, an incremental clustering algorithm faces a dilemma between clustering accuracy and parallelism when it is powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering, such as evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that the accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified the theoretical conclusions.
A block-wise approximate parallel implementation for ART algorithm on CUDA-enabled GPU.
Fan, Zhongyin; Xie, Yaoqin
2015-01-01
Computed tomography (CT) has been widely used to acquire volumetric anatomical information in the diagnosis and treatment of illnesses in many clinics. However, the ART algorithm for reconstruction from under-sampled and noisy projections is still time-consuming. The goal of our work is to improve a block-wise approximate parallel implementation of the ART algorithm on CUDA-enabled GPUs to make the ART algorithm applicable to the clinical environment. The resulting method has several compelling features: (1) the rays are allotted into blocks, making the rays in the same block parallel; (2) the GPU implementation caters to actual industrial and medical application demands. We test the algorithm on a digital Shepp-Logan phantom, and the results indicate that our method is more efficient than the existing CPU implementation. The high computational efficiency achieved by our algorithm makes it possible for clinicians to obtain real-time 3D images.
Parallel implementation of a watershed algorithm on shared memory multicore architecture
Braham, Yosra; Akil, Mohamed; Bedoui, Mohamed Hédi
2017-03-01
The watershed transform is widely used in image segmentation. In the literature, this transform is computed by various algorithms, among which is the M-border kernel algorithm [1]. This algorithm computes the watershed transform in the framework of edge-weighted graphs. It is based on a local property that makes it suited to parallelization. In this paper we propose a parallel implementation of this algorithm. We start by studying the data dependency problem that it raises. We then give an approach that overcomes this problem, based on an alternated edge processing strategy. The implementation of this strategy on a shared memory multicore architecture using a Single Program Multiple Data (SPMD) approach proves its effectiveness. In fact, experimental results show that our implementation achieves a relative speedup factor of 2.8 using 4 processors over the performance of the sequential algorithm using a single processor on the same system.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Directory of Open Access Journals (Sweden)
Jinwei Wang
2014-01-01
Full Text Available The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia’s GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua
2011-01-01
A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure: all computations are executed at a single updating rate that is sufficiently high to accurately account for high-frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are independent of the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased, while using more gyro incremental angle and accelerometer incremental velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm to meet the real-time and high-precision requirements of the system in a high dynamic environment, relative to the existing implementation on the DSP platform.
On the Optimization and Parallelizing Little Algorithm for Solving the Traveling Salesman Problem
Directory of Open Access Journals (Sweden)
V. V. Vasilchikov
2016-01-01
Full Text Available The paper describes some ways to accelerate solving the NP-complete Traveling Salesman Problem. The classic Little algorithm, belonging to the category of "branch and bound methods", can solve it both for directed and undirected graphs. However, for undirected graphs its operation can be accelerated by eliminating the consideration of branches examined earlier. The paper proposes changes to be made in the key operations of the algorithm to speed up its execution. It also describes the results of an experiment that demonstrated a significant acceleration of solving the problem by using the modified algorithm. Another way to speed up the work is to parallelize the algorithm. For problems of this kind it is difficult to break the task into a sufficient number of subtasks having comparable complexity. Their parallelism arises dynamically during the execution. For such problems, it seems reasonable to use parallel-recursive algorithms. In our case the use of the library RPM ParLib developed by the author was a good choice. It allows us to develop effective applications for parallel computing on a local network using any .NET-compatible programming language. We used C# to develop the programs. Parallel applications were developed for both the basic and the modified algorithms, and their speeds were compared. Experiments were performed for graphs with up to 45 vertices and with up to 16 network computers. We also investigated the acceleration that can be achieved by parallelizing the basic Little algorithm for directed graphs. The results of these experiments are also presented in the paper.
PMCR-Miner: parallel maximal confident association rules miner algorithm for microarray data set.
Zakaria, Wael; Kotb, Yasser; Ghaleb, Fayed F M
2015-01-01
The MCR-Miner algorithm aims to mine all maximal high confident association rules from the microarray up/down-expressed genes data set. This paper introduces two new algorithms: IMCR-Miner and PMCR-Miner. The IMCR-Miner algorithm is an extension of the MCR-Miner algorithm with some improvements. These improvements implement a novel way to store the samples of each gene in a list of unsigned integers in order to benefit from bitwise operations. In addition, the IMCR-Miner algorithm overcomes the drawbacks faced by the MCR-Miner algorithm by setting some restrictions to ignore repeated comparisons. The PMCR-Miner algorithm is a parallel version of the newly proposed IMCR-Miner algorithm. The PMCR-Miner algorithm is based on shared-memory systems and task parallelism, where no time is needed in the process of sharing and combining data between processors. The experimental results on real microarray data sets show that the PMCR-Miner algorithm is more efficient and scalable than its counterparts.
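The benefit of storing each gene's samples as bit patterns can be sketched briefly. This is an illustration of the general bitset technique, not the IMCR-Miner code: it assumes a single arbitrary-precision Python integer per gene in place of the paper's list of unsigned integers, and the helper names are hypothetical.

```python
def to_bitmask(sample_ids):
    """Encode the set of samples in which a gene is expressed as an integer."""
    mask = 0
    for s in sample_ids:
        mask |= 1 << s
    return mask


def support(masks):
    """Number of samples shared by all genes: AND the masks, count set bits."""
    common = masks[0]
    for m in masks[1:]:
        common &= m
    return bin(common).count("1")


def confidence(antecedent_masks, consequent_masks):
    """Confidence of the rule antecedent -> consequent over the sample set."""
    sup_a = support(antecedent_masks)
    if sup_a == 0:
        return 0.0
    return support(antecedent_masks + consequent_masks) / sup_a


g1 = to_bitmask([0, 1, 2, 5])   # gene 1 is up-expressed in samples 0, 1, 2, 5
g2 = to_bitmask([1, 2, 3, 5])   # gene 2 is up-expressed in samples 1, 2, 3, 5
print(support([g1, g2]), confidence([g1], [g2]))
```

One AND plus one population count replaces a per-sample set intersection, which is where the speed gain of such an encoding typically comes from.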
A Novel Discrete Fruit Fly Optimization Algorithm for Intelligent Parallel Test sheets Generation
Directory of Open Access Journals (Sweden)
Wang Fengrui
2015-01-01
Full Text Available Parallel test sheet generation (PTSG) is an NP-hard combinatorial optimization problem, in which a test sheet generation algorithm with high quality and efficiency is the core technology. The basic fruit fly optimization algorithm (FOA) has the defects of easily falling into local optima and low convergence precision when solving the PTSG problem. In this paper, a novel discrete fruit fly optimization algorithm is proposed to solve the PTSG problem, in which a discrete osphesis searching operator based on problem-specific knowledge is designed to help the FOA escape from local minima. To evaluate the performance of the proposed algorithm, simulation experiments were conducted using a series of item banks with different scales. The superiority of the proposed algorithm is demonstrated by comparing it with the particle swarm optimization algorithm and the differential evolution algorithm.
Algorithms for a parallel implementation of Hidden Markov Models with a small state space
DEFF Research Database (Denmark)
Nielsen, Jesper; Sand, Andreas
2011-01-01
Two of the most important algorithms for Hidden Markov Models are the forward and the Viterbi algorithms. We show how formulating these using linear algebra naturally lends itself to parallelization. Although the obtained algorithms are slow for Hidden Markov Models with large state spaces, they require very little communication between processors, and are fast in practice on models with a small state space. We have tested our implementation against two other implementations on artificial data and observe a speed-up of roughly a factor of 5 for the forward algorithm and more than 6...
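The linear-algebra view of the forward algorithm can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each observation contributes a matrix C with C[j][i] = emit[j][symbol] * trans[i][j], so the forward vector is a product of such matrices, and because matrix multiplication is associative the product can be split into blocks that are multiplied independently. All function names, the block count, and the toy model are hypothetical.

```python
from functools import reduce


def mat_mul(x, y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def step_matrix(trans, emit, symbol):
    """C[j][i] = emit[j][symbol] * trans[i][j], so alpha_t = C * alpha_(t-1)."""
    n = len(trans)
    return [[emit[j][symbol] * trans[i][j] for i in range(n)] for j in range(n)]


def forward_sequential(init, trans, emit, obs):
    """Textbook forward recursion; returns the likelihood of obs."""
    n = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [emit[j][symbol] * sum(alpha[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)


def forward_blocked(init, trans, emit, obs, blocks=2):
    """Same likelihood, computed as a product of per-block matrix products.

    Each block product is independent of the others, so the blocks could be
    handed to different processors and combined at the end.
    """
    n = len(init)
    alpha0 = [[init[i] * emit[i][obs[0]]] for i in range(n)]  # column vector
    mats = [step_matrix(trans, emit, s) for s in obs[1:]]
    if not mats:
        return sum(row[0] for row in alpha0)
    size = -(-len(mats) // blocks)  # ceiling division
    chunks = [mats[k:k + size] for k in range(0, len(mats), size)]
    # Later observations multiply from the left: C_b * ... * C_a.
    partial = [reduce(lambda acc, c: mat_mul(c, acc), chunk) for chunk in chunks]
    total = reduce(lambda acc, p: mat_mul(p, acc), partial)
    return sum(row[0] for row in mat_mul(total, alpha0))


init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 0, 1, 1]
print(forward_sequential(init, trans, emit, obs),
      forward_blocked(init, trans, emit, obs))
```

The blocked form replaces matrix-vector steps with matrix-matrix products, which cost more per observation as the number of states grows; that is consistent with the abstract's remark that the approach pays off for small state spaces.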
A Multiple-Heaps Algorithm for Parallel Simulation of Collision Systems
Mu, Mo
2002-07-01
We consider the parallel simulation of collision systems. It has wide application, such as in hard-sphere molecular dynamics simulation for gas dynamics and crystals, as well as in studying molecular collision dynamics of chemical reactions. With detailed analysis, proper data structures are designed so that the central computational task is formulated as a consecutive search for the minimum in the collision time space of O(N^2) entries, with multiple updates on O(N) entries in the same space per collision step. The abstraction and formulation enable us to incorporate efficient techniques in computer science into this application, which leads to a heap-based sequential algorithm of O(N log N) time in one typical collision step, where N is the number of particles of the simulated collision system. A parallel algorithm of multiple heaps with a diagonal-oriented mapping is then proposed. We show that the parallel algorithm is load balanced and the parallel time per collision step is O((N/P) log(N^2/P) + log P), where P is the number of processors. The parallel algorithm uses two levels of partitioning independently, one in the particle-based physical space and the other in the collision time space. An exchange-shift communication algorithm is presented to bridge the two different partitioning schemes. Besides collision system simulation, the parallel multiple heaps algorithm may find applications in many other computing areas where a heap-based priority queue needs to be maintained, such as in fast level-set methods.
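The central task described above, repeatedly extracting the earliest predicted collision while other predictions become obsolete, can be sketched with a single binary heap. This is a sequential illustration only and makes an assumption the abstract does not state: stale entries are handled by lazy invalidation (a per-particle collision counter) rather than by updating heap entries in place. The class and method names are hypothetical.

```python
import heapq


class CollisionQueue:
    """Min-heap of predicted collisions with lazy invalidation."""

    def __init__(self):
        self._heap = []
        self._version = {}  # particle id -> collisions it has undergone so far

    def schedule(self, time, i, j):
        """Record a predicted collision between particles i and j."""
        entry = (time, i, j, self._version.get(i, 0), self._version.get(j, 0))
        heapq.heappush(self._heap, entry)

    def pop_next(self):
        """Return the earliest still-valid collision, or None if none is left.

        A prediction is stale if either particle has collided since it was
        scheduled, because its trajectory has changed.
        """
        while self._heap:
            time, i, j, vi, vj = heapq.heappop(self._heap)
            if vi == self._version.get(i, 0) and vj == self._version.get(j, 0):
                self._version[i] = vi + 1
                self._version[j] = vj + 1
                return time, i, j
        return None


q = CollisionQueue()
q.schedule(2.0, 0, 1)
q.schedule(1.0, 1, 2)
q.schedule(3.0, 0, 3)
print(q.pop_next())  # earliest event involves particles 1 and 2
print(q.pop_next())  # the (0, 1) prediction is stale, so (0, 3) comes next
```

A full simulation would reschedule new predictions for the two particles after each popped event; the multiple-heap parallel scheme in the abstract partitions this collision time space across processors, which the sketch does not attempt.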
Customizing FP-growth algorithm to parallel mining with Charm++ library
Puścian, Marek
2017-08-01
This paper presents a frequent item mining algorithm that was customized to handle growing data repositories. The proposed solution applies a master-slave scheme to the frequent pattern growth technique. Efficient utilization of available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers based on their workload. The proposed enhancements have been successfully implemented using the Charm++ library. This paper discusses the performance of the parallelized FP-growth algorithm on different datasets. The approach has been illustrated with many experiments and measurements performed on a multiprocessor, multithreaded computer.
Parallel genetic algorithms with migration for the hybrid flow shop scheduling problem
Directory of Open Access Journals (Sweden)
K. Belkadi
2006-01-01
This paper addresses scheduling problems in hybrid flow shop-like systems with a migration parallel genetic algorithm (PGA_MIG). This parallel genetic algorithm model allows genetic diversity by the application of selection and reproduction mechanisms nearer to nature. The space structure of the population is modified by dividing it into disjoint subpopulations. From time to time, individuals are exchanged between the different subpopulations (migration). Influence of parameters and dedicated strategies are studied. These parameters are the number of independent subpopulations, the interconnection topology between subpopulations, the choice/replacement strategy of the migrant individuals, and the migration frequency. A comparison between the sequential and parallel version of genetic algorithm (GA) is provided. This comparison relates to the quality of the solution and the execution time of the two versions. The efficiency of the parallel model highly depends on the parameters and especially on the migration frequency. In the same way this parallel model gives a significant improvement of computational time if it is implemented on a parallel architecture which offers an acceptable number of processors (as many processors as subpopulations).
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine
Lee, C. S. G.; Lin, C. T.
1989-01-01
The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.
A Parallel Adaptive Particle Swarm Optimization Algorithm for Economic/Environmental Power Dispatch
Directory of Open Access Journals (Sweden)
Jinchao Li
2012-01-01
A parallel adaptive particle swarm optimization algorithm (PAPSO) is proposed for economic/environmental power dispatch, which can overcome premature convergence, slow convergence in the late evolutionary phase, and the lack of good direction in the particles' evolutionary process. A search population is randomly divided into several subpopulations. Then for each subpopulation, the optimal solution is searched synchronously using the proposed method, and thus parallel computing is realized. To avoid converging to a local optimum, a crossover operator is introduced to exchange the information among the subpopulations and the diversity of population is sustained simultaneously. Simulation results show that the proposed algorithm can effectively solve the economic/environmental operation problem of hydropower generating units. Performance comparisons show that the solution from the proposed method is better than those from the conventional particle swarm algorithm and other optimization algorithms.
Feed-forward volume rendering algorithm for moderately parallel MIMD machines
Yagel, Roni
1993-01-01
Algorithms for direct volume rendering on parallel and vector processors are investigated. Volumes are transformed efficiently on parallel processors by dividing the data into slices and beams of voxels. Equal sized sets of slices along one axis are distributed to processors. Parallelism is achieved at two levels. Because each slice can be transformed independently of others, processors transform their assigned slices with no communication, thus providing maximum possible parallelism at the first level. Within each slice, consecutive beams are incrementally transformed using coherency in the transformation computation. Also, coherency across slices can be exploited to further enhance performance. This coherency yields the second level of parallelism through the use of the vector processing or pipelining. Other ongoing efforts include investigations into image reconstruction techniques, load balancing strategies, and improving performance.
A Hybrid Parallel Algorithm for Computing and Tracking Level Set Topology
Maadasamy, Senthilnathan; Doraiswamy, Harish; Natarajan, Vijay
2012-01-01
The contour tree is a topological abstraction of a scalar field that captures evolution in level set connectivity. It is an effective representation for visual exploration and analysis of scientific data. We describe a work-efficient, output sensitive, and scalable parallel algorithm for computing the contour tree of a scalar field defined on a domain that is represented using either an unstructured mesh or a structured grid. A hybrid implementation of the algorithm using the GPU and multi-co...
Advanced Algorithms and Automation Tools for Discrete Ordinates Methods in Parallel Environments
Energy Technology Data Exchange (ETDEWEB)
Alireza Haghighat
2003-05-07
This final report discusses major accomplishments of a 3-year project under the DOE's NEER Program. The project has developed innovative and automated algorithms, codes, and tools for solving the discrete ordinates particle transport method efficiently in parallel environments. Using a number of benchmark and real-life problems, the performance and accuracy of the new algorithms have been measured and analyzed.
GridVis: Visualisation of Island-based parallel genetic algorithms
Lutton, Evelyne; Gilbert, Hugo; Cancino, Waldo; Bach, Benjamin; Parrend, Pierre; Pierre, Collet
2014-01-01
Island Model parallel genetic algorithms rely on various migration models and their associated parameter settings. A fine understanding of how the islands interact and exchange information is an important issue for the design of efficient algorithms. This article presents GridVis, an interactive tool for visualising the exchange of individuals and the propagation of fitness values between islands. We performed several experiments on a grid and on a cluster to evaluate GridVis' abilit...
Jacobian free monotonic descent algorithm for forward kinematics of spatial parallel manipulator
Directory of Open Access Journals (Sweden)
Gang Shen
2016-04-01
In order to efficiently solve the forward kinematics of parallel manipulators for real-time applications, a Jacobian free monotonic descent algorithm is proposed in this article. The system of nonlinear equations of a specified 6-degree-of-freedom parallel manipulator is established using a geometric analysis method. The proposed Jacobian free monotonic descent algorithm is modified using a traditional Newton–Raphson method by employing a first-order Taylor series expansion to numerically approximate a Jacobian matrix. A monotonic descent factor is employed for preventing the iteration from divergence even with poor initial conditions. The proposed algorithm inherits the merits of the Newton–Raphson algorithm and overcomes its drawbacks. The Jacobian free monotonic descent algorithm is programmed in MATLAB/Simulink and then is compiled to a real-time PC system with xPC target technology for implementation. The experimental results demonstrate that the proposed Jacobian free monotonic descent algorithm is effective and feasible for the real-time forward kinematics of parallel manipulators in terms of accuracy, convergence, and execution time.
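The two ingredients named in this abstract, a numerically approximated Jacobian and a descent factor that keeps the residual from growing, can be sketched in a few lines of Python. This is only one possible reading: the toy two-equation system, the forward-difference step h, and the step-halving rule are assumptions chosen for illustration, not the manipulator equations or the exact descent factor of the article.

```python
import math


def residual(x):
    """Toy 2-equation system standing in for the manipulator constraints (assumed)."""
    a, b = x
    return [a * a + b * b - 4.0, a * b - 1.0]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def fd_jacobian(f, x, h=1e-7):
    """Forward-difference approximation of the 2x2 Jacobian of f at x."""
    f0 = f(x)
    cols = []
    for k in range(2):
        xk = list(x)
        xk[k] += h
        fk = f(xk)
        cols.append([(fk[r] - f0[r]) / h for r in range(2)])
    # cols[k][r] approximates d f_r / d x_k; return the matrix row by row.
    return [[cols[0][0], cols[1][0]], [cols[0][1], cols[1][1]]]


def solve2(J, rhs):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    return [(rhs[0] * J[1][1] - J[0][1] * rhs[1]) / det,
            (J[0][0] * rhs[1] - rhs[0] * J[1][0]) / det]


def damped_newton(f, x0, tol=1e-10, max_iter=50):
    """Newton iteration with a step factor that enforces a decreasing residual."""
    x = list(x0)
    for _ in range(max_iter):
        fx = f(x)
        if norm(fx) < tol:
            break
        step = solve2(fd_jacobian(f, x), [-c for c in fx])
        lam = 1.0
        # Halve the step factor until the residual norm actually decreases.
        while lam > 1e-8:
            trial = [x[i] + lam * step[i] for i in range(2)]
            if norm(f(trial)) < norm(fx):
                x = trial
                break
            lam *= 0.5
        else:
            break  # no descent step found; give up
    return x


root = damped_newton(residual, [3.0, 0.2])
```

The step-halving loop is what makes each accepted iterate strictly reduce the residual norm, which is the sense of "monotonic descent" assumed here.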
A scalable method for parallelizing sampling-based motion planning algorithms
Jacobs, Sam Ade
2012-05-01
This paper describes a scalable method for parallelizing sampling-based motion planning algorithms. It subdivides configuration space (C-space) into (possibly overlapping) regions and independently, in parallel, uses standard (sequential) sampling-based planners to construct roadmaps in each region. Next, in parallel, regional roadmaps in adjacent regions are connected to form a global roadmap. By subdividing the space and restricting the locality of connection attempts, we reduce the work and inter-processor communication associated with nearest neighbor calculation, a critical bottleneck for scalability in existing parallel motion planning methods. We show that our method is general enough to handle a variety of planning schemes, including the widely used Probabilistic Roadmap (PRM) and Rapidly-exploring Random Trees (RRT) algorithms. We compare our approach to two other existing parallel algorithms and demonstrate that our approach achieves better and more scalable performance. Our approach achieves almost linear scalability on a 2400 core LINUX cluster and on a 153,216 core Cray XE6 petascale machine. © 2012 IEEE.
Parallel-tempering cluster algorithm for computer simulations of critical phenomena.
Bittner, Elmar; Janke, Wolfhard
2011-09-01
In finite-size scaling analyses of Monte Carlo simulations of second-order phase transitions one often needs an extended temperature range around the critical point. By combining the parallel-tempering algorithm with cluster updates and an adaptive routine to find the temperature window of interest, we introduce a flexible and powerful method for systematic investigations of critical phenomena. As a result, we gain one to two orders of magnitude in the performance for two- and three-dimensional Ising models in comparison with the recently proposed Wang-Landau recursion for cluster algorithms based on the multibondic algorithm, which is already a great improvement over the standard multicanonical variant.
Energy Technology Data Exchange (ETDEWEB)
Ellison, C. Leland [PPPL; Finn, J. M. [LANL; Qin, H. [PPPL; Tang, William M. [PPPL
2014-10-01
Structure-preserving algorithms obtained via discrete variational principles exhibit strong promise for the calculation of guiding center test particle trajectories. The non-canonical Hamiltonian structure of the guiding center equations forms a novel and challenging context for geometric integration. To demonstrate the practical relevance of these methods, a prototypical variational midpoint algorithm is applied to an experimental magnetic equilibrium. The stability characteristics, conservation properties, and implementation requirements associated with the variational algorithms are addressed. Furthermore, computational run time is reduced for large numbers of particles by parallelizing the calculation on GPU hardware.
DEFF Research Database (Denmark)
Dollerup, Niels; Jepsen, Michael S.; Frier, Christian
2014-01-01
in the determination of the search direction in each iteration step, and the formulation also allows for parallel computation. The implementation has been used in load optimization of reinforced concrete slabs but is fully general. Different examples are treated to benchmark the algorithm with previous work...
Yazdani, Roghayeh; Fallah, Hamid R; Hajimahmoodzadeh, Morteza
2014-03-15
We numerically and experimentally demonstrate an iterative method to simultaneously reconstruct two unknown interfering wavefronts. A three-dimensional interference pattern is analyzed and then Zernike polynomials and the stochastic parallel gradient descent algorithm are used to expand and calculate wavefronts.
Creating IRT-Based Parallel Test Forms Using the Genetic Algorithm Method
Sun, Koun-Tem; Chen, Yu-Jen; Tsai, Shu-Yen; Cheng, Chien-Fen
2008-01-01
In educational measurement, the construction of parallel test forms is often a combinatorial optimization problem that involves the time-consuming selection of items to construct tests having approximately the same test information functions (TIFs) and constraints. This article proposes a novel method, genetic algorithm (GA), to construct parallel…
PARALLEL ADAPTIVE MULTILEVEL SAMPLING ALGORITHMS FOR THE BAYESIAN ANALYSIS OF MATHEMATICAL MODELS
Prudencio, Ernesto
2012-01-01
In recent years, Bayesian model updating techniques based on measured data have been applied to many engineering and applied science problems. At the same time, parallel computational platforms are becoming increasingly more powerful and are being used more frequently by the engineering and scientific communities. Bayesian techniques usually require the evaluation of multi-dimensional integrals related to the posterior probability density function (PDF) of uncertain model parameters. The fact that such integrals cannot be computed analytically motivates the research of stochastic simulation methods for sampling posterior PDFs. One such algorithm is the adaptive multilevel stochastic simulation algorithm (AMSSA). In this paper we discuss the parallelization of AMSSA, formulating the necessary load balancing step as a binary integer programming problem. We present a variety of results showing the effectiveness of load balancing on the overall performance of AMSSA in a parallel computational environment.
Pretty Fast Analysis: An embarrassingly parallel algorithm for biological simulation analysis
Lebard, David N
2008-01-01
A parallel code has been written in FORTRAN90, C, and MPI for the analysis of biological simulation data. Using a master/slave algorithm, the software operates on AMBER generated trajectory data using either UNIX or MPI file IO, and it supports up to 15 simultaneous function calls. This software has been performance tested on the Ranger Supercomputer on trajectory data of an aqueous bacterial reaction center micelle. Although the parallel reading is poor, the analysis algorithm itself shows embarrassingly parallel speedup up to 1024 compute nodes. At this CPU count the overall scaling of the software compares well with NAMD's best reported speedup, and outperforms AMBER's best known scaling by a factor of 3, while using only a small number of function calls and a short trajectory length.
Fast parallel molecular algorithms for DNA-based computation: factoring integers.
Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui
2005-06-01
The RSA public-key cryptosystem is an algorithm that converts input data to an unrecognizable encryption and converts the unrecognizable data back into its original decryption form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates how to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that the cryptosystems using public-key are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
Wang, Congzhe; Fang, Yuefa; Guo, Sheng
2015-07-01
Dimensional synthesis is one of the most difficult issues in the field of parallel robots with actuation redundancy. To deal with the optimal design of a redundantly actuated parallel robot used for ankle rehabilitation, a methodology of dimensional synthesis based on multi-objective optimization is presented. First, the dimensional synthesis of the redundant parallel robot is formulated as a nonlinear constrained multi-objective optimization problem. Then four objective functions, separately reflecting occupied space, input/output transmission and torque performances, and multi-criteria constraints, such as dimension, interference and kinematics, are defined. In consideration of the passive exercise of plantar/dorsiflexion requiring large output moment, a torque index is proposed. To cope with the actuation redundancy of the parallel robot, a new output transmission index is defined as well. The multi-objective optimization problem is solved by using a modified Differential Evolution (DE) algorithm, which is characterized by new selection and mutation strategies. Meanwhile, a special penalty method is presented to tackle the multi-criteria constraints. Finally, numerical experiments for different optimization algorithms are implemented. The computation results show that the proposed indices of output transmission and torque, and constraint handling are effective for the redundant parallel robot; the modified DE algorithm is superior to the other tested algorithms, in terms of the ability of global search and the number of non-dominated solutions. The proposed methodology of multi-objective optimization can also be applied to the dimensional synthesis of other redundantly actuated parallel robots only with rotational movements.
A new asynchronous parallel algorithm for inferring large-scale gene regulatory networks.
Directory of Open Access Journals (Sweden)
Xiangyun Xiao
The reconstruction of gene regulatory networks (GRNs) from high-throughput experimental data has been considered one of the most important issues in systems biology research. With the development of high-throughput technology and the complexity of biological problems, we need to reconstruct GRNs that contain thousands of genes. However, when many existing algorithms are used to handle these large-scale problems, they will encounter two important issues: low accuracy and high computational cost. To overcome these difficulties, the main goal of this study is to design an effective parallel algorithm to infer large-scale GRNs based on high-performance parallel computing environments. In this study, we proposed a novel asynchronous parallel framework to improve the accuracy and lower the time complexity of large-scale GRN inference by combining splitting technology and ordinary differential equation (ODE)-based optimization. The presented algorithm uses the sparsity and modularity of GRNs to split whole large-scale GRNs into many small-scale modular subnetworks. Through the ODE-based optimization of all subnetworks in parallel and their asynchronous communications, we can easily obtain the parameters of the whole network. To test the performance of the proposed approach, we used well-known benchmark datasets from the Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge, an experimentally determined GRN of Escherichia coli, and one published dataset that contains more than 10 thousand genes to compare the proposed approach with several popular algorithms on the same high-performance computing environments in terms of both accuracy and time complexity. The numerical results demonstrate that our parallel algorithm exhibits obvious superiority in inferring large-scale GRNs.
A divide-and-inner product parallel algorithm for polynomial evaluation
Energy Technology Data Exchange (ETDEWEB)
Hu, Jie; Li, Lei [Aomori Univ. (Japan); Nakamura, Tadao [Tohoku Univ., Sendai (Japan)
1994-12-31
In this paper, a divide-and-inner product parallel algorithm for evaluating a polynomial of degree N (N+1=KL) on a MIMD computer is presented. It needs 2K + log_2 L steps to evaluate a polynomial of degree N in parallel on L+1 processors (L <= 2K - 2 log_2 K), which is a decrease of log_2 L steps as compared with the L-order Horner's method, and a decrease of (2 log_2 L)^(1/2) steps as compared with some MIMD algorithms. The new algorithm is simple in structure and easy to implement.
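One plausible reading of the "divide" and "inner product" stages is sketched below in Python: the KL coefficients are split into L blocks of K coefficients, each block is evaluated independently with Horner's rule, and the block values are combined through an inner product with powers of x^K. The function names and the block layout are assumptions for illustration; the abstract does not give the processor mapping or step schedule, and this sequential sketch does not reproduce the stated step counts.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def divide_and_inner_product(coeffs, x, K):
    """Evaluate a polynomial whose K*L coefficients are split into L blocks of K."""
    if len(coeffs) % K != 0:
        raise ValueError("number of coefficients must be a multiple of K")
    L = len(coeffs) // K
    blocks = [coeffs[j * K:(j + 1) * K] for j in range(L)]
    # The L Horner evaluations are independent of each other; on a MIMD
    # machine each one could be assigned to its own processor.
    partial = [horner(block, x) for block in blocks]
    # Inner product of the block values with 1, x^K, x^(2K), ...
    x_to_K = x ** K
    weights = [x_to_K ** j for j in range(L)]
    return sum(q * w for q, w in zip(partial, weights))


# 3 - 2x^2 + 5x^3 + x^4 + 4x^5, i.e. N = 5 and N + 1 = K * L with K = 3, L = 2
coeffs = [3, 0, -2, 5, 1, 4]
value = divide_and_inner_product(coeffs, 2, K=3)
```

The identity behind the sketch is p(x) = q_0(x) + x^K q_1(x) + ... + x^((L-1)K) q_(L-1)(x), where each q_j has degree K-1.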
A parallel nonlinear adaptive enhancement algorithm for low- or high-intensity color images
Zhou, Zhigang; Sang, Nong; Hu, Xinrong
2014-12-01
This article addresses the problem of color image enhancement for images with low or high intensity and poor contrast (LIPC or HIPC). A parallel nonlinear adaptive enhancement (PNAE) algorithm using information from local neighborhood is presented to resolve the problem in parallel. The PNAE algorithm consists of three steps. First, a red-green-blue (RGB) color image is converted to an intensity image, then an adaptive intensity adjustment with local contrast enhancement is performed in parallel, and finally, colors are restored. The PNAE algorithm can be adjusted to control the level of enhancement on the overall lightness and the contrast achieved at the output separately. Most of the parameters used in PNAE are robust for LIPC and HIPC color image enhancement. Experimental results show that PNAE outperforms two popular methods in both computational efficiency and overall content preservation of image while improving local contrast for LIPC and HIPC image enhancement.
Directory of Open Access Journals (Sweden)
Yu Huang
Parameter estimation for fractional-order chaotic systems is an important issue in fractional-order chaotic control and synchronization and could be essentially formulated as a multidimensional optimization problem. A novel algorithm called quantum parallel particle swarm optimization (QPPSO) is proposed to solve the parameter estimation for fractional-order chaotic systems. The parallel characteristic of quantum computing is used in QPPSO. This characteristic increases the calculation of each generation exponentially. The behavior of particles in quantum space is restrained by the quantum evolution equation, which consists of the current rotation angle, individual optimal quantum rotation angle, and global optimal quantum rotation angle. Numerical simulation based on several typical fractional-order systems and comparisons with some typical existing algorithms show the effectiveness and efficiency of the proposed algorithm.
Huang, Yu; Guo, Feng; Li, Yongling; Liu, Yufeng
2015-01-01
Parameter estimation for fractional-order chaotic systems is an important issue in fractional-order chaotic control and synchronization and could be essentially formulated as a multidimensional optimization problem. A novel algorithm called quantum parallel particle swarm optimization (QPPSO) is proposed to solve the parameter estimation for fractional-order chaotic systems. The parallel characteristic of quantum computing is used in QPPSO. This characteristic increases the calculation of each generation exponentially. The behavior of particles in quantum space is restrained by the quantum evolution equation, which consists of the current rotation angle, individual optimal quantum rotation angle, and global optimal quantum rotation angle. Numerical simulation based on several typical fractional-order systems and comparisons with some typical existing algorithms show the effectiveness and efficiency of the proposed algorithm.
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.
Catarinucci, Luca; Tarricone, Luciano
2009-01-01
The finite difference time domain method (FDTD) is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems and, among them, those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high space resolution adopted, so that strong memory and central processing unit power requirements have to be satisfied. To better afford the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly-efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
Parallel supercomputing: Advanced methods, algorithms and software for large-scale problems
Energy Technology Data Exchange (ETDEWEB)
Carey, G.F.; Young, D.M.
1992-04-01
Research has continued with excellent progress and new results on methodology and algorithms. We have also made supporting benchmark application studies on representative parallel computing architectures. Results from these research studies have been reported at scientific meetings, as technical reports and as journal publications. A list of pertinent presentations and publications is attached. The work on parallel element-by-element techniques and domain decomposition schemes has developed well. In particular, we have focused on the use of finite element spectral methods (or high p methods) on distributed massively parallel systems. The approach has been implemented in a prototype finite element program for solution of coupled Navier Stokes flow and transport processes. This class of problems is of fundamental interest and basic to many "grand challenge" type problems for which parallel supercomputing is warranted.
Jiang, Y.; Xing, H. L.
2016-12-01
Micro-seismic events induced by water injection, mining activity or oil/gas extraction are quite informative, and their interpretation can be applied to the reconstruction of underground stress and the monitoring of hydraulic fracturing progress in oil/gas reservoirs. The source characteristics and locations are crucial parameters required for these purposes, and they can be obtained through the waveform matching inversion (WMI) method. Therefore it is imperative to develop a WMI algorithm with high accuracy and convergence speed. Heuristic algorithms, as a category of nonlinear methods, possess a very high convergence speed and a good capacity to overcome local minima, and have been applied successfully in many areas (e.g. image processing, artificial intelligence). However, their effectiveness for micro-seismic WMI is still poorly investigated; very little literature exists that addresses this subject. In this research an advanced heuristic algorithm, the gravitational search algorithm (GSA), is proposed to estimate the focal mechanism (angles of strike, dip and rake) and source locations in three dimensions. Unlike traditional inversion methods, the heuristic algorithm inversion does not require the approximation of the Green's function. The method directly interacts with a CPU-parallelized finite difference forward modelling engine and updates the model parameters under GSA criteria. The effectiveness of this method is tested with synthetic data from a multi-layered elastic model; the results indicate that GSA can be well applied to WMI and has its unique advantages. Keywords: Micro-seismicity, Waveform matching inversion, gravitational search algorithm, parallel computation
Directory of Open Access Journals (Sweden)
Ivan Komarov
The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
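For readers unfamiliar with the serial baseline that this abstract accelerates, here is a minimal Python sketch of Gillespie's direct method for a single trajectory. It is the textbook serial algorithm, not the GPU variant of the paper; the function signature and the single-reaction decay model used as an example are assumptions chosen to keep the sketch short.

```python
import math
import random


def gillespie_direct(x0, stoich, propensities, t_end, rng):
    """Simulate one trajectory with Gillespie's direct method."""
    t, x = 0.0, list(x0)
    trajectory = [(t, tuple(x))]
    while t < t_end:
        rates = [a(x) for a in propensities]
        total = sum(rates)
        if total <= 0.0:
            break  # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate `total`.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Pick a reaction with probability proportional to its rate.
        threshold = rng.random() * total
        cumulative = 0.0
        for change, rate in zip(stoich, rates):
            cumulative += rate
            if threshold < cumulative:
                break
        x = [xi + d for xi, d in zip(x, change)]
        trajectory.append((t, tuple(x)))
    return trajectory


# Hypothetical example model: first-order decay A -> (nothing) at rate 0.5 per molecule.
trajectory = gillespie_direct(
    x0=[20],
    stoich=[[-1]],
    propensities=[lambda x: 0.5 * x[0]],
    t_end=1e6,
    rng=random.Random(0),
)
```

Every run of this loop yields one stochastic trajectory, which is why parameter sweeps need many independent runs and why generating trajectories concurrently is attractive.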
Energy Technology Data Exchange (ETDEWEB)
Carey, G.F.; Young, D.M.
1993-12-31
The program outlined here is directed to research on methods, algorithms, and software for distributed parallel supercomputers. Of particular interest are finite element methods and finite difference methods together with sparse iterative solution schemes for scientific and engineering computations of very large-scale systems. Both linear and nonlinear problems will be investigated. In the nonlinear case, applications with bifurcation to multiple solutions will be considered using continuation strategies. The parallelizable numerical methods of particular interest are a family of partitioning schemes embracing domain decomposition, element-by-element strategies, and multi-level techniques. The methods will be further developed incorporating parallel iterative solution algorithms with associated preconditioners in parallel computer software. The schemes will be implemented on distributed memory parallel architectures such as the CRAY MPP, Intel Paragon, the NCUBE3, and the Connection Machine. We will also consider other new architectures such as the Kendall-Square (KSQ) and proposed machines such as the TERA. The applications will focus on large-scale three-dimensional nonlinear flow and reservoir problems with strong convective transport contributions. These are legitimate grand challenge class computational fluid dynamics (CFD) problems of significant practical interest to DOE. The methods developed and algorithms will, however, be of wider interest.
Parameterized String Matching Algorithms with Application to ...
African Journals Online (AJOL)
In the parameterized string matching problem, a given pattern P is said to match with a sub-string t of the text T, if there exists a bijection from the symbols of P to the symbols of t. Salmela and Tarhio solve the parameterized string matching problem in sub-linear time by applying the concept of q-gram in the Horspool algorithm ...
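The matching condition defined in this abstract can be made concrete with a short Python sketch. It checks every text window by brute force, so it runs in time proportional to the text length times the pattern length; it is not the sub-linear q-gram Horspool method of Salmela and Tarhio, and the function names are assumptions for illustration.

```python
def p_match(pattern, window):
    """Return True if some bijection on symbols maps pattern onto window."""
    if len(pattern) != len(window):
        return False
    forward, backward = {}, {}
    for p, t in zip(pattern, window):
        # Each pattern symbol must map to one text symbol and vice versa.
        if forward.setdefault(p, t) != t or backward.setdefault(t, p) != p:
            return False
    return True


def p_search(pattern, text):
    """Return all start positions where pattern parameterized-matches text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_match(pattern, text[i:i + m])]


hits = p_search("abab", "xyxyzwzw")
```

In the example, "abab" matches "xyxy" under a→x, b→y and "zwzw" under a→z, b→w, while a window such as "zzxy" is rejected because two pattern symbols would map to the same text symbol.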
ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use
Directory of Open Access Journals (Sweden)
Kraj Piotr
2008-04-01
Background: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is to perform cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the size of the dataset being analyzed becomes very large. For example, clustering 10000 genes from an experiment containing 200 microarrays can be quite time consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers. Results: The software described in this paper is a high performance multithreaded application that implements a parallelized version of the K-means Clustering algorithm. Most parallel processing applications are not accessible to the general public and require specialized software libraries (e.g. MPI) and specialized hardware configurations. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Here we show our parallel implementation provides significant performance gains over a wide range of datasets using as little as seven nodes. The software was written in C# and was designed in a modular fashion to provide both deployment flexibility as well as flexibility in the user interface. Conclusion: ParaKMeans was designed to provide the general scientific community with an easy and manageable client-server application that can be installed on a wide variety of Windows operating systems.
Parallel Algorithm of Geometrical Hashing Based on NumPy Package and Processes Pool
Directory of Open Access Journals (Sweden)
Klyachin Vladimir Aleksandrovich
2015-10-01
The article considers the problem of multi-dimensional geometric hashing. The paper describes a mathematical model of geometric hashing and considers an example of its use in point localization problems. A method of constructing the corresponding hash matrix by a parallel algorithm is considered. An algorithm for parallel geometric hashing using the "process pool" design pattern is proposed. The algorithm is implemented in the Python programming language with the NumPy package for manipulating multidimensional data. To implement the process pool, the paper proposes using the class ProcessPoolExecutor imported from the module concurrent.futures, which has been included in the Python interpreter distribution since version 3.2. All the solutions are presented in the paper by corresponding UML class diagrams. The designed GeomNash package includes the classes Data, Result, GeomHash, and Job. The results of the developed program are presented in corresponding graphs. The article also presents a theoretical justification for using a process pool to implement parallel algorithms. The condition t2 > (p/(p-1))*t1 is obtained for the appropriateness of a process pool, where t1 is the time to transmit a unit of data between processes and t2 is the time for one processor to process a unit of data.
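A minimal sketch of the process-pool idea follows, under stated assumptions: the hash is taken to be a uniform grid of cells (one simple reading of geometric hashing for point localization), plain tuples stand in for NumPy arrays, and every function name is hypothetical rather than taken from the package. Only `ProcessPoolExecutor` and its `map` method come from the standard library as named in the abstract.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor


def pool_is_worthwhile(t1, t2, p):
    """The abstract's condition t2 > (p / (p - 1)) * t1 for p > 1 processes."""
    return p > 1 and t2 > (p / (p - 1)) * t1


def hash_chunk(job):
    """Map every point of one chunk to its integer grid cell."""
    start, points, cell = job
    table = defaultdict(list)
    for offset, point in enumerate(points):
        key = tuple(int(c // cell) for c in point)
        table[key].append(start + offset)
    return dict(table)


def merge(tables):
    """Combine per-chunk tables into one cell -> point-index table."""
    merged = defaultdict(list)
    for table in tables:
        for key, ids in table.items():
            merged[key].extend(ids)
    return dict(merged)


def build_hash(points, cell, workers=1, chunk=2):
    """Build the table serially, or with a process pool when workers > 1."""
    jobs = [(i, points[i:i + chunk], cell) for i in range(0, len(points), chunk)]
    if workers > 1:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return merge(pool.map(hash_chunk, jobs))
    return merge(map(hash_chunk, jobs))
```

On platforms that start worker processes by spawning, a call with `workers > 1` should sit under an `if __name__ == "__main__":` guard. The condition function shows why small chunks do not pay off: with p = 2, processing a unit of data must take more than twice as long as transmitting it.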
Directory of Open Access Journals (Sweden)
Helio Yochihiro Fuchigami
2014-08-01
This article addresses the problem of minimizing makespan on two parallel flow shops with proportional processing and setup times. The setup times are separated and sequence-independent. The parallel flow shop scheduling problem is a specific case of the well-known hybrid flow shop, characterized by a multistage production system with more than one machine working in parallel at each stage. This situation is very common in various kinds of companies, such as the chemical, electronics, automotive, pharmaceutical and food industries. This work proposes six Simulated Annealing algorithms, their perturbation schemes and an algorithm for initial sequence generation. The study can be classified as "applied research" in nature, "exploratory" in its objectives and "experimental" in its procedures, with a "quantitative" approach. The proposed algorithms were effective regarding the solution and computationally efficient. Results of Analysis of Variance (ANOVA) revealed no significant difference between the schemes in terms of makespan. The PS4 scheme, which moves a subsequence of jobs, is recommended because it provides the best percentage of success. It was also found that there is a significant difference between the results of the algorithms for each value of the proportionality factor of the processing and setup times of the flow shops.
Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.
Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias
2011-01-01
The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E.coli, Shigella and S.pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.
A nested decomposition algorithm for parallel computations of very large sparse systems
Directory of Open Access Journals (Sweden)
Šiljak D. D.
1995-01-01
In this paper we present a generalization of the balanced border block diagonal (BBD) decomposition algorithm, which was developed for the parallel computation of sparse systems of linear equations. The efficiency of the new procedure is substantially higher, and it extends the applicability of the BBD decomposition to extremely large problems. Examples of the decomposition are provided for matrices as large as 250,000 × 250,000, and its performance is compared to other sparse decompositions. Applications to the parallel solution of sparse systems are discussed for a variety of engineering problems.
A Discretization Algorithm for Meteorological Data and its Parallelization Based on Hadoop
Liu, Chao; Jin, Wen; Yu, Yuting; Qiu, Taorong; Bai, Xiaoming; Zou, Shuilong
2017-10-01
Meteorological observation data are large in volume, have many attributes whose values are continuous, and are used in applications that depend on the correlations between elements. This paper addresses how to better discretize large meteorological data so that the knowledge hidden in the data can be mined more effectively, and studies improvements to discretization algorithms for large-scale data. To give subsequent knowledge extraction a sound basis, a discretization algorithm based on information entropy and the inconsistency of meteorological attributes is proposed, and the algorithm is parallelized on the Hadoop platform. Finally, a comparison test validates the effectiveness of the proposed algorithm for discretizing large meteorological data.
HPC-NMF: A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization
Energy Technology Data Exchange (ETDEWEB)
2016-08-22
NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for $W$ and $H$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: it performs well for both dense and sparse matrices, and allows the user to choose any one of the multiple algorithms for solving the updates to the low rank factors $W$ and $H$ within the alternating iterations.
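For orientation, the alternating scheme mentioned above can be written out in the standard NMF notation. The symbols $A$ (the $m \times n$ data matrix) and $k$ (the target rank) are assumptions here, since the abstract does not define them; $W$ is $m \times k$ and $H$ is $k \times n$.

```latex
% NMF objective: approximate A by a product of non-negative factors
\min_{W \ge 0,\; H \ge 0} \; \lVert A - W H \rVert_F^2

% One alternating iteration: fix one factor, solve an NLS problem for the other
H \leftarrow \arg\min_{H \ge 0} \lVert A - W H \rVert_F^2 ,
\qquad
W \leftarrow \arg\min_{W \ge 0} \lVert A - W H \rVert_F^2
```

Each subproblem is convex in the factor being updated even though the joint problem is not, which is what makes the choice of NLS solver interchangeable within the alternating iterations.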
Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy.
Directory of Open Access Journals (Sweden)
Yuling Tian
For the purposes of information retrieval, users must find highly relevant documents from within a system (often a quite large one comprised of many individual documents) based on an input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor and a hot research topic; there already exist several rank-learning methods based on machine learning techniques which can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others with respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm indeed effectively and rapidly identifies optimal ranking functions.
Förster, Michael
2014-01-01
Numerical programs often use parallel programming techniques such as OpenMP to compute the program's output values as efficiently as possible. In addition, derivative values of these output values with respect to certain input values play a crucial role. To achieve code that computes not only the output values simultaneously but also the derivative values, this work introduces several source-to-source transformation rules. These rules are based on a technique called algorithmic differentiation. The main focus of this work lies on the important reverse mode of algorithmic differentiation. The inh
Zhou, Pu; Ma, Yanxing; Wang, Xiaolin; Ma, Haotong; Xu, Xiaojun; Liu, Zejin
2009-10-01
Multitone radiation is a promising technique to mitigate stimulated Brillouin scattering effects in narrow-linewidth fiber amplifiers. We demonstrate coherent beam combination of three two-tone fiber amplifiers using a stochastic parallel gradient descent (SPGD) algorithm. Phase control of the fiber amplifiers is performed by running the SPGD algorithm on a digital signal processor. The contrast of the far-field intensity pattern of the coherently combined beam is more than 85%. Experimental results validate that a single-frequency seed laser is not indispensable for coherent beam combination in a master oscillator power amplifier configuration.
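The SPGD control loop can be illustrated with a toy simulation. This sketch assumes three equal-amplitude beams with fixed piston phase errors and uses the normalized on-axis intensity as the metric; the gain, perturbation amplitude and iteration count are illustrative choices, not values from the experiment.

```python
import cmath
import random


def combined_intensity(phases):
    """Normalized on-axis intensity of N equal-amplitude beams (1.0 = in phase)."""
    field = sum(cmath.exp(1j * p) for p in phases)
    return abs(field) ** 2 / len(phases) ** 2


def spgd(piston_errors, gain=10.0, amp=0.1, iters=2000, seed=0):
    """Maximize the metric by two-sided stochastic parallel perturbations."""
    rng = random.Random(seed)
    u = [0.0] * len(piston_errors)  # control phases applied to each channel

    def metric(ctrl):
        return combined_intensity([e + c for e, c in zip(piston_errors, ctrl)])

    for _ in range(iters):
        # Perturb all channels at once with random signs (the "parallel" part).
        delta = [rng.choice((-amp, amp)) for _ in u]
        j_plus = metric([c + d for c, d in zip(u, delta)])
        j_minus = metric([c - d for c, d in zip(u, delta)])
        # Move each channel along its own perturbation, scaled by the metric change.
        dj = j_plus - j_minus
        u = [c + gain * dj * d for c, d in zip(u, delta)]
    return u, metric(u)
```

Only the scalar metric is measured, never the individual phases, which is why the method suits a hardware loop in which a single detector signal feeds a digital signal processor.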
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms
Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel
2016-04-01
Advances in graphics processing unit technology towards parallel architectures [1], comprising thousands of cores and many parallel threads, provide the hardware foundation for the rapid processing of various parallel applications in seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are established and decades of data are compiled together [2]. Yet, many processes in seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, being independent of one another, can be performed in parallel, reducing processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using CUDA C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel for comparison, are: fuzzy k-means clustering with expert knowledge [7] in assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, CUDA C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufmann Publishers, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws
Cooke, Daniel; Rushton, Nelson
2013-01-01
With the introduction of new parallel architectures like the Cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for high-end computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language, that is, a programming language that is closer to a human's way of thinking than to a machine's. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequential/single-core code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify-Produce (CSP) and Normalize-Transpose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less
PARALLEL ALGORITHM FOR THREE-DIMENSIONAL STOKES FLOW SIMULATION USING BOUNDARY ELEMENT METHOD
Directory of Open Access Journals (Sweden)
D. G. Pribytok
2016-01-01
A parallel computing technique for modeling three-dimensional viscous flow (Stokes flow) using the direct boundary element method is presented. The problem is solved in three phases: sampling and construction of a system of linear algebraic equations (SLAE), its solution, and finding the velocity of the liquid at predetermined points. For construction of the system and finding the velocity, parallel algorithms using CUDA graphics card programming technology have been developed and implemented. To solve the system of linear algebraic equations, implemented software libraries are used. A comparison of time consumption for the three main algorithms is performed on the example of calculating viscous fluid motion in a three-dimensional cavity.
Loring, B.; Karimabadi, H.; Rortershteyn, V.
2015-10-01
The surface line integral convolution (LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort-last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
Energy Technology Data Exchange (ETDEWEB)
Loring, Burlen; Karimabadi, Homa; Rortershteyn, Vadim
2014-07-01
The surface line integral convolution (LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort-last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
Ozmutlu, H. Cenk
2014-01-01
We developed mixed integer programming (MIP) models and hybrid genetic-local search algorithms for the scheduling problem of unrelated parallel machines with job sequence- and machine-dependent setup times and with the job splitting property. The first contribution of this paper is to introduce novel algorithms which perform splitting and scheduling simultaneously with a variable number of subjobs. We propose a simple chromosome structure constituted by random key numbers in a hybrid genetic-local search algorithm (GAspLA). Random key numbers are used frequently in genetic algorithms, but they create additional difficulty when hybrid factors in local search are implemented. We developed algorithms that adapt the results of local search into the genetic algorithm with a minimum number of relocation operations on the genes' random key numbers. This is the second contribution of the paper. The third contribution is three new MIP models which perform splitting and scheduling simultaneously. The fourth contribution is the implementation of GAspLAMIP, which lets us verify the optimality of GAspLA for the studied combinations. The proposed methods are tested on a set of problems taken from the literature and the results validate the effectiveness of the proposed algorithms. PMID:24977204
Lü, Haibin; Zhou, Pu; Wang, Xiaolin; Jiang, Zongfu
2013-04-20
A new approach for the complete modal decomposition of the optical fields emerging from a multimode fiber is presented in this paper. Based on the stochastic parallel gradient descent algorithm, mode coefficients for all the bound modes in the multimode fiber can be exactly calculated by utilizing one intensity profile of the beam. Numerical simulation validates the feasibility, and the reconstruction error is below 0.1%. In the case of six modes within the fiber, the running time is about 2 s.
Service Composition Optimization Method Based on Parallel Particle Swarm Algorithm on Spark
Directory of Open Access Journals (Sweden)
Xing Guo
2017-01-01
Web service composition is one of the core technologies for realizing service-oriented computing. Web service composition satisfies the requirements of users to form new value-added services by composing existing services. As Cloud Computing develops, the emergence of Web services with different quality yet similar functionality has brought new challenges to the service composition optimization problem. How to solve large-scale service composition in the Cloud Computing environment has become an urgent problem. To tackle this issue, this paper proposes a parallel optimization approach based on the Spark distributed environment. Firstly, the parallel covering algorithm is used to cluster the Web services. Next, the multiple clustering centers obtained are used as the starting points of the particles to improve the diversity of the initial population. Then, according to the parallel data coding rules of the resilient distributed dataset (RDD), the large-scale combination service is generated with the proposed algorithm, named Spark Particle Swarm Optimization Algorithm (SPSO). Finally, a particle elite selection strategy removes the inert particles to optimize the performance of the combination of service selection. This paper adopts the real data set WS-Dream to prove the validity of the proposed method with a large number of experimental results.
Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms
Fidel, Adam
2014-05-01
Motion planning, which is the problem of computing feasible paths in an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, as planning time depends on region characteristics and, for most problems, the heterogeneity of the subproblems increases as the number of processors increases. In this work, we introduce two techniques to address load imbalance in the parallelization of sampling-based motion planning algorithms: an adaptive work stealing approach and bulk-synchronous redistribution. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on more than 3,000 cores. © 2014 IEEE.
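Why work stealing helps with load imbalance can be seen in a toy, single-threaded simulation. This is a generic sketch with unit-cost tasks and a greedy "steal from the fullest queue" rule; it is not the adaptive policy of the paper, and the function name `simulate` is hypothetical.

```python
from collections import deque


def simulate(task_lists, steal=True):
    """Return the number of time steps until all unit-cost tasks are done."""
    queues = [deque(tasks) for tasks in task_lists]
    steps = 0
    while any(queues):
        if steal:
            for q in queues:
                if not q:
                    # An idle worker takes the oldest task of the fullest queue.
                    victim = max(queues, key=len)
                    if len(victim) > 1:
                        q.append(victim.popleft())
        for q in queues:
            if q:
                q.pop()  # each busy worker completes one task per step
        steps += 1
    return steps
```

With all eight tasks initially on one of four workers, `simulate([[1] * 8, [], [], []])` finishes in 2 steps, whereas the same call with `steal=False` needs 8, since the three idle workers never contribute.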
ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches.
Rognes, T
2001-04-01
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith-Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign were found to be as good as those of Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
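The first stage described above, the optimal ungapped score on every diagonal of the alignment matrix, can be sketched in scalar Python. The match and mismatch scores are placeholder assumptions (a real search would use a substitution matrix), and the loop over diagonals is where ParAlign applies SIMD parallelism; the function name is hypothetical.

```python
def diagonal_scores(query, target, match=2, mismatch=-1):
    """Best ungapped local alignment score on each diagonal d = j - i."""
    scores = {}
    for d in range(-(len(query) - 1), len(target)):
        best = run = 0
        i = max(0, -d)
        j = i + d
        while i < len(query) and j < len(target):
            s = match if query[i] == target[j] else mismatch
            run = max(0, run + s)   # restart when the running score drops below 0
            best = max(best, run)
            i += 1
            j += 1
        scores[d] = best
    return scores
```

Each diagonal is independent of the others, so the scores can be computed in lock-step across vector lanes; the per-diagonal results are then the inputs that a gapped-score heuristic would combine.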
Parallel algorithm for dominant points correspondences in robot binocular stereo vision
Al-Tammami, A.; Singh, B.
1993-01-01
This paper presents an algorithm to find the correspondences of points representing dominant features in robot stereo vision. The algorithm consists of two main steps: dominant point extraction and dominant point matching. In the feature extraction phase, the algorithm utilizes the widely used Moravec Interest Operator and two other operators: the Prewitt Operator and a new operator called the Gradient Angle Variance Operator. The Interest Operator in the Moravec algorithm was used to exclude featureless areas and simple edges which are oriented in the vertical, horizontal, and two diagonal directions. It was incorrectly detecting points on edges which are not in the four main directions (vertical, horizontal, and two diagonals). The new algorithm uses the Prewitt operator to exclude featureless areas, so that the Interest Operator is applied only on the edges to exclude simple edges and to leave interesting points. This modification speeds up the extraction process by approximately 5 times. The Gradient Angle Variance (GAV), an operator which calculates the variance of the gradient angle in a window around the point under consideration, is then applied to the interesting points to exclude the redundant ones and leave the actual dominant ones. The matching phase is performed after the extraction of the dominant points in both stereo images. The matching starts with dominant points in the left image and does a local search, looking for corresponding dominant points in the right image. The search is geometrically constrained by the epipolar line of the parallel-axes stereo geometry and the maximum disparity of the application environment. If one dominant point in the right image lies in the search area, then it is the corresponding point of the reference dominant point in the left image. A parameter provided by the GAV is thresholded and used as a rough similarity measure to select the corresponding dominant point if there is more than one point in the search area. The correlation is used as
Optimized simulations of Olami-Feder-Christensen systems using parallel algorithms
Dominguez, Rachele; Necaise, Rance; Montag, Eric
The sequential nature of the Olami-Feder-Christensen (OFC) model for earthquake simulations limits the benefits of parallel computing approaches because of the frequent communication required between processors. We developed a parallel version of the OFC algorithm for multi-core processors. Our data, even for relatively small system sizes and low numbers of processors, indicate that increasing the number of processors provides significantly faster simulations, producing more efficient results than previous attempts that used network-based Beowulf clusters. Our algorithm optimizes performance by exploiting the multi-core processor architecture, minimizing communication time in contrast to the networked Beowulf-cluster approaches. Our multi-core algorithm is the basis for a new algorithm using GPUs that will drastically increase the number of processors available. Previous studies incorporating realistic structural features of faults into OFC models have revealed spatial and temporal patterns observed in real earthquake systems. The computational advances presented here will allow for studying interacting networks of faults, rather than individual faults, further enhancing our understanding of the relationship between the earth's structure and the triggering process. Support for this project comes from the Chenery Research Fund, the Rashkind Family Endowment, the Walter Williams Craigie Teaching Endowment, and the Schapiro Undergraduate Research Fellowship.
A Parallel Decoding Algorithm for Short Polar Codes Based on Error Checking and Correcting
Directory of Open Access Journals (Sweden)
Yingxian Zhang
2014-01-01
We propose a parallel decoding algorithm based on error checking and correcting to improve the performance of short polar codes. In order to enhance the error-correcting capacity of the decoding algorithm, we first derive the error-checking equations generated on the basis of the frozen nodes, and then we introduce a method to check the errors in the input nodes of the decoder using the solutions of these equations. In order to further correct those checked errors, we adopt the method of modifying the probability messages of the error nodes with constant values according to the maximization principle. Due to the existence of multiple solutions of the error-checking equations, we formulate a CRC-aided optimization problem of finding the optimal solution with three different target functions, so as to improve the accuracy of error checking. In addition, in order to increase the throughput of decoding, we use a parallel method based on the decoding tree to calculate the probability messages of all the nodes in the decoder. Numerical results show that the proposed decoding algorithm achieves better performance than some existing decoding algorithms with the same code length.
Directory of Open Access Journals (Sweden)
Ion LUNGU
2012-01-01
In this paper, we research, analyze and develop optimization solutions for the parallel reduction function using graphics processing units (GPUs) that implement the Compute Unified Device Architecture (CUDA), a modern and novel approach for improving the software performance of data processing applications and algorithms. Many of these applications and algorithms make use of the reduction function in their computational steps. After having designed the function and its algorithmic steps in CUDA, we have progressively developed and implemented optimization solutions for the reduction function. In order to confirm, test and evaluate the solutions' efficiency, we have developed a custom tailored benchmark suite. We have analyzed the obtained experimental results regarding: the comparison of the execution time and bandwidth when using graphics processing units covering the main CUDA architectures (Tesla GT200, Fermi GF100, Kepler GK104) and a central processing unit; the data type influence; the binary operator's influence.
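The access pattern behind a parallel reduction can be shown on the CPU. The sketch below is plain Python rather than CUDA and is not the authors' kernel: it only mirrors the usual tree scheme in which, at every level, element i is combined with the element half an array away, so that all combinations within one level are independent of each other.

```python
import operator


def tree_reduce(values, op):
    """Pairwise tree reduction; op must be associative and commutative."""
    data = list(values)
    n = len(data)
    if n == 0:
        raise ValueError("cannot reduce an empty sequence")
    while n > 1:
        half = (n + 1) // 2
        # On a GPU, each iteration of this loop would be one thread.
        for i in range(n // 2):
            data[i] = op(data[i], data[i + half])
        n = half  # the active range shrinks by about half per level
    return data[0]
```

For n elements the loop runs about log2(n) levels, compared with n - 1 sequential steps for a running total. `tree_reduce(range(1, 11), operator.add)` gives 55, and passing `max` or `min` as the operator works the same way, which is the kind of binary operator whose influence the benchmark examines.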
Chen, R. J.; Wang, M.; Yan, X. L.; Yang, Q.; Lam, Y. H.; Yang, L.; Zhang, Y. H.
2017-12-01
The periodic signals tracking algorithm has been used to determine the revolution times of ions stored in storage rings in isochronous mass spectrometry (IMS) experiments. It has been a challenge to perform real-time data analysis by using the periodic signals tracking algorithm in the IMS experiments. In this paper, a parallelization scheme of the periodic signals tracking algorithm is introduced and a new program is developed. The computing time of data analysis can be reduced by a factor of ∼71 and of ∼346 by using our new program on a Tesla C1060 GPU and a Tesla K20c GPU, respectively, compared to using the old program on a Xeon E5540 CPU. We succeed in performing real-time data analysis for the IMS experiments by using the new program on the Tesla K20c GPU.
Directory of Open Access Journals (Sweden)
Chunfeng Liu
2013-01-01
The paper presents a novel hybrid genetic algorithm (HGA) for a deterministic scheduling problem where multiple jobs with arbitrary precedence constraints are processed on multiple unrelated parallel machines. The objective is to minimize total tardiness, since delays of the jobs may lead to punishment cost or cancellation of orders by the clients in many situations. A priority rule-based heuristic algorithm, which schedules a prior job on a prior machine according to the priority rule at each iteration, is suggested and embedded in the HGA for initial feasible schedules that can be improved in further stages. Computational experiments are conducted to show that the proposed HGA performs well with respect to accuracy and efficiency of solution for small-sized problems and gets better results than the conventional genetic algorithm within the same runtime for large-sized problems.
Parallel 3D-TLM algorithm for simulation of the Earth-ionosphere cavity
Toledo-Redondo, Sergio; Salinas, Alfonso; Morente-Molinera, Juan Antonio; Méndez, Antonio; Fornieles, Jesús; Portí, Jorge; Morente, Juan Antonio
2013-03-01
A parallel 3D algorithm for solving time-domain electromagnetic problems with arbitrary geometries is presented. The technique employed is the Transmission Line Modeling (TLM) method implemented in Shared Memory (SM) environments. The benchmarking performed reveals that the maximum speedup depends on the memory size of the problem as well as multiple hardware factors, like the disposition of CPUs, cache, or memory. A maximum speedup of 15 has been measured for the largest problem. In certain circumstances of low memory requirements, superlinear speedup is achieved using our algorithm. The model is employed to model the Earth-ionosphere cavity, thus enabling a study of the natural electromagnetic phenomena that occur in it. The algorithm allows complete 3D simulations of the cavity with a resolution of 10 km, within a reasonable timescale.
Directory of Open Access Journals (Sweden)
Zhang Xuejun
2015-04-01
Full Text Available The continuous growth of air traffic has led to acute airspace congestion and severe delays, which threaten operation safety and cause enormous economic loss. Flight assignment is an economical and effective strategic plan to reduce the flight delay and airspace congestion by reasonably regulating the air traffic flow of China. However, it is a large-scale combinatorial optimization problem which is difficult to solve. In order to improve the quality of solutions, an effective multi-objective parallel evolution algorithm (MPEA) framework with dynamic migration interval strategy is presented in this work. Firstly, multiple evolution populations are constructed to solve the problem simultaneously to enhance the optimization capability. Then a new strategy is proposed to dynamically change the migration interval among different evolution populations to improve the efficiency of the cooperation of populations. Finally, the cooperative co-evolution (CC) algorithm combined with non-dominated sorting genetic algorithm II (NSGA-II) is introduced for each population. Empirical studies using the real air traffic data of the Chinese air route network and daily flight plans show that our method outperforms the existing approaches, multi-objective genetic algorithm (MOGA), multi-objective evolutionary algorithm based on decomposition (MOEA/D), and CC-based multi-objective algorithm (CCMA), as well as two other MPEAs with different migration interval strategies.
Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan
2016-01-01
A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network's initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data.
A Parallel Multiclassification Algorithm for Big Data Using an Extreme Learning Machine.
Duan, Mingxing; Li, Kenli; Liao, Xiangke; Li, Keqin
2017-04-24
As data sets become larger and more complicated, an extreme learning machine (ELM) that runs in a traditional serial environment cannot realize its ability to be fast and effective. Although a parallel ELM (PELM) based on MapReduce to process large-scale data shows more efficient learning speed than identical ELM algorithms in a serial environment, some operations, such as intermediate results stored on disks and multiple copies for each task, are indispensable, and these operations create a large amount of extra overhead and degrade the learning speed and efficiency of the PELMs. In this paper, an efficient ELM based on the Spark framework (SELM), which includes three parallel subalgorithms, is proposed for big data classification. By partitioning the corresponding data sets reasonably, the hidden layer output matrix calculation algorithm, matrix Û decomposition algorithm, and matrix V decomposition algorithm perform most of the computations locally. At the same time, they retain the intermediate results in distributed memory and cache the diagonal matrix as broadcast variables instead of several copies for each task to reduce a large amount of the costs, and these actions strengthen the learning ability of the SELM. Finally, we implement our SELM algorithm to classify large data sets. Extensive experiments have been conducted to validate the effectiveness of the proposed algorithms. As shown, our SELM achieves an 8.71x speedup on a cluster with ten nodes, and reaches a 13.79x speedup with 15 nodes, an 18.74x speedup with 20 nodes, a 23.79x speedup with 25 nodes, a 28.89x speedup with 30 nodes, and a 33.81x speedup with 35 nodes.
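The speedups quoted above can be turned into parallel efficiencies with one line of arithmetic. The sketch below assumes efficiency is defined as speedup divided by node count with a single-node run as the baseline; the abstract does not state the baseline, so the derived values are illustrative only.

```python
# Speedups reported in the abstract, keyed by cluster size (number of nodes).
reported_speedup = {10: 8.71, 15: 13.79, 20: 18.74, 25: 23.79, 30: 28.89, 35: 33.81}


def parallel_efficiency(speedup, nodes):
    """Efficiency = speedup / nodes; 1.0 would be ideal linear scaling (assumed definition)."""
    return speedup / nodes


# Efficiency per cluster size, rounded to three decimals for readability.
efficiency = {
    nodes: round(parallel_efficiency(speedup, nodes), 3)
    for nodes, speedup in reported_speedup.items()
}
```

Under that assumption the efficiency rises slightly with cluster size, from about 0.87 on 10 nodes to about 0.97 on 35 nodes.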
Katouda, Michio; Nakajima, Takahito
2013-12-10
A new algorithm for massively parallel calculations of electron correlation energy of large molecules based on the resolution of identity second-order Møller-Plesset perturbation (RI-MP2) technique is developed and implemented into the quantum chemistry software NTChem. In this algorithm, a Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) hybrid parallel programming model is applied to attain efficient parallel performance on massively parallel supercomputers. An in-core storage scheme of intermediate data of three-center electron repulsion integrals utilizing the distributed memory is developed to eliminate input/output (I/O) overhead. The parallel performance of the algorithm is tested on massively parallel supercomputers such as the K computer (using up to 45 992 central processing unit (CPU) cores) and a commodity Intel Xeon cluster (using up to 8192 CPU cores). The parallel RI-MP2/cc-pVTZ calculation of two-layer nanographene sheets (C150H30)2 (number of atomic orbitals is 9640) is performed using 8991 nodes and 71 288 CPU cores of the K computer.
A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC
Directory of Open Access Journals (Sweden)
Yun-gang Xue
2017-01-01
Full Text Available We propose a highly parallel and scalable motion estimation algorithm, named multilevel resolution motion estimation (MLRME for short), by combining the advantages of local full search and downsampling. By subsampling a video frame, a large amount of computation is saved. By using the local full-search method, it can exploit massive parallelism and make full use of powerful modern many-core accelerators, such as GPU and Intel Xeon Phi. We integrated the proposed MLRME into HM12.0, and the experimental results showed that the encoding quality of the MLRME method is close to that of the fast motion estimation in HEVC, declining by less than 1.5%. We also implemented the MLRME with CUDA, which obtained a 30–60x speed-up compared to the serial algorithm on a single CPU. Specifically, the parallel implementation of MLRME on a GTX 460 GPU can meet the real-time coding requirement with about 25 fps for the 2560×1600 video format, while, for 832×480, the performance is more than 100 fps.
An Intrinsic Algorithm for Parallel Poisson Disk Sampling on Arbitrary Surfaces.
Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying
2013-03-08
Poisson disk sampling plays an important role in a variety of visual computing applications, due to its useful statistical property in distribution and the absence of aliasing artifacts. While many effective techniques have been proposed to generate Poisson disk distribution in Euclidean space, relatively little work has been reported on the surface counterpart. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. We propose a new technique for parallelizing the dart throwing. Rather than the conventional approaches that explicitly partition the spatial domain to generate the samples in parallel, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. It is worth noting that our algorithm is accurate as the generated Poisson disks are uniformly and randomly distributed without bias. Our method is intrinsic in that all the computations are based on the intrinsic metric and are independent of the embedding space. This intrinsic feature allows us to generate Poisson disk distributions on arbitrary surfaces. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
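The priority-based conflict resolution described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: it works on points in the Euclidean plane instead of using the intrinsic surface metric, it simulates the parallel rounds sequentially, and the function name `priority_dart_throwing` is hypothetical.

```python
import math
import random


def priority_dart_throwing(candidates, radius, seed=0):
    """Select a maximal subset of 2D points with pairwise distance >= radius.

    Every candidate gets a random, unique priority. In each round a candidate
    is accepted if no still-active candidate within `radius` has a higher
    priority. The per-candidate checks within a round are independent of each
    other, so they could be assigned to separate threads.
    """
    rng = random.Random(seed)
    order = list(range(len(candidates)))
    rng.shuffle(order)
    priority = {idx: rank for rank, idx in enumerate(order)}  # unique priorities

    def conflict(i, j):
        return math.dist(candidates[i], candidates[j]) < radius

    active = set(range(len(candidates)))
    accepted = []
    while active:
        # Phase 1: candidates that beat all active neighbours are accepted.
        winners = [
            i for i in active
            if all(priority[i] > priority[j]
                   for j in active if j != i and conflict(i, j))
        ]
        accepted.extend(winners)
        # Phase 2: remove the winners and every candidate they cover.
        active = {
            i for i in active
            if i not in winners and not any(conflict(i, w) for w in winners)
        }
    return [candidates[i] for i in sorted(accepted)]
```

The loop always terminates because the active candidate with the highest priority wins its round, and two winners can never conflict because each would need a higher priority than the other.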
Karthik, Victor U.; Sivasuthan, Sivamayam; Hoole, Samuel Ratnajeevan H.
2014-02-01
The computational algorithms for device synthesis and nondestructive evaluation (NDE) are often the same. In both we have a goal - a particular field configuration yielding the design performance in synthesis or to match exterior measurements in NDE. The geometry of the design or the postulated interior defect is then computed. Several optimization methods are available for this. The most efficient like conjugate gradients are very complex to program for the required derivative information. The least efficient zeroth order algorithms like the genetic algorithm take much computational time but little programming effort. This paper reports launching a Genetic Algorithm kernel on thousands of compute unified device architecture (CUDA) threads exploiting the NVIDIA graphics processing unit (GPU) architecture. The efficiency of parallelization, although below that on shared memory supercomputer architectures, is quite effective in cutting down solution time into the realm of the practicable. We carry this further into multi-physics electro-heat problems where the parameters of description are in the electrical problem and the object function in the thermal problem. Indeed, this is where the derivative of the object function in the heat problem with respect to the parameters in the electrical problem is the most difficult to compute for gradient methods, and where the genetic algorithm is most easily implemented.
Liu, Kuojuey Ray
1990-01-01
Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in detail. Finally, we propose the use of multiphase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on a triangular array and another based on a rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multiphase operations. Performance issues are also considered.
Energy Technology Data Exchange (ETDEWEB)
Azmy, Yousry
2014-06-10
We employ the Integral Transport Matrix Method (ITMM) as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells' fluxes and between the cells' and boundary surfaces' fluxes. The main goals of this work are to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and parallel performance of the developed methods with increasing number of processes, P. The fastest observed parallel solution method, Parallel Gauss-Seidel (PGS), was used in a weak scaling comparison with the PARTISN transport code, which uses the source iteration (SI) scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method, even without acceleration/preconditioning, is competitive for optically thick problems as P is increased to the tens of thousands range. For the most optically thick cells tested, PGS reduced execution time by an approximate factor of three for problems with more than 130 million computational cells on P = 32,768. Moreover, the SI-DSA execution time's trend rises generally more steeply with increasing P than the PGS trend. Furthermore, the PGS method outperforms SI for the periodic heterogeneous layers (PHL) configuration problems. The PGS method outperforms SI and SI-DSA on as few as P = 16 for PHL problems and reduces execution time by a factor of ten or more for all problems considered with more than 2 million computational cells on P = 4,096.
Optimizing ion channel models using a parallel genetic algorithm on graphical processors.
Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon
2012-01-01
We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80 node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (Ordinary Differential Equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data intensive applications requiring iterative solutions of ODEs.
Energy Technology Data Exchange (ETDEWEB)
Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarria-Miranda, Daniel
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
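For readers unfamiliar with the quantity being computed, the following is a minimal sequential sketch of betweenness centrality for an unweighted graph using Brandes' algorithm. Brandes' algorithm is swapped in here as a well-known baseline; the abstract does not name it, and this sketch is not the lock-free parallel variant it describes. The adjacency-dictionary input format is an assumption.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted graph given as {vertex: [neighbours]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:  # each source vertex is an independent unit of work
        sigma = {v: 0 for v in adj}   # number of shortest paths from s to v
        dist = {v: -1 for v in adj}   # -1 marks "not reached yet"
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

Scores count ordered source-target pairs, so for an undirected graph stored with both edge directions each value is twice the usual undirected score.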
Zheng, Yan
2015-03-01
Internet of things (IoT), focusing on providing users with information exchange and intelligent control, has attracted a lot of attention from researchers all over the world since the beginning of this century. IoT consists of a large number of sensor nodes and data processing units, and its most important features can be summarized as energy confinement, efficient communication and high redundancy. As the number of sensor nodes grows, the communication efficiency and the available communication bandwidth become bottlenecks. Much existing research is based on cases in which the number of joins is small. However, this is not suited to the growing number of multi-join queries across the whole Internet of things. To improve the communication efficiency between parallel units in the distributed sensor network, this paper proposes a parallel query optimization algorithm based on a distribution attributes cost graph. The storage information relations and the network communication cost are considered in this algorithm, and an optimized information changing rule is established. The experimental results show that the algorithm performs well and effectively uses the resources of each node in the distributed sensor network. Therefore, the execution efficiency of multi-join queries between different nodes can be improved.
A New Parallel Approach for Accelerating the GPU-Based Execution of Edge Detection Algorithms.
Emrani, Zahra; Bateni, Soroosh; Rabbani, Hossein
2017-01-01
Real-time image processing is used in a wide variety of applications like those in medical care and industrial processes. This technique in medical care has the ability to display important patient information graphically, which can supplement and help the treatment process. Medical decisions made based on real-time images are more accurate and reliable. According to recent research, graphics processing unit (GPU) programming is a useful method for improving the speed and quality of medical image processing and is one of the ways of real-time image processing. Edge detection is an early stage in most of the image processing methods for the extraction of features and object segments from a raw image. The Canny method, Sobel and Prewitt filters, and the Roberts' Cross technique are some examples of edge detection algorithms that are widely used in image processing and machine vision. In this work, these algorithms are implemented using the Compute Unified Device Architecture (CUDA), Open Source Computer Vision (OpenCV), and Matrix Laboratory (MATLAB) platforms. An existing parallel method for the Canny approach has been modified further to run in a fully parallel manner. This has been achieved by replacing the breadth-first search procedure with a parallel method. These algorithms have been compared by testing them on a database of optical coherence tomography images. The comparison of results shows that the proposed implementation of the Canny method on GPU using the CUDA platform improves the speed of execution by 2-100× compared to the central processing unit-based implementation using the OpenCV and MATLAB platforms.
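Of the edge detectors named above, the Sobel filter is the simplest to write down. The sketch below is a plain-Python illustration, not the CUDA, OpenCV or MATLAB code from the paper; the function name `sobel_magnitude`, the list-of-rows image format and the choice to leave border pixels at zero are assumptions.

```python
def sobel_magnitude(image):
    """Gradient magnitude of a 2D grayscale image given as a list of equal-length rows.

    Border pixels are left at 0. Each output pixel depends only on its 3x3
    input neighbourhood, so all pixels can be computed independently.
    """
    gx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal derivative kernel
    gy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical derivative kernel
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            sx = sy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    sx += gx[dy][dx] * pixel
                    sy += gy[dy][dx] * pixel
            out[y][x] = (sx * sx + sy * sy) ** 0.5
    return out
```

The two outer loops are the part a GPU implementation would typically map to one thread per pixel.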
Make life simple: unleash the full power of the parallel tempering algorithm.
Bittner, Elmar; Nußbaumer, Andreas; Janke, Wolfhard
2008-09-26
We introduce a new update scheme to systematically improve the efficiency of parallel tempering simulations. We show that, by adapting the number of sweeps between replica exchanges to the canonical autocorrelation time, the average round-trip time of a replica in temperature space can be significantly decreased. The temperatures are not dynamically adjusted as in previous attempts but chosen to yield a 50% exchange rate of adjacent replicas. We illustrate the new algorithm with results for the Ising model in two and the Edwards-Anderson Ising spin glass in three dimensions.
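The replica exchange step that parallel tempering is built on can be sketched as below. This shows only the standard Metropolis swap criterion between neighbouring temperatures; the adaptive choice of the number of sweeps between exchanges, which is the contribution of the abstract, is not implemented. The function names and the representation of replicas by their energies alone are assumptions.

```python
import math
import random


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Metropolis acceptance probability for exchanging two replicas.

    Uses min(1, exp((beta_i - beta_j) * (energy_i - energy_j))).
    """
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)


def attempt_swaps(betas, energies, rng=random):
    """Try to exchange each pair of neighbouring replicas once, in place.

    `energies[k]` is the energy of the configuration currently held at
    inverse temperature `betas[k]`. Only energies are tracked here; a full
    simulation would swap the configurations as well.
    Returns the number of accepted swaps.
    """
    accepted = 0
    for k in range(len(betas) - 1):
        p = swap_probability(betas[k], betas[k + 1], energies[k], energies[k + 1])
        if rng.random() < p:
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
            accepted += 1
    return accepted
```

In a complete simulation, dividing the accepted count by the number of attempts gives the exchange rate, which the abstract says is tuned to 50% for adjacent replicas through the choice of temperatures.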
Ma, Haotong; Liu, Zejin; Xu, Xiaojun; Wang, Sanhong; Liu, Changhai
2010-09-01
We demonstrate the adaptive generation of a near-diffraction-limited flattop laser beam in the near field based on the stochastic parallel gradient descent algorithm and dual-phase-only liquid crystal spatial light modulators (LC-SLMs). One LC-SLM redistributes the intensity, and the other compensates the wavefront of the output beam. The experimental results show that approximately 69% of the power is enclosed in a region with less than 6% rms intensity variation. The 5 mm diameter near-diffraction-limited output beam retains a flattop intensity distribution without significant diffraction peaks for a working distance of more than 30 cm.
Wang, Xiaolong; Jiang, Aipeng; Jiangzhou, Shu; Li, Ping
2014-01-01
A large-scale parallel-unit seawater reverse osmosis desalination plant contains many reverse osmosis (RO) units. If the operating conditions change, these RO units will not work at the optimal design points which are computed before the plant is built. The operational optimization problem (OOP) of the plant is to find an operating schedule that minimizes the total running cost when the change happens. In this paper, the OOP is modelled as a mixed-integer nonlinear programming problem. A two-stage differential evolution algorithm is proposed to solve this OOP. Experimental results show that the proposed method is satisfactory in solution quality. PMID:24701180
Wang, Zhaocai; Pu, Jun; Cao, Liling; Tan, Jian
2015-10-23
The unbalanced assignment problem (UAP) is to optimally resolve the problem of assigning n jobs to m individuals (m problem in operation management and applied mathematics, having numerous real life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We reasonably design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and get the solutions of the UAP in the proper length range and O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation.
D'Angelo, Gianni; Rampone, Salvatore
2014-01-01
The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However, the memory and the execution time required by the running are of O(n^3) and of O(n^5) order, respectively, and so the algorithm is unaffordable for huge data sets. We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of
DEFF Research Database (Denmark)
Keibler, Evan; Arumugam, Manimozhiyan; Brent, Michael R
2007-01-01
MOTIVATION: Hidden Markov models (HMMs) and generalized HMMs have been successfully applied to many problems, but the standard Viterbi algorithm for computing the most probable interpretation of an input sequence (known as decoding) requires memory proportional to the length of the sequence, which can be prohibitive. Existing approaches to reducing memory usage either sacrifice optimality or trade increased running time for reduced memory. RESULTS: We developed two novel decoding algorithms, Treeterbi and Parallel Treeterbi, and implemented them in the TWINSCAN/N-SCAN gene-prediction system. The worst case ... our pair HMM based cDNA-to-genome aligner. AVAILABILITY: The TWINSCAN/N-SCAN/PAIRAGON open source software package is available from http://genes.cse.wustl.edu.
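To make the memory issue concrete, here is a minimal sketch of the standard Viterbi algorithm that the abstract takes as its starting point. It is not Treeterbi or Parallel Treeterbi. The dictionary-based model format and the function name `viterbi` are assumptions, and raw probabilities are used for brevity where a practical implementation would work in log space.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for an HMM whose probabilities are nested dicts.

    The list `back` stores one table of back-pointers per observation, so its
    size grows linearly with the sequence length. This is the memory cost
    referred to in the abstract.
    """
    best = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:
        pointers, nxt = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda r: best[r] * trans_p[r][s])
            pointers[s] = prev
            nxt[s] = best[prev] * trans_p[prev][s] * emit_p[s][obs]
        back.append(pointers)
        best = nxt
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):   # trace the stored back-pointers
        path.append(pointers[path[-1]])
    return path[::-1], best[last]
```

The traceback can only start once the final observation has been processed, which is why the whole `back` table has to be kept in this plain form.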
Ma, Haotong; Xie, Zongliang; Long, Xuejun; Qi, Bo; Ren, Ge; Shi, Jianliang; Cui, Zhangang; Jiang, Yang; Xu, Xiaojun
2015-06-01
In this paper, we propose and demonstrate synthetic aperture imaging by using spatial modulation diversity technology with the stochastic parallel gradient descent (SPGD) algorithm. Instead of creating diversity images by means of focus adjustments, the technology proposed in this paper creates diversity images by modulating the transmittance of the individual sub-apertures of a multi-aperture system. Specifically, spatial modulation is realized by alternately switching off the transmittance of each sub-aperture with electrical shutters. Based on these multiple diversity images, the SPGD algorithm is used for adaptively optimizing the coefficients of Zernike polynomials to reconstruct the real phase distortions of the multi-aperture system and to restore the near-diffraction-limited image of the object. Numerical simulation and experimental results show that this technology can be used successfully for joint estimation of both pupil aberrations and a high-resolution image of the object. The technology proposed in this paper can have wide applications in segmented and multi-aperture imaging systems.
Estimating the atmospheric correlation length with stochastic parallel gradient descent algorithm.
Yazdani, R; Hajimahmoodzadeh, M; Fallah, H R
2014-03-01
The atmospheric turbulence measurement has received much attention in various fields due to its effects on wave propagation. One of the interesting parameters for characterization of the atmospheric turbulence is the Fried parameter or the atmospheric correlation length. We numerically investigate the feasibility of estimating the Fried parameter using a simple and low-cost system based on the stochastic parallel gradient descent (SPGD) algorithm without the need for wavefront sensing. We simulate the atmospheric turbulence using Zernike polynomials and employ a wavefront sensor-less adaptive optics system based on the SPGD algorithm and report the estimated Fried parameter after compensating for atmospheric-turbulence-induced phase distortions. Several simulations for different atmospheric turbulence strengths are presented to validate the proposed method.
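The core SPGD update used in wavefront sensor-less systems of this kind can be sketched in a few lines. This is a generic two-sided SPGD loop under stated assumptions: the metric is any scalar function of the control vector, the gain and perturbation amplitude are illustrative values, and the function name `spgd_maximize` is hypothetical. It does not model turbulence or estimate the Fried parameter.

```python
import random


def spgd_maximize(metric, u0, gain=0.5, sigma=0.1, iterations=2000, seed=0):
    """Stochastic parallel gradient descent, written as ascent on `metric`.

    All control channels are perturbed at once by +sigma or -sigma. The
    resulting change in the scalar metric, measured on both sides of the
    current point, drives the update of every channel simultaneously.
    """
    rng = random.Random(seed)
    u = list(u0)
    for _ in range(iterations):
        du = [sigma * rng.choice((-1.0, 1.0)) for _ in u]
        j_plus = metric([a + d for a, d in zip(u, du)])
        j_minus = metric([a - d for a, d in zip(u, du)])
        dj = j_plus - j_minus
        u = [a + gain * dj * d for a, d in zip(u, du)]
    return u
```

In an adaptive optics setting the control vector would hold, for example, Zernike coefficients and the metric would be a measured image quality; no explicit gradient or wavefront measurement is needed, only two metric evaluations per iteration.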
Distributed parallel processing applied to an implicit multigrid Euler/Navier-Stokes algorithm
Tysinger, T. L.; Caughey, D. A.
1993-01-01
An implicit multigrid algorithm for the solution of the Euler and Navier-Stokes equations has been implemented within the framework of multiple block-structured grids in which the physical domain is spatially decomposed into several blocks and the solution is advanced in parallel on each block. Utilities have been developed to implement such a scheme in a distributed computing environment. The multi-block algorithm is designed so that the explicit residual calculation is identical to that of the single-block scheme, and therefore converged solutions for both schemes must be the same. To accelerate convergence, synchronous and asynchronous multigrid strategies are implemented. Significant speedups have been achieved in a multiple processor environment, while convergence rates similar to those of the single-block scheme are observed.
Kiesewetter, Simon; Drummond, Peter D.
2017-03-01
A variance reduction method for stochastic integration of Fokker-Planck equations is derived. This unifies the cumulant hierarchy and stochastic equation approaches to obtaining moments, giving a performance superior to either. We show that the brute force method of reducing sampling error by just using more trajectories in a sampled stochastic equation is not the best approach. The alternative of using a hierarchy of moment equations is also not optimal, as it may converge to erroneous answers. Instead, through Bayesian conditioning of the stochastic noise on the requirement that moment equations are satisfied, we obtain improved results with reduced sampling errors for a given number of stochastic trajectories. The method used here converges faster in time-step than Ito-Euler algorithms. This parallel optimized sampling (POS) algorithm is illustrated by several examples, including a bistable nonlinear oscillator case where moment hierarchies fail to converge.
A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm
Guo, Xinyu; Wang, Hong; Devabhaktuni, Vijay
2012-01-01
A design of a systolic array-based Field Programmable Gate Array (FPGA) parallel architecture for the Basic Local Alignment Search Tool (BLAST) algorithm is proposed. BLAST is a heuristic biological sequence alignment algorithm which has been used by bioinformatics experts. In contrast to other designs that detect at most one hit in one clock cycle, our design applies a Multiple Hits Detection Module, which is a pipelined systolic array, to search multiple hits in a single clock cycle. Further, we designed a Hits Combination Block which combines overlapping hits from the systolic array into one hit. These implementations completed the first and second steps of the BLAST architecture and achieved significant speedup compared with previously published architectures. PMID:25969747
Scalable parallel algorithm for configuration planning for self-reconfiguring robots
Kotay, Keith D.; Rus, Daniela L.
2000-10-01
In this paper we present algorithms for planning the motion of robotic Molecules on a substrate of other Molecules. Our approach is to divide self-reconfiguration planning into three levels: trajectory planning, configuration planning, and task-level planning. This paper focuses on algorithms for configuration planning, moving a set of Molecules from a starting configuration to a goal configuration. We describe our scaffold planning approach in which the interior of a structure contains 3D tunnels. This allows Molecules to move within a structure as well as on the surface, simplifying Molecule motion planning as well as increasing parallelism. In addition, we present a new gripper-type connection mechanism for the Molecule which does not require power to maintain connections.
Directory of Open Access Journals (Sweden)
SILVA JUNIOR,J. B.
2016-12-01
Full Text Available The Intrusion Detection System (IDS) needs to compare the contents of all packets arriving at the network interface with a set of signatures indicating possible attacks, a task that consumes much CPU processing time. To alleviate this problem, some researchers have tried to parallelize the IDS's comparison engine, transferring execution from the CPU to the GPU. This paper identifies and maps the parallelization features of the Aho-Corasick (AC) algorithm, which is used in Snort to compare patterns, in order to show this algorithm's implementation and execution issues, as well as optimization techniques for the Aho-Corasick machine. We found 147 papers in important computer science publication databases and mapped them. We selected 22 and analyzed them to obtain our results. Our analysis of the papers showed, among other results, that parallelization of the AC algorithm is a new task and that the authors have focused on the State Transition Table as the most common way to implement the algorithm on the GPU. Furthermore, we found that some techniques that speed up the algorithm and reduce the required machine storage space are widely used, such as running the algorithm in the fastest memories and mechanisms for reducing the number of nodes and bit mapping.
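The State Transition Table mentioned above can be illustrated with a minimal sketch. It is not taken from any of the surveyed papers: the function names `build_stt` and `scan`, the explicit `alphabet` argument, and the dense list-of-lists layout are assumptions made for illustration. The failure links of the Aho-Corasick automaton are folded into the table, so matching needs one table lookup per input symbol, which is the property that tends to suit GPU threads.

```python
from collections import deque

def build_stt(patterns, alphabet):
    """Build a dense Aho-Corasick state transition table (a full DFA)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    goto = [[-1] * len(alphabet)]    # goto[state][symbol] -> next state
    out = [set()]                    # patterns recognised on entering a state
    for p in patterns:               # phase 1: insert the patterns into a trie
        s = 0
        for ch in p:
            c = index[ch]
            if goto[s][c] == -1:
                goto.append([-1] * len(alphabet))
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    fail = [0] * len(goto)
    queue = deque()
    for c in range(len(alphabet)):   # missing root edges loop back to the root
        if goto[0][c] == -1:
            goto[0][c] = 0
        else:
            queue.append(goto[0][c])
    while queue:                     # phase 2: BFS, folding failure links in
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for c in range(len(alphabet)):
            t = goto[s][c]
            if t == -1:
                goto[s][c] = goto[fail[s]][c]
            else:
                fail[t] = goto[fail[s]][c]
                queue.append(t)
    return goto, out, index

def scan(text, goto, out, index):
    """Report (start position, pattern) for every match in text."""
    state, hits = 0, []
    for pos, ch in enumerate(text):
        state = goto[state][index[ch]]   # exactly one lookup per symbol
        for p in out[state]:
            hits.append((pos - len(p) + 1, p))
    return sorted(hits)
```

In a GPU setting the table would typically be flattened into a single array and the input split into overlapping chunks, one per thread; that part is omitted here.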
Dong, Yu-Shuang; Xu, Gao-Chao; Fu, Xiao-Dong
2014-01-01
The cloud platform provides various services to users. More and more cloud centers provide infrastructure as the main way of operating. To improve the utilization rate of the cloud center and to decrease the operating cost, the cloud center provides services according to requirements of users by sharding the resources with virtualization. Considering both QoS for users and cost saving for cloud computing providers, we try to maximize performance and minimize energy cost as well. In this paper, we propose a distributed parallel genetic algorithm (DPGA) of placement strategy for virtual machine deployment on a cloud platform. In the first stage, it executes the genetic algorithm in parallel and in a distributed manner on several selected physical hosts. Then it continues to execute the genetic algorithm of the second stage with solutions obtained from the first stage as the initial population. The solution calculated by the genetic algorithm of the second stage is the optimal one of the proposed approach. The experimental results show that the proposed placement strategy of VM deployment can ensure QoS for users and that it is more effective and more energy efficient than other placement strategies on the cloud platform.
A General-division Grid Pattern Delaunay-TIN Parallel Algorithm
Directory of Open Access Journals (Sweden)
HAN Yuanli
2015-06-01
Full Text Available This paper presents a new Delaunay triangulation algorithm. First, a self-adaptive grid space division is proposed to achieve a balanced logical grid division for massive point cloud data. Second, the points in each grid cell are sorted from far to near by their distance to the cell center, and the nearest point is marked as the central point. Third, the TIN is built by a new general-division Delaunay triangulation algorithm, which uses the traditional insertion method to build the TIN and adds only one point from each grid cell at a time to form a new TIN. When building the TIN, we first use the find-insertion method and thereafter the topology-insertion method to maintain high efficiency. The algorithm is efficient because it avoids the merge process of sub-grid triangulation meshes. Finally, a topological closure detection mechanism is established, and independent parallel multithreading is started to model the remaining points with the topology-insertion algorithm, limited to each grid space, which makes the triangulation modeling of the whole space efficient. The method markedly improves the capacity of space modeling to support massive point cloud data.
Fine-grained parallel RNAalifold algorithm for RNA secondary structure prediction on FPGA.
Xia, Fei; Dou, Yong; Zhou, Xingming; Yang, Xuejun; Xu, Jiaqing; Zhang, Yang
2009-01-30
In the field of RNA secondary structure prediction, the RNAalifold algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers including parallel computers or multi-core computers exhibit parallel efficiency of no more than 50%. Field Programmable Gate-Array (FPGA) chips provide a new approach to accelerate RNAalifold by exploiting fine-grained custom design. RNAalifold shows complicated data dependences, in which the dependence distance is variable, and the dependence direction is also across two dimensions. We propose a systolic array structure including one master Processing Element (PE) and multiple slave PEs for fine grain hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce energy table parameter size by 80%. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete RNAalifold algorithm. The experimental results show a factor of 12.2 speedup over the RNAalifold (ViennaPackage - 1.6.5) software for a group of aligned RNA sequences of 2981 residues, with the software running on a Personal Computer (PC) platform with a Pentium 4 2.6 GHz CPU.
An Efficient MapReduce-Based Parallel Clustering Algorithm for Distributed Traffic Subarea Division
Directory of Open Access Journals (Sweden)
Dawen Xia
2015-01-01
Full Text Available Traffic subarea division is vital for traffic system management and traffic network analysis in intelligent transportation systems (ITSs). Since existing methods may not be suitable for big traffic data processing, this paper presents a MapReduce-based Parallel Three-Phase K-Means (Par3PKM) algorithm for solving the traffic subarea division problem on the widely adopted Hadoop distributed computing platform. Specifically, we first modify the distance metric and initialization strategy of K-Means and then employ the MapReduce paradigm to redesign the optimized K-Means algorithm for parallel clustering of large-scale taxi trajectories. Moreover, we propose a boundary identifying method to connect the borders of clustering results for each cluster. Finally, we divide the traffic subareas of Beijing based on real-world trajectory data sets generated by 12,000 taxis over a period of one month using the proposed approach. Experimental evaluation results indicate that, compared with K-Means, Par2PK-Means, and ParCLARA, Par3PKM achieves higher efficiency, more accuracy, and better scalability and can effectively divide traffic subareas with big taxi trajectory data.
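How a K-Means iteration decomposes into a map step and a reduce step can be sketched as follows. This is a plain single-process illustration, not the Par3PKM algorithm: it uses ordinary squared Euclidean distance rather than the modified metric and initialization the abstract refers to, and the names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical.

```python
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        yield dists.index(min(dists)), p

def reduce_phase(pairs, centroids):
    """Reduce: average the points that share a key into a new centroid."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    new = list(centroids)            # an empty cluster keeps its old centroid
    for key, pts in groups.items():
        new[key] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return new

def kmeans(points, centroids, iterations=10):
    """Repeat map + reduce for a fixed number of iterations."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids
```

On a platform such as Hadoop the map calls would run on separate data splits and the framework would group the emitted pairs by key before the reduce step; here a dictionary stands in for that shuffle.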
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD
Gropp, W. D.; Keyes, D. E.; McInnes, L. C.; Tidriri, M. D.
1998-01-01
Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, "routine" parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (Psi-NKS) algorithmic framework is presented as an answer. We show that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, Psi-NKS can simultaneously deliver: globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of Psi-NKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of Psi-NKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
Bernabé, Sergio; Martin, Gabriel; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2016-04-01
In recent years, hyperspectral analysis has been applied in many remote sensing applications. In particular, hyperspectral unmixing has been a challenging task in hyperspectral data exploitation. This process consists of three stages: (i) estimation of the number of pure spectral signatures or endmembers, (ii) automatic identification of the estimated endmembers, and (iii) estimation of the fractional abundance of each endmember in each pixel of the scene. However, unmixing algorithms can be computationally very expensive, a fact that compromises their use in applications under real-time constraints. In recent years, several techniques have been proposed to solve the aforementioned problem, but until now most works have focused on the second and third stages. The execution cost of the first stage is usually lower than that of the other stages. Indeed, it can be optional if we know this estimation a priori. However, its acceleration on parallel architectures is still an interesting and open problem. In this paper we address this issue, focusing on the GENE algorithm, a promising geometry-based proposal introduced in [1]. We have evaluated our parallel implementation in terms of both accuracy and computational performance through Monte Carlo simulations for real and synthetic data experiments. Performance results on a modern GPU show satisfactory 16x speedup factors, which allow us to expect that this method could meet real-time requirements on a fully operational unmixing chain.
Energy Technology Data Exchange (ETDEWEB)
Madduri, Kamesh; Bader, David A.
2009-02-15
Graph-theoretic abstractions are extensively used to analyze massive data sets. Temporal data streams from socioeconomic interactions, social networking web sites, communication traffic, and scientific computing can be intuitively modeled as graphs. We present the first study of novel high-performance combinatorial techniques for analyzing large-scale information networks, encapsulating dynamic interaction data on the order of billions of entities. We present new data structures to represent dynamic interaction networks, and discuss algorithms for processing parallel insertions and deletions of edges in small-world networks. With these new approaches, we achieve an average performance rate of 25 million structural updates per second and a parallel speedup of nearly 28 on a 64-way Sun UltraSPARC T2 multicore processor, for insertions and deletions to a small-world network of 33.5 million vertices and 268 million edges. We also design parallel implementations of fundamental dynamic graph kernels related to connectivity and centrality queries. Our implementations are freely distributed as part of the open-source SNAP (Small-world Network Analysis and Partitioning) complex network analysis framework.
The lower bound on complexity of parallel branch-and-bound algorithm for subset sum problem
Kolpakov, Roman; Posypkin, Mikhail
2016-10-01
The subset sum problem is a particular case of the Boolean knapsack problem in which each item has a price equal to its weight. This problem can be informally stated as searching for the densest packing of a set of items into a box with limited capacity. Recently, coarse-grain parallelization approaches to the Branch-and-Bound (B&B) method have attracted some attention due to the growing popularity of weakly-connected distributed computing platforms. In this paper we consider one such approach for solving the subset sum problem. In the first stage, one of the processors (the manager) performs some number of B&B steps, generating some subproblems. In the second stage, the generated subproblems are sent to other processors, one subproblem per processor. The processors completely solve the received subproblems; the manager collects all the obtained solutions and chooses the optimal one. For this algorithm we formally define the parallel execution model (frontal scheme of parallelization) and the notion of the frontal scheme complexity. We study the frontal scheme complexity for a series of subset sum problems.
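The two-stage scheme described above can be sketched roughly as follows. This is an illustrative reading, not the authors' formal model: the function names `solve` and `frontal_scheme`, the `depth` parameter controlling how far the manager expands the tree, and the simple remaining-weight bound are all assumptions, and the second stage runs sequentially here where a real system would hand each subproblem to a separate processor. Weights are assumed to be positive integers.

```python
def solve(weights, capacity, start, chosen_sum):
    """Depth-first B&B: best sum <= capacity reachable from a partial solution."""
    best = chosen_sum
    suffix = [0] * (len(weights) + 1)        # suffix[i] = total weight of items i..end
    for i in range(len(weights) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + weights[i]

    def dfs(i, total):
        nonlocal best
        best = max(best, total)
        if i == len(weights) or best == capacity:
            return
        if total + suffix[i] <= best:        # bound: this branch cannot beat best
            return
        if total + weights[i] <= capacity:
            dfs(i + 1, total + weights[i])   # branch: take item i
        dfs(i + 1, total)                    # branch: skip item i

    dfs(start, chosen_sum)
    return best

def frontal_scheme(weights, capacity, depth):
    """Manager expands `depth` levels, then every subproblem is solved on its own."""
    frontier = [(0, 0)]                      # (next item index, sum so far)
    for _ in range(min(depth, len(weights))):
        expanded = []
        for i, total in frontier:
            expanded.append((i + 1, total))
            if total + weights[i] <= capacity:
                expanded.append((i + 1, total + weights[i]))
        frontier = expanded
    # Stage 2: each call would go to a different processor in a parallel run.
    results = [solve(weights, capacity, i, total) for i, total in frontier]
    return max(results)                      # the manager keeps the best answer
```

Because the subproblems share no state, the list comprehension in the second stage could be replaced by any parallel map without changing the result.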
The Parallel Implementation of Algorithms for Finding the Reflection Symmetry of the Binary Images
Fedotova, S.; Seredin, O.; Kushnir, O.
2017-05-01
In this paper, we investigate an exact method of finding the symmetry axis of a binary image, based on brute-force search among all potential symmetry axes. As a measure of symmetry, we use the set-theoretic Jaccard similarity applied to the two subsets of image pixels obtained by dividing the image along an axis. The brute-force search algorithm is guaranteed to find the axis of approximate symmetry, which can be considered the ground truth, but it requires a lot of time to process each image. As the first step of our contribution, we develop a parallel version of the brute-force algorithm. It allows us to process large image databases and obtain the desired axis of approximate symmetry for each shape in the database. Experimental studies on the "Butterflies" and "Flavia" datasets have shown that the proposed algorithm takes several minutes per image to find a symmetry axis. However, real-world applications need computational efficiency that allows the symmetry axis search to be solved in real or quasi-real time. So, for fast shape symmetry calculation on a common multicore PC, we developed another parallel program, which is based on the procedure suggested earlier in (Fedotova, 2016). That method takes as its initial axis the axis obtained by superfast comparison of two skeleton primitive sub-chains. This process takes about 0.5 sec on a common PC, considerably faster than any of the optimized brute-force methods, including those implemented on a supercomputer. In our experiments, in 70 percent of cases the axis found coincides exactly with the ground-truth one, and in the remaining cases it is very close to the ground truth.
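The Jaccard-based symmetry measure can be made concrete with a small sketch. It is a simplification under stated assumptions: only vertical axes are searched (the paper considers all potential axes), the shape is given as a set of integer `(x, y)` pixel coordinates, and the names `jaccard`, `symmetry_score`, and `best_vertical_axis` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def symmetry_score(pixels, axis2):
    """Score the vertical axis x = axis2 / 2 (doubling keeps half-pixel axes integral)."""
    left = {(x, y) for x, y in pixels if 2 * x < axis2}
    right = {(x, y) for x, y in pixels if 2 * x > axis2}
    mirrored = {(axis2 - x, y) for x, y in left}   # reflect the left part
    return jaccard(mirrored, right)

def best_vertical_axis(pixels):
    """Brute-force search over every candidate vertical axis of the shape."""
    xs = [x for x, _ in pixels]
    candidates = range(2 * min(xs), 2 * max(xs) + 1)
    return max(candidates, key=lambda a2: symmetry_score(pixels, a2))
```

Each candidate axis is scored independently of the others, which is what makes the brute-force search straightforward to spread across cores.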
Algorithms and data structures for massively parallel generic adaptive finite element codes
Bangerth, Wolfgang
2011-12-01
Today's largest supercomputers have 100,000s of processor cores and offer the potential to solve partial differential equations discretized by billions of unknowns. However, the complexity of scaling to such large machines and problem sizes has so far prevented the emergence of generic software libraries that support such computations, although these would lower the threshold of entry and enable many more applications to benefit from large-scale computing. We are concerned with providing this functionality for mesh-adaptive finite element computations. We assume the existence of an "oracle" that implements the generation and modification of an adaptive mesh distributed across many processors, and that responds to queries about its structure. Based on querying the oracle, we develop scalable algorithms and data structures for generic finite element methods. Specifically, we consider the parallel distribution of mesh data, global enumeration of degrees of freedom, constraints, and postprocessing. Our algorithms remove the bottlenecks that typically limit large-scale adaptive finite element analyses. We demonstrate scalability of complete finite element workflows on up to 16,384 processors. An implementation of the proposed algorithms, based on the open source software p4est as mesh oracle, is provided under an open source license through the widely used deal.II finite element software library. © 2011 ACM 0098-3500/2011/12-ART10 $10.00.
Rausch, Tobias; Thomas, Alun; Camp, Nicola J.; Cannon-Albright, Lisa A.; Facelli, Julio C.
2008-01-01
This paper describes a novel algorithm to analyze genetic linkage data using pattern recognition techniques and genetic algorithms (GA). The method allows a search for regions of the chromosome that may contain genetic variations that jointly predispose individuals to a particular disease. The method uses correlation analysis, filtering theory and genetic algorithms to achieve this goal. Because current genome scans use from hundreds to hundreds of thousands of markers, two versions of the method have been implemented. The first is an exhaustive analysis version that can be used to visualize, explore, and analyze small genetic data sets for two-marker correlations; the second is a GA version, which uses a parallel implementation allowing searches of higher-order correlations in large data sets. Results on simulated data sets indicate that the method can be informative in the identification of major disease loci and gene-gene interactions in genome-wide linkage data and that further exploration of these techniques is justified. The results presented for both variants of the method show that it can help genetic epidemiologists to identify promising combinations of genetic factors that might predispose to complex disorders. In particular, the correlation analysis of IBD expression patterns might hint at possible gene-gene interactions, and the filtering might be a fruitful approach to distinguish true correlation signals from noise. PMID:18547558
Subramanian, Nithya
Optimization under uncertainty accounts for design variables and external parameters or factors with probabilistic distributions instead of fixed deterministic values; it enables problem formulations that might maximize or minimize an expected value while satisfying constraints using probabilities. For discrete optimization under uncertainty, a Monte Carlo Sampling (MCS) approach enables high-accuracy estimation of expectations but also results in high computational expense. The Genetic Algorithm (GA) with a Population-Based Sampling (PBS) technique enables optimization under uncertainty with discrete variables at a lower computational expense than using Monte Carlo sampling for every fitness evaluation. Population-Based Sampling uses fewer samples in the exploratory phase of the GA and a larger number of samples when 'good designs' start emerging over the generations. This sampling technique therefore reduces the computational effort spent on 'poor designs' found in the initial phase of the algorithm. Parallel computation evaluates the expected value of the objective and constraints in parallel to reduce wall-clock time. A customized stopping criterion is also developed for the GA with Population-Based Sampling. The stopping criterion requires the design with the minimum expected fitness value to have at least 99% constraint satisfaction and to have accumulated at least 10,000 samples. The average change in expected fitness values in the last ten consecutive generations is also monitored. The optimization of composite laminates using ply orientation angle as a discrete variable provides an example to demonstrate further developments of the GA with Population-Based Sampling for discrete optimization under uncertainty. The focus problem aims to reduce the expected weight of the composite laminate while treating the laminate's fiber volume fraction and externally applied loads as uncertain quantities following normal distributions. Construction of
Directory of Open Access Journals (Sweden)
Fang Huang
2017-12-01
Full Text Available Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm that can discover clusters of any shape, effectively distinguish noise points, and naturally support spatial databases. DBSCAN has been widely used in the field of spatial data mining. This paper studies the parallelization design and realization of the DBSCAN algorithm based on the Spark platform, and solves the following problems that arise when computing macro data: the requirement of a great deal of calculation using the single-node algorithm; the low level of resource utilization with the multi-node algorithm; the large time consumption; and the lack of instantaneity. The experimental results indicate that the proposed parallel algorithm design achieves a more stable speedup as the scale of the spatial data involved increases.
Fast 2D DOA Estimation Algorithm by an Array Manifold Matching Method with Parallel Linear Arrays.
Yang, Lisheng; Liu, Sheng; Li, Dong; Jiang, Qingping; Cao, Hailin
2016-02-23
In this paper, the problem of two-dimensional (2D) direction-of-arrival (DOA) estimation with parallel linear arrays is addressed. Two array manifold matching (AMM) approaches are developed, for incoherent and coherent signals, respectively. The proposed AMM methods estimate only the azimuth angle, under the assumption that the elevation angles are known or have been estimated. The proposed methods are time efficient since they do not require eigenvalue decomposition (EVD) or peak searching. In addition, the complexity analysis shows the proposed AMM approaches have lower computational complexity than many current state-of-the-art algorithms. The estimated azimuth angles produced by the AMM approaches are automatically paired with the elevation angles. More importantly, for estimating the azimuth angles of coherent signals, the aperture loss issue is avoided since a decorrelation procedure is not required for the proposed AMM method. Numerical studies demonstrate the effectiveness of the proposed approaches.
Progress in parallel implementation of the multilevel plane wave time domain algorithm
Liu, Yang
2013-07-01
The computational complexity and memory requirements of classical schemes for evaluating transient electromagnetic fields produced by $N_s$ dipoles active for $N_t$ time steps scale as $O(N_t N_s^2)$ and $O(N_s^2)$, respectively. The multilevel plane wave time domain (PWTD) algorithm [A.A. Ergin et al., Antennas and Propagation Magazine, IEEE, vol. 41, pp. 39-52, 1999], viz. the extension of the frequency domain fast multipole method (FMM) to the time domain, reduces the above costs to $O(N_t N_s \log^2 N_s)$ and $O(N_s^{\alpha})$, with $\alpha = 1.5$ for surface current distributions and $\alpha = 4/3$ for volumetric ones. Its favorable computational and memory costs notwithstanding, serial implementations of the PWTD scheme unfortunately remain somewhat limited in scope and ill-suited to tackle complex real-world scattering problems, and parallel implementations are called for. © 2013 IEEE.
Implementing O(N) N-Body Algorithms Efficiently in Data-Parallel Languages
Directory of Open Access Journals (Sweden)
Yu Hu
1996-01-01
Full Text Available The optimization techniques for hierarchical O(N) N-body algorithms described here focus on managing the data distribution and the data references, both between the memories of different nodes and within the memory hierarchy of each node. We show how the techniques can be expressed in data-parallel languages, such as High Performance Fortran (HPF) and Connection Machine Fortran (CMF). The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method for the Connection Machine system CM-5/5E. Communication accounts for about 10–20% of the total execution time, with the average efficiency for arithmetic operations being about 40% and the total efficiency (including communication) being about 35%. For the CM-5E, a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node) has been measured.
Introduction of Parallel GPGPU Acceleration Algorithms for the Solution of Radiative Transfer
Godoy, William F.; Liu, Xu
2011-01-01
General-purpose computing on graphics processing units (GPGPU) is a recent technique that allows the parallel graphics processing unit (GPU) to accelerate calculations performed sequentially by the central processing unit (CPU). To introduce GPGPU to radiative transfer, the Gauss-Seidel solution of the well-known expressions for 1-D and 3-D homogeneous, isotropic media is selected as a test case. Different algorithms are introduced to balance memory and GPU-CPU communication, critical aspects of GPGPU. Results show that speed-ups of one to two orders of magnitude are obtained when compared to sequential solutions. The underlying value of GPGPU is its potential extension in radiative solvers (e.g., Monte Carlo, discrete ordinates) at a minimal learning curve.
Yang, Huizhen; Li, Xinyang; Gong, Chenglong; Jiang, Wenhan
2009-03-02
An adaptive optics (AO) system with the Stochastic Parallel Gradient Descent (SPGD) algorithm and a 61-element deformable mirror is simulated to restore the image of a turbulence-degraded extended object. SPGD is used to search for the optimum voltages for the actuators of the deformable mirror. We try to find a convenient image performance metric, which is needed by SPGD, merely from a gray level distorted image and without any additional optical elements. Simulation results show that the gray level variance function is more promising than other metrics, such as metrics based on the gray level gradient of each pixel. The restoration capability of the AO system is investigated with different images and wave-front aberrations of different turbulence strengths, using SPGD with the above image quality criterion. Numerical simulation results verify that the performance metric is effective and that the AO system can successfully restore images degraded by different turbulence strengths.
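The SPGD iteration itself is short enough to sketch. The version below is the common two-sided form, in which all control channels are perturbed at once by random signs and updated from a single scalar change in the metric; it is not the authors' simulation. The function name `spgd`, the parameter values, and the toy metric used in the usage note (negative squared distance to a hidden target, standing in for an image quality metric such as gray level variance) are assumptions for illustration.

```python
import random

def spgd(metric, u, gain=5.0, sigma=0.1, iterations=300, seed=0):
    """Maximise metric(u) by stochastic parallel gradient descent (two-sided form)."""
    rng = random.Random(seed)
    u = list(u)
    for _ in range(iterations):
        # Perturb every control channel simultaneously with a random sign.
        d = [sigma * rng.choice((-1.0, 1.0)) for _ in u]
        j_plus = metric([ui + di for ui, di in zip(u, d)])
        j_minus = metric([ui - di for ui, di in zip(u, d)])
        dj = j_plus - j_minus
        # Every channel is updated in parallel from the single scalar dj.
        u = [ui + gain * dj * di for ui, di in zip(u, d)]
    return u
```

A possible usage, with four hypothetical actuator voltages: `spgd(lambda v: -sum((a - b) ** 2 for a, b in zip(v, [1.0, -2.0, 0.5, 3.0])), [0.0] * 4)` drives the voltages toward the target. The gain and perturbation size must be chosen together; for this quadratic toy metric the iteration does not diverge as long as `2 * gain * len(u) * sigma ** 2` stays below 1.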
Energy Technology Data Exchange (ETDEWEB)
Schatz, Martin D. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Kolda, Tamara G. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); van de Geijn, Robert [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2015-09-01
Large-scale datasets in computational chemistry typically require distributed-memory parallel methods to perform a special operation known as tensor contraction. Tensors are multidimensional arrays, and a tensor contraction is akin to matrix multiplication with special types of permutations. Creating an efficient algorithm and optimized implementation in this domain is complex, tedious, and error-prone. To address this, we develop a notation to express data distributions so that we can use automated methods to find optimized implementations for tensor contractions. We consider the spin-adapted coupled cluster singles and doubles method from computational chemistry and use our methodology to produce an efficient implementation. Experiments performed on the IBM Blue Gene/Q and Cray XC30 demonstrate both improved performance and reduced memory consumption.
Directory of Open Access Journals (Sweden)
Zhaocai Wang
2015-10-01
Full Text Available The unbalanced assignment problem (UAP) is the problem of optimally assigning n jobs to m individuals (m < n) such that minimum cost or maximum profit is obtained. It is a vitally important Non-deterministic Polynomial (NP) complete problem in operation management and applied mathematics, with numerous real-life applications. In this paper, we present a new parallel DNA algorithm for solving the unbalanced assignment problem using DNA molecular operations. We design flexible-length DNA strands representing different jobs and individuals, take appropriate steps, and obtain the solutions of the UAP in the proper length range and in O(mn) time. We extend the application of DNA molecular operations and simultaneity to simplify the complexity of the computation.
Wang, Zhaocai; Ji, Zuwen; Wang, Xiaoming; Wu, Tunhua; Huang, Wei
2017-12-01
As a promising approach to solving computationally intractable problems, DNA computing is an emerging research area spanning mathematics, computer science and molecular biology. The task scheduling problem, a well-known NP-complete problem, assigns n jobs to m individuals and seeks the minimum execution time of the last individual to finish. In this paper, we use a biologically inspired computational model and describe a new parallel algorithm to solve the task scheduling problem by basic DNA molecular operations. We design flexible-length DNA strands to represent elements of the allocation matrix, take appropriate biological experiment operations, and obtain solutions of the task scheduling problem in the proper length range with less than $O(n^2)$ time complexity. Copyright © 2017. Published by Elsevier B.V.
Directory of Open Access Journals (Sweden)
Zhong-Kai Feng
2017-01-01
Full Text Available With the increasingly serious energy crisis and environmental pollution, the short-term economic environmental hydrothermal scheduling (SEEHTS) problem is becoming more and more important in modern electrical power systems. In order to handle the SEEHTS problem efficiently, the parallel multi-objective genetic algorithm (PMOGA) is proposed in the paper. Based on the Fork/Join parallel framework, PMOGA divides the whole population of individuals into several subpopulations which evolve on different cores simultaneously. In this way, PMOGA can avoid wasting computational resources and increase the population diversity. Moreover, a constraint handling technique is used to handle the complex constraints in SEEHTS, and a selection strategy based on constraint violation is also employed to ensure the convergence speed and solution feasibility. The results from a hydrothermal system in different cases indicate that PMOGA can make the most of system resources to significantly improve the computing efficiency and solution quality. Moreover, PMOGA has competitive performance in SEEHTS when compared with several other methods reported in the previous literature, providing a new approach for the operation of hydrothermal systems.
Directory of Open Access Journals (Sweden)
Zhi Chen
2016-01-01
Full Text Available The extensive applications of support vector machines (SVMs) require an efficient method of constructing an SVM classifier with high classification ability. The performance of an SVM crucially depends on whether the optimal feature subset and SVM parameters can be efficiently obtained. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to simultaneously optimize the feature subset and parameters for the SVM. The distributed topology and migration policy of CGPGA can help find the optimal feature subset and parameters for the SVM in significantly shorter time, so as to increase the quality of the solution found. In addition, a new fitness function, which combines the classification accuracy obtained from the bootstrap method, the number of chosen features, and the number of support vectors, is proposed to lead the search of CGPGA in the direction of optimal generalization error. Experimental results on 12 benchmark datasets show that our proposed approach outperforms the genetic algorithm (GA) based method and the grid search method in terms of classification accuracy, number of chosen features, number of support vectors, and running time.
Algorithm and Implementation of Distributed ESN Using Spark Framework and Parallel PSO
Directory of Open Access Journals (Sweden)
Kehe Wu
2017-04-01
The echo state network (ESN) employs a huge reservoir with sparsely and randomly connected internal nodes and trains only the output weights, which avoids the suboptimal solutions, exploding and vanishing gradients, high complexity and other disadvantages of traditional recurrent neural network (RNN) training. Because it adapts well to nonlinear dynamical systems, the ESN has been used in a wide range of applications. However, in the era of Big Data, with an enormous amount of data being generated continuously every day, the data in real applications are often distributed across storage, and the centralized ESN training process is therefore often technologically unsuitable. To meet the requirements of real-world Big Data applications, in this study we propose an algorithm and its implementation for distributed ESN training. The algorithm is based on the parallel particle swarm optimization (P-PSO) technique, and the implementation uses Spark, a well-known large-scale data processing framework. Four extremely large-scale datasets, including artificial benchmarks, real-world data and image data, are adopted to verify our framework on a stretchable platform. Experimental results indicate that the proposed work performs well in the era of Big Data with regard to speed, accuracy and generalization capabilities.
A new Lagrangian Relaxation Algorithm for scheduling dissimilar parallel machines with release dates
Tang, Lixin; Zhang, Yanyan
2011-07-01
In this article we investigate the parallel machine scheduling problem with job release dates, focusing on the case in which the machines are dissimilar to each other. The goal of scheduling is to find an assignment and sequence for a set of jobs so that the total weighted completion time is minimised. This type of production environment is frequently encountered in process industries, such as the chemical and steel industries, where the scheduling of jobs with different purposes is an important goal. This article formulates the problem as an integer linear programming model. Because of the dissimilarity of the machines, the ordinary job-based decomposition method is no longer applicable, so a novel machine-based Lagrangian relaxation algorithm is proposed. Penalty terms associated with violations of the coupling constraints are introduced into the objective function by Lagrangian multipliers, which are updated using the subgradient optimisation method. For each machine-level subproblem after decomposition, a forward dynamic programming algorithm is designed together with the weighted shortest processing time rule to provide an optimal solution. A heuristic is developed to obtain a feasible schedule from the solution of the subproblems and so provide an upper bound. Numerical results show that the new approach is computationally effective in handling the addressed problem and provides high-quality schedules.
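The weighted shortest processing time (WSPT) rule named in this abstract can be illustrated with a minimal Python sketch. It covers only the classical single-machine case without release dates, where ordering jobs by their weight-to-processing-time ratio is known to minimise total weighted completion time. It is not the authors' machine-level dynamic programme, and the job tuples and function names are hypothetical.

```python
def wspt_order(jobs):
    """Order jobs by non-increasing weight / processing-time ratio (WSPT).

    jobs: list of (name, processing_time, weight) tuples (hypothetical layout).
    """
    return sorted(jobs, key=lambda job: job[2] / job[1], reverse=True)


def total_weighted_completion(sequence):
    """Sum of weight * completion time when jobs run back to back from t = 0."""
    clock = 0
    total = 0
    for _name, processing_time, weight in sequence:
        clock += processing_time
        total += weight * clock
    return total


jobs = [("A", 3, 1), ("B", 1, 2), ("C", 2, 2)]
schedule = wspt_order(jobs)
print([name for name, _, _ in schedule], total_weighted_completion(schedule))
```

With release dates and dissimilar machines, as in the article, this simple sort is no longer sufficient on its own, which is why the abstract pairs the rule with dynamic programming.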
Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran
2017-03-01
In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on a state-of-the-art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging because of its high demand for compute time, memory, storage and I/O, along with the need to manage them effectively. The most resource-intensive modules of the algorithm are the traveltime calculations and the migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on how the calculated traveltimes are stored and fed to the migration process. The presented work is an extension of our previous work, in which a 3D Kirchhoff depth migration application for a multicore CPU based parallel system was developed. Recently, we have worked on improving the parallel performance of this application by redesigning the parallelization approach. The new algorithm is capable of efficiently migrating both prestack and poststack 3D data. It can migrate a large number of traces within the available node memory with minimal storage, I/O and inter-node communication requirements. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. The parallel performance of the algorithm is studied using different numerical experiments, and the scalability results show a striking improvement over its previous version. A 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and a 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm, with high scalability and efficiency on a multicore CPU cluster.
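The efficiency figures quoted in this abstract are consistent with the usual definition, efficiency = speedup / number of workers, taking the 64 nodes as the worker count. That choice is an assumption, since the abstract does not state the definition it uses. A short Python check, with a hypothetical function name:

```python
def parallel_efficiency(speedup, workers):
    """Parallel efficiency as speedup divided by the number of workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Figures quoted in the abstract, both on 64 nodes.
for label, speedup in (("prestack", 49.05), ("poststack", 32.00)):
    print(label, f"{100 * parallel_efficiency(speedup, 64):.2f}%")
```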
Unterkircher, A
2005-01-01
We propose methods for parallel assembling and iterative equation solving based on graph algorithms. The assembling technique is independent of dimension, element type and model shape. As a parallel solving technique we construct a multiplicative symmetric Schwarz preconditioner for the conjugate gradient method. Both methods have been incorporated into a non-linear FE code to simulate 3D metal extrusion processes. We illustrate the efficiency of these methods on shared memory computers by realistic examples.
Bernabe, Sergio; Igual, Francisco D.; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2015-10-01
In the last decade, the issue of endmember variability has received considerable attention, particularly when each pixel is modeled as a linear combination of endmembers or pure materials. As a result, several models and algorithms have been developed for considering the effect of endmember variability in spectral unmixing and possibly including multiple endmembers in the spectral unmixing stage. One of the most popular approaches for this purpose is the multiple endmember spectral mixture analysis (MESMA) algorithm. The procedure executed by MESMA can be summarized as follows: (i) first, a standard linear spectral unmixing (LSU) or fully constrained linear spectral unmixing (FCLSU) algorithm is run in an iterative fashion; (ii) then, different endmember combinations, randomly selected from a spectral library, are used to decompose each mixed pixel; (iii) finally, the model with the best fit, i.e., with the lowest root mean square error (RMSE) in the reconstruction of the original pixel, is adopted. However, this procedure can be computationally very expensive because several endmember combinations need to be tested and several abundance estimation steps need to be conducted, which compromises the use of MESMA in applications under real-time constraints. In this paper we develop (for the first time in the literature) an efficient implementation of MESMA on different platforms using OpenCL, an open standard for parallel programming on heterogeneous systems. Our experiments have been conducted using a simulated data set and the clMAGMA mathematical library. Implementations of this kind, with the same descriptive language on different architectures, are very important for assessing the possibility of using heterogeneous platforms for efficient hyperspectral image processing in real remote sensing missions.
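Step (iii) of the procedure summarized above, choosing the model with the lowest RMSE, can be sketched in a few lines of Python. This is only an illustration of the selection step: the candidate reconstructions are made-up numbers, the names are hypothetical, and the unmixing that would produce each reconstruction (LSU or FCLSU) is not shown.

```python
import math


def rmse(observed, reconstructed):
    """Root mean square error between a pixel spectrum and its reconstruction."""
    if len(observed) != len(reconstructed):
        raise ValueError("spectra must have the same number of bands")
    squared = sum((o - r) ** 2 for o, r in zip(observed, reconstructed))
    return math.sqrt(squared / len(observed))


def best_model(observed, candidates):
    """Return the name of the candidate reconstruction with the lowest RMSE.

    candidates: dict mapping a model name to its reconstructed spectrum.
    """
    return min(candidates, key=lambda name: rmse(observed, candidates[name]))


pixel = [0.20, 0.40, 0.60]
candidates = {
    "soil+vegetation": [0.20, 0.40, 0.70],   # illustrative values only
    "soil+water": [0.20, 0.40, 0.61],
}
print(best_model(pixel, candidates))
```

Because every endmember combination needs its own abundance estimate before this comparison, the cost grows with the number of combinations tested, which is the expense the abstract refers to.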
Parallel Landscape Driven Data Reduction & Spatial Interpolation Algorithm for Big LiDAR Data
Directory of Open Access Journals (Sweden)
Rahil Sharma
2016-06-01
Airborne Light Detection and Ranging (LiDAR) topographic data provide highly accurate digital terrain information, which is used widely in applications like creating flood insurance rate maps, forest and tree studies, coastal change mapping, soil and landscape classification, 3D urban modeling, river bank management, agricultural crop studies, etc. In this paper, we focus mainly on the use of LiDAR data in terrain modeling/Digital Elevation Model (DEM) generation. Technological advancements in building LiDAR sensors have enabled highly accurate and highly dense LiDAR point clouds, which have made possible high resolution modeling of terrain surfaces. However, high density data result in massive data volumes, which pose computing issues. The computational time required for dissemination, processing and storage of these data is directly proportional to the volume of the data. We describe a novel technique based on the slope map of the terrain, which addresses a challenging problem in spatial data analysis: reducing this dense LiDAR data without sacrificing its accuracy. To the best of our knowledge, this is the first landscape-driven data reduction algorithm. We also perform an empirical study, which shows that there is no significant loss in accuracy for the DEM generated from a 52% reduced LiDAR dataset produced by our algorithm, compared to the DEM generated from the original, complete LiDAR dataset. For the accuracy of our statistical analysis, we compute the Root Mean Square Error (RMSE) over all of the grid points of the original DEM and the DEM generated from the reduced data, instead of comparing a few random control points. In addition, our multi-core data reduction algorithm is highly scalable. We also describe a modified parallel Inverse Distance Weighted (IDW) spatial interpolation method and show that it is time-efficient and generates DEMs with better accuracy than the ones generated by the traditional IDW method.
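For readers unfamiliar with IDW, the following Python sketch shows the traditional, sequential inverse distance weighted estimate at a single grid point. It is not the modified parallel method described in the abstract; the sample layout, the default power of 2 and the function name are assumptions made for illustration.

```python
import math


def idw(x, y, samples, power=2.0):
    """Traditional inverse distance weighted estimate at (x, y).

    samples: iterable of (sx, sy, elevation) points (hypothetical layout).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, elevation in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return elevation  # query coincides with a sample point
        weight = 1.0 / distance ** power
        weighted_sum += weight * elevation
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one sample is required")
    return weighted_sum / weight_total


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 10.0)]
print(idw(0.5, 0.0, points))
```

Each grid point is estimated independently of the others, which is why IDW lends itself to multi-core parallelization.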
Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena
2017-02-01
In real life, multi-objective engineering design problems are difficult and time-consuming optimization problems because of their high degree of nonlinearity, complexity and inhomogeneity. Nature-inspired multi-objective optimization algorithms are becoming popular for solving them. This paper proposes the original multi-objective Bat algorithm (MOBA) and its extended form, a novel parallel hybrid multi-objective Bat algorithm (PHMOBA), to generate shortest-length Golomb rulers, called optimal Golomb ruler (OGR) sequences, in a reasonable computation time. OGRs are applied in optical wavelength division multiplexing (WDM) systems as a channel-allocation algorithm to reduce four-wave mixing (FWM) crosstalk. The performance of both proposed algorithms in generating OGRs for optical WDM channel allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations show that the proposed PHMOBA works more efficiently than the original MOBA and the other existing algorithms in generating OGRs for optical WDM systems. PHMOBA has a higher convergence and success rate than the original MOBA. The efficiency improvement of the proposed PHMOBA in generating OGRs up to 20 marks, in terms of ruler length and total optical channel bandwidth (TBW), is 100%, whereas for the original MOBA it is 85%. Finally, the implications for further research are discussed.
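A Golomb ruler is a set of integer marks in which every pair of marks has a distinct difference, which is the property that keeps FWM products off the allocated channels. The Python sketch below checks that property and reports the ruler length. It is a validity check only, not the MOBA or PHMOBA search, and the function names are hypothetical.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """True if all pairwise differences between marks are distinct."""
    differences = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(differences) == len(set(differences))


def ruler_length(marks):
    """Distance between the smallest and largest mark."""
    return max(marks) - min(marks)


print(is_golomb_ruler([0, 1, 4, 6]), ruler_length([0, 1, 4, 6]))
```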
Xie, Dexuan; Dash, Ranjan K.; Beard, Daniel A.
2009-01-01
Fast algorithms for simulating mathematical models of coupled blood-tissue transport and metabolism are critical for the analysis of data on transport and reaction in tissues. Here, by combining the method of characteristics with the standard grid discretization technique, a novel algorithm is introduced for solving a general blood-tissue transport and metabolism model governed by a large system of one-dimensional semilinear first order partial differential equations. The key part of the algorithm is to approximate the model as a group of independent ordinary differential equation (ODE) systems such that each ODE system has the same size as the model and can be integrated independently. Thus the method can be easily implemented in parallel on a large scale multiprocessor computer. The accuracy of the algorithm is demonstrated for solving a simple blood-tissue exchange model introduced by Sangren and Sheppard (Bull. Math. Biophys. 15:387–394, 1953), which has an analytical solution. Numerical experiments made on a distributed-memory parallel computer (an HP Linux cluster) and a shared-memory parallel computer (a SGI Origin 2000) demonstrate the parallel efficiency of the algorithm. PMID:20161089
DEFF Research Database (Denmark)
Cao, Bin; Zhao, Jianwei; Yang, Po
2018-01-01
Using immune algorithms is generally a time-intensive process especially for problems with a large number of variables. In this paper, we propose a distributed parallel cooperative coevolutionary multi-objective large-scale immune algorithm that is implemented using the message passing interface (MPI). The proposed algorithm is composed of three layers: objective, group and individual layers. First, for each objective in the multi-objective problem to be addressed, a subpopulation is used for optimization, and an archive population is used to optimize all the objectives. Second, the large ... -objective evolutionary algorithms the Cooperative Coevolutionary Generalized Differential Evolution 3, the Cooperative Multi-objective Differential Evolution and the Nondominated Sorting Genetic Algorithm III, the proposed algorithm addresses the deployment optimization problem efficiently and effectively.
Ferrucci, Filomena; Salza, Pasquale; Sarro, Federica
2017-06-29
The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store
Efficient parallel implementations of approximation algorithms for guarding 1.5D terrains
Directory of Open Access Journals (Sweden)
Goran Martinović
2015-03-01
In the 1.5D terrain guarding problem, an x-monotone polygonal line is defined by k vertices, together with a set G of terrain points, i.e. guards, and a set N of terrain points which the guards are to observe (guard). This involves a weighted version of the guarding problem where the guards in G have weights. The goal is to determine a minimum weight subset of G that covers all the points in N, including a version where the points in N have demands. Furthermore, another goal is to determine the smallest subset of G such that every point in N is observed by the required number of guards. Both problems are NP-hard and have a factor 5 approximation [3, 4]. This paper shows that if a (1+ϵ)-approximate solver for the corresponding linear program is used, for any ϵ > 0, an extra 1+ϵ factor appears in the final approximation factor for both problems. The parallel implementation based on GPU and CPU threads is compared with the Gurobi solver, leading to the conclusion that the respective algorithm outperforms the Gurobi solver on large and dense inputs, typically by one order of magnitude.
DWFS: a wrapper feature selection tool based on a parallel genetic algorithm.
Soufan, Othman; Kleftogiannis, Dimitrios; Kalnis, Panos; Bajic, Vladimir B
2015-01-01
Many scientific problems can be formulated as classification tasks. Data that harbor relevant information are usually described by a large number of features. Frequently, many of these features are irrelevant for the class prediction. The efficient implementation of classification models requires identification of suitable combinations of features. The smaller number of features reduces the problem's dimensionality and may result in higher classification performance. We developed DWFS, a web-based tool that allows for efficient selection of features for a variety of problems. DWFS follows the wrapper paradigm and applies a search strategy based on Genetic Algorithms (GAs). A parallel GA implementation examines and evaluates simultaneously large number of candidate collections of features. DWFS also integrates various filtering methods that may be applied as a pre-processing step in the feature selection process. Furthermore, weights and parameters in the fitness function of GA can be adjusted according to the application requirements. Experiments using heterogeneous datasets from different biomedical applications demonstrate that DWFS is fast and leads to a significant reduction of the number of features without sacrificing performance as compared to several widely used existing methods. DWFS can be accessed online at www.cbrc.kaust.edu.sa/dwfs.
Control Strategy Optimization for Parallel Hybrid Electric Vehicles Using a Memetic Algorithm
Directory of Open Access Journals (Sweden)
Yu-Huei Cheng
2017-03-01
A hybrid electric vehicle (HEV) control strategy is a management approach for generating, using, and saving energy. Therefore, the optimal control strategy is the key to managing hybrid electric vehicles effectively. In order to realize the optimal control strategy, we use a robust evolutionary computation method called a “memetic algorithm” (MA) to optimize the control parameters in parallel HEVs. The “local search” mechanism implemented in the MA greatly enhances its search capabilities. In the implementation of the method, the fitness function is combined with the ADvanced VehIcle SimulatOR (ADVISOR) and is set up according to an electric assist control strategy (EACS) to minimize the fuel consumption (FC) and emissions (HC, CO, and NOx) of the vehicle engine. At the same time, driving performance requirements are also considered in the method. Four different driving cycles, the new European driving cycle (NEDC), the Federal Test Procedure (FTP), the Economic Commission for Europe + Extra-Urban driving cycle (ECE + EUDC), and the urban dynamometer driving schedule (UDDS), are run with the proposed method to find their respective optimal control parameters. The results show that the proposed method effectively helps to reduce fuel consumption and emissions while guaranteeing vehicle performance.
DWFS: A Wrapper Feature Selection Tool Based on a Parallel Genetic Algorithm
Soufan, Othman
2015-02-26
Many scientific problems can be formulated as classification tasks. Data that harbor relevant information are usually described by a large number of features. Frequently, many of these features are irrelevant for the class prediction. The efficient implementation of classification models requires identification of suitable combinations of features. The smaller number of features reduces the problem's dimensionality and may result in higher classification performance. We developed DWFS, a web-based tool that allows for efficient selection of features for a variety of problems. DWFS follows the wrapper paradigm and applies a search strategy based on Genetic Algorithms (GAs). A parallel GA implementation examines and evaluates simultaneously large number of candidate collections of features. DWFS also integrates various filtering methods that may be applied as a pre-processing step in the feature selection process. Furthermore, weights and parameters in the fitness function of GA can be adjusted according to the application requirements. Experiments using heterogeneous datasets from different biomedical applications demonstrate that DWFS is fast and leads to a significant reduction of the number of features without sacrificing performance as compared to several widely used existing methods. DWFS can be accessed online at www.cbrc.kaust.edu.sa/dwfs.
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
Błażewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napierała, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.
PARALLEL QUICK SEARCH ALGORITHM TO SPEED PACKET PAYLOAD FILTERING IN NIDS
Directory of Open Access Journals (Sweden)
ADNAN A. HNAIF
2009-06-01
An Intrusion Detection System (IDS) is a system that detects intruders who try to break into a network and steal information, and reports them to the network administrator. Many tools are used in this field; Snort is considered one of the most widely used tools in Network Intrusion Detection Systems (NIDS). Snort uses its rule sets to determine which packets are allowed to pass and which are rejected, even though string matching consumes 31% of total processing, and 80% of total processing in the case of web-intensive traffic. In this paper, we parallelized the quick search algorithm using OpenMP and Pthreads (POSIX threads) in the C language and compared the two; we determine the required number of threads according to several factors. By doing this, we managed to speed up the filtering process by more than 40%. Finally, we applied the proposed method to NIDS to speed up the matching of incoming packet contents against Snort rule sets.
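The quick search algorithm referred to here is Sunday's variant of Boyer-Moore string matching: after each comparison, the window is shifted according to the text character immediately to the right of it. The Python sketch below shows the sequential core only. The paper's implementation is in C with OpenMP and Pthreads, and neither the thread partitioning nor the Snort integration is reproduced; the function name is hypothetical.

```python
def quick_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # Shift for a character = distance from its rightmost occurrence in the
    # pattern to one position past the pattern end. Later (rightmost)
    # occurrences overwrite earlier ones in the dict comprehension.
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        if pos + m >= n:
            break  # no character to the right of the window
        # Characters absent from the pattern allow the maximum shift, m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return matches


print(quick_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

One plausible way to parallelize this, stated as an assumption rather than as the paper's design, is to split the payload into chunks that overlap by m - 1 characters and search each chunk in its own thread.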
Energy Technology Data Exchange (ETDEWEB)
Carey, G.F.; Young, D.M.
1992-04-01
Research has continued with excellent progress and new results on methodology and algorithms. We have also made supporting benchmark application studies on representative parallel computing architectures. Results from these research studies have been reported at scientific meetings, as technical reports and as journal publications. A list of pertinent presentations and publications is attached. The work on parallel element-by-element techniques and domain decomposition schemes has developed well. In particular, we have focused on the use of finite element spectral methods (or high p methods) on distributed massively parallel systems. The approach has been implemented in a prototype finite element program for solution of coupled Navier Stokes flow and transport processes. This class of problems is of fundamental interest and basic to many "grand challenge" type problems for which parallel supercomputing is warranted.
ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches
National Research Council Canada - National Science Library
Rognes, T
2001-01-01
.... Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel...
Pronk, Sander; Pouya, Iman; Lundborg, Magnus; Rotskoff, Grant; Wesén, Björn; Kasson, Peter M; Lindahl, Erik
2015-06-09
Computational chemistry and other simulation fields are critically dependent on computing resources, but few problems scale efficiently to the hundreds of thousands of processors available in current supercomputers-particularly for molecular dynamics. This has turned into a bottleneck as new hardware generations primarily provide more processing units rather than making individual units much faster, which simulation applications are addressing by increasingly focusing on sampling with algorithms such as free-energy perturbation, Markov state modeling, metadynamics, or milestoning. All these rely on combining results from multiple simulations into a single observation. They are potentially powerful approaches that aim to predict experimental observables directly, but this comes at the expense of added complexity in selecting sampling strategies and keeping track of dozens to thousands of simulations and their dependencies. Here, we describe how the distributed execution framework Copernicus allows the expression of such algorithms in generic workflows: dataflow programs. Because dataflow algorithms explicitly state dependencies of each constituent part, algorithms only need to be described on conceptual level, after which the execution is maximally parallel. The fully automated execution facilitates the optimization of these algorithms with adaptive sampling, where undersampled regions are automatically detected and targeted without user intervention. We show how several such algorithms can be formulated for computational chemistry problems, and how they are executed efficiently with many loosely coupled simulations using either distributed or parallel resources with Copernicus.
Hou, Zhenlong; Huang, Danian
2017-09-01
In this paper, we first study the inversion of probability tomography (IPT) with gravity gradiometry data. The spatial resolution of the results is improved by multi-tensor joint inversion, a depth weighting matrix and other methods. To address the problems brought by big data in exploration, we present a parallel algorithm, together with its performance analysis, that combines the Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) for Graphics Processing Unit (GPU) acceleration. In tests on a synthetic model and on real data from Vinton Dome, we obtain improved results, which shows that the improved inversion algorithm is effective and feasible. The parallel algorithm we designed performs better than other CUDA-based ones, and the maximum speedup can exceed 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are used to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is shown to be able to process larger-scale data, and the new analysis method is practical.
Directory of Open Access Journals (Sweden)
Vohradsky Jiri
2007-02-01
Background: Identification of coordinately regulated genes according to the level of their expression during the time course of a process allows for discovering functional relationships among genes involved in the process. Results: We present a single class classification method for the identification of genes of similar function from a gene expression time series. It is based on a parallel genetic algorithm, which is a supervised computer learning method exploiting prior knowledge of gene function to identify unknown genes of similar function from expression data. The algorithm was tested with a set of randomly generated patterns; the results were compared with seven other classification algorithms including support vector machines. The algorithm avoids several problems associated with unsupervised clustering methods, and it shows better performance than the other algorithms. The algorithm was applied to the identification of secondary metabolite gene clusters of the antibiotic-producing eubacterium Streptomyces coelicolor. The algorithm also identified pathways associated with transport of the secondary metabolites out of the cell. We used the method for the prediction of the functional role of particular ORFs based on the expression data. Conclusion: Through analysis of a time series of gene expression, the algorithm identifies pathways which are directly or indirectly associated with genes of interest, and which are active during the time course of the experiment.
Martyshko, P. S.; Fedorova, N. V.; Akimova, E. N.; Gemaidinov, D. V.
2014-07-01
We describe the parallel algorithms for studying the structural features of the anomalies in the gravity and magnetic fields of the lithosphere, which are based on the height transformations of the data. The algorithms are numerically implemented on the Uran supercomputer. The suggested computer technology is used for constructing the maps of the regional and local anomalies of the magnetic and gravity fields for the northeastern sector of Europe within an area confined between 48°-62° E and 60°-68° N.
Ren, Zhong; Liu, Guodong; Huang, Zhen
2012-11-01
Image reconstruction is a key step in medical imaging (MI), and the performance of the reconstruction algorithm determines the quality and resolution of the reconstructed image. Although other algorithms have been used, the filtered back-projection (FBP) algorithm is still the classical and most commonly used algorithm in clinical MI. In the FBP algorithm, filtering of the original projection data is a key step in suppressing artifacts in the reconstructed image. Classical filters, such as the Shepp-Logan (SL) and Ram-Lak (RL) filters, have drawbacks and limitations in practice, especially for projection data polluted by non-stationary random noise. Therefore, an improved wavelet denoising method combined with the parallel-beam FBP algorithm is used in this paper to enhance the quality of the reconstructed image. In the experiments, the reconstruction results of the improved wavelet denoising were compared with those of other methods (direct FBP, mean filter combined with FBP, and median filter combined with FBP). To determine the optimum reconstruction, different algorithms and different wavelet bases combined with three filters were tested. Experimental results show that the improved FBP algorithm reconstructs better than the others. Comparing the results of the different algorithms on two evaluation standards, mean-square error (MSE) and peak signal-to-noise ratio (PSNR), showed that the improved FBP based on db2 and the Hanning filter at decomposition scale 2 performed best: its MSE was lower and its PSNR higher than the others. Therefore, this improved FBP algorithm has potential value in medical imaging.
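The two evaluation standards used in this abstract can be sketched as follows. The code assumes the common definitions, MSE as the mean squared pixel difference and PSNR = 10 log10(peak^2 / MSE), with a peak value of 255 for 8-bit images. The abstract does not spell out its exact formulas, so these are assumptions, and the function names are hypothetical.

```python
import math


def mse(reference, reconstructed):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(reference) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)


def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(reference, reconstructed)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)


reference = [52, 55, 61, 59]
noisy = [54, 55, 60, 57]
print(mse(reference, noisy), round(psnr(reference, noisy), 2))
```

A lower MSE and a higher PSNR both indicate a reconstruction closer to the reference, which matches how the abstract ranks the filters.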
Directory of Open Access Journals (Sweden)
Tang Jiangwen
2017-08-01
High-resolution synthetic aperture radar presents a significant challenge to imaging algorithms and computing power. Slide spotlight is an important mode that has both high resolution and wide azimuth swath. Generally, in the slide spotlight mode, the performance of conventional frequency domain imaging algorithms degrades because of orbit curvature, the time-variant azimuth chirp rate, and other factors. We adopt the Back-Projection (BP) algorithm in this study to counteract this limitation. We also propose a CPU/GPU heterogeneous BP algorithm to deal with the high computing complexity, O(N^3), of the BP algorithm. This heterogeneous BP algorithm makes full use of computing resources and accelerates the imaging process, and the design of a scheduling thread improves the flexibility of the algorithm.
Shi, Haixiang; Schmidt, Bertil; Liu, Weiguo; Müller-Wittig, Wolfgang
2010-04-01
Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method. This poses challenges for de novo DNA fragment assembly algorithms in terms of both accuracy (to deal with short, error-prone reads) and scalability (to deal with very large input data sets). In this article, we present a scalable parallel algorithm for correcting sequencing errors in high-throughput short-read data so that error-free reads can be available before DNA fragment assembly, which is of high importance to many graph-based short-read assembly tools. The algorithm is based on spectral alignment and uses the Compute Unified Device Architecture (CUDA) programming model. To gain efficiency we are taking advantage of the CUDA texture memory using a space-efficient Bloom filter data structure for spectrum membership queries. We have tested the runtime and accuracy of our algorithm using real and simulated Illumina data for different read lengths, error rates, input sizes, and algorithmic parameters. Using a CUDA-enabled mass-produced GPU (available for less than US$400 at any local computer outlet), this results in speedups of 12-84 times for the parallelized error correction, and speedups of 3-63 times for both sequential preprocessing and parallelized error correction compared to the publicly available Euler-SR program. Our implementation is freely available for download from http://cuda-ec.sourceforge.net .
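The idea of answering spectrum membership queries with a Bloom filter can be sketched on the CPU as follows. This is an illustrative sketch only, not the CUDA implementation described above: the hash construction, the filter size, and the names `BloomFilter` and `build_spectrum` are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Space-efficient set with no false negatives and rare false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_spectrum(reads, k, min_count, num_bits=1 << 16, num_hashes=3):
    """Insert every k-mer seen at least min_count times into a Bloom filter."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    spectrum = BloomFilter(num_bits, num_hashes)
    for kmer, count in counts.items():
        if count >= min_count:
            spectrum.add(kmer)
    return spectrum
```

A read whose k-mers are all reported as members is treated as consistent with the spectrum. Because a Bloom filter can return false positives, a small fraction of erroneous k-mers may be accepted, which is the price paid for the compact memory footprint.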
Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito
2016-11-15
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.
Directory of Open Access Journals (Sweden)
Youxin Luo
2013-01-01
The forward displacement problem of the parallel robot mechanism can be converted to a system of nonlinear equations, but it is very difficult to find all solutions because of the strong coupling of the equations. Because the Newton and quasi-Newton methods yield only one solution and sometimes fail to converge, an algorithm that combines a hyper-chaotic system with the Levenberg-Marquardt-Fletcher (LMF) algorithm is proposed to solve the general 6-6 platform parallel mechanism. This method uses the hyper-chaotic system to produce the initial points of the LMF algorithm, and takes advantage of the characteristics of the chaotic sequence and the LMF algorithm to find all the real solutions. The numerical example shows that the new method works within the initial value range, converges fast, and finds all the real solutions that can be found, which demonstrates its correctness and validity. It provides a new approach to mechanism design.
Hou, Yongchao; Zhao, Yang
2015-01-01
A novel 3-PUU parallel robot is put forward. Kinematic analysis was conducted to obtain its inverse kinematics solution, and on this basis the limitations imposed on the workspace by the sliding pair and the Hooke joint were analyzed. The workspace was then solved through the three-dimensional limit search method, and optimization analysis was performed on the workspace of this parallel robot, laying the foundations for the configuration design and further analysis of the parallel mechanism. The results indicate that this type of robot has promising application prospects. In addition, the workspace after optimization can meet more requirements of patients.
Schwarzbach, Christoph; Börner, Ralph-Uwe; Spitzer, Klaus
2005-09-01
We introduce the concept of multi-objective optimization to cast the regularized inverse direct current resistivity problem into a general formulation. This formulation is suitable for the efficient application of a genetic algorithm, which is known as a global and non-linear optimization tool. The genetic inverse algorithm generates a set of solutions reflecting the trade-off between data misfit and some measure of model features. Examination of such an ensemble is highly preferable to classical approaches where just one `optimal' solution is examined since a better overview over the range of possible inverse models is gained. However, the computational cost to obtain this ensemble is enormous. We demonstrate that at the current state of computer performance inversion of 2-D direct current resistivity data using genetic algorithms is possible if state-of-the-art computational techniques such as parallelization and efficient 2-D forward operators are applied.
Directory of Open Access Journals (Sweden)
Bonfim Amaro Júnior
2017-01-01
The irregular strip packing problem (ISPP) is a class of cutting and packing problem (C&P) in which a set of items with arbitrary formats must be placed in a container with a variable length. The aim of this work is to minimize the area needed to accommodate the given demand. ISPP is present in various types of industries, from manufacturers to exporters (e.g., shipbuilding, clothes, and glass). In this paper, we propose a parallel Biased Random-Key Genetic Algorithm (µ-BRKGA) with multiple populations for the ISPP by applying a collision-free region (CFR) concept as the positioning method, in order to obtain an efficient and fast layout solution. The layout problem for the proposed algorithm is represented by the placement order into the container and the corresponding orientation. In order to evaluate the proposed µ-BRKGA algorithm, computational tests using benchmark problems were applied, analyzed, and compared with different approaches.
Parallelizing a molecular dynamics algorithm on a multiprocessor workstation using OpenMP.
Tarmyshov, Konstantin B; Müller-Plathe, Florian
2005-01-01
The atomistic molecular dynamics program YASP has been parallelized for shared-memory computer architectures. Parallelization was restricted to the most CPU-time-consuming parts: neighbor-list construction, calculation of nonbonded, angle and dihedral forces, and constraints. Most of the sequential FORTRAN code was kept; parallel constructs were inserted as compiler directives using the OpenMP standard. Only in the case of the neighbor list did the data structure have to be changed. The parallel code achieves a useful speedup over the sequential version for systems of several thousand atoms and above. On an IBM Regatta p690+, the throughput increases with the number of processors up to a maximum of 12-16 processors depending on the characteristics of the simulated systems. On dual-processor Xeon systems, the speedup is about 1.7.
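The reported speedup of about 1.7 on dual-processor systems can be put in context with a short calculation. The sketch below assumes Amdahl's law as the performance model, which the abstract does not state; under that assumption, a speedup of 1.7 on two processors corresponds to a parallel efficiency of 85% and a parallelized fraction of roughly 82% of the sequential run time.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a given parallelized fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def parallel_fraction_from_speedup(speedup, processors):
    """Invert Amdahl's law: 1/S = (1 - p) + p/n, solved for p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


def efficiency(speedup, processors):
    """Parallel efficiency: achieved speedup divided by processor count."""
    return speedup / processors


# Values taken from the abstract: speedup of about 1.7 on 2 processors.
p = parallel_fraction_from_speedup(1.7, 2)
print(f"implied parallel fraction: {p:.3f}")
print(f"efficiency on 2 processors: {efficiency(1.7, 2):.2f}")
```

This simple model ignores synchronization and memory-bandwidth effects, so it should be read as a rough consistency check rather than a description of how YASP actually scales.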
An efficient and robust algorithm for parallel groupwise registration of bone surfaces
van de Giessen, Martijn; Vos, Frans M.; Grimbergen, Cornelis A.; van Vliet, Lucas J.; Streekstra, Geert J.
2012-01-01
In this paper a novel groupwise registration algorithm is proposed for the unbiased registration of a large number of densely sampled point clouds. The method fits an evolving mean shape to each of the example point clouds thereby minimizing the total deformation. The registration algorithm
A Scalable O(N) Algorithm for Large-Scale Parallel First-Principles Molecular Dynamics Simulations
Energy Technology Data Exchange (ETDEWEB)
Osei-Kuffuor, Daniel [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Fattebert, Jean-Luc [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2014-01-01
Traditional algorithms for first-principles molecular dynamics (FPMD) simulations only gain a modest capability increase from current petascale computers, due to their O(N^{3}) complexity and their heavy use of global communications. To address this issue, we are developing a truly scalable O(N) complexity FPMD algorithm, based on density functional theory (DFT), which avoids global communications. The computational model uses a general nonorthogonal orbital formulation for the DFT energy functional, which requires knowledge of selected elements of the inverse of the associated overlap matrix. We present a scalable algorithm for approximately computing selected entries of the inverse of the overlap matrix, based on an approximate inverse technique, by inverting local blocks corresponding to principal submatrices of the global overlap matrix. The new FPMD algorithm exploits sparsity and uses nearest neighbor communication to provide a computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic orbitals are confined, and a cutoff beyond which the entries of the overlap matrix can be omitted when computing selected entries of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to O(100K) atoms on O(100K) processors, with a wall-clock time of O(1) minute per molecular dynamics time step.
Debelak, Rudolf; Tran, Ulrich S
2016-01-01
The analysis of polychoric correlations via principal component analysis and exploratory factor analysis are well-known approaches to determine the dimensionality of ordered categorical items. However, the application of these approaches has been considered as critical due to the possible indefiniteness of the polychoric correlation matrix. A possible solution to this problem is the application of smoothing algorithms. This study compared the effects of three smoothing algorithms, based on the Frobenius norm, the adaption of the eigenvalues and eigenvectors, and on minimum-trace factor analysis, on the accuracy of various variations of parallel analysis by the means of a simulation study. We simulated different datasets which varied with respect to the size of the respondent sample, the size of the item set, the underlying factor model, the skewness of the response distributions and the number of response categories in each item. We found that a parallel analysis and principal component analysis of smoothed polychoric and Pearson correlations led to the most accurate results in detecting the number of major factors in simulated datasets when compared to the other methods we investigated. Of the methods used for smoothing polychoric correlation matrices, we recommend the algorithm based on minimum trace factor analysis.
Energy Technology Data Exchange (ETDEWEB)
Carey, G.F.; Young, D.M.
1994-12-31
The focus of the subject DOE sponsored research concerns parallel methods, algorithms, and software for complex applications such as those in coupled fluid flow and heat transfer. The research has been directed principally toward the solution of large-scale PDE problems using iterative solvers for finite differences and finite elements on advanced computer architectures. This work embraces parallel domain decomposition, element-by-element, spectral, and multilevel schemes with adaptive parameter determination, rational iteration and related issues. In addition to the fundamental questions related to developing new methods and mapping these to parallel computers, there are important software issues. The group has played a significant role in the development of software both for iterative solvers and also for finite element codes. The research in computational fluid dynamics (CFD) led to sustained multi-Gigaflop performance rates for parallel-vector computations of realistic large scale applications (not computational kernels alone). The main application areas for these performance studies have been two-dimensional problems in CFD. Over the course of this DOE sponsored research significant progress has been made. A report of the progression of the research is given and at the end of the report is a list of related publications and presentations over the entire grant period.
National Research Council Canada - National Science Library
Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua
2011-01-01
A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single...
Long-Hua Ma; Can-Long Lin; Zong-Tao Li; Tie-Jun Wu
2011-01-01
A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Different from existing algorithms, the updating rates of the coning and sculling compensations a...
Theoretical and Empirical Analysis of a Spatial EA Parallel Boosting Algorithm.
Kamath, Uday; Domeniconi, Carlotta; De Jong, Kenneth
2016-12-16
Many real-world problems involve massive amounts of data. Under these circumstances learning algorithms often become prohibitively expensive, making scalability a pressing issue to be addressed. A common approach is to perform sampling to reduce the size of the dataset and enable efficient learning. Alternatively, one customizes learning algorithms to achieve scalability. In either case, the key challenge is to obtain algorithmic efficiency without compromising the quality of the results. In this article we discuss a meta-learning algorithm (PSBML) that combines concepts from spatially structured evolutionary algorithms (SSEAs) with concepts from ensemble and boosting methodologies to achieve the desired scalability property. We present both theoretical and empirical analyses which show that PSBML preserves a critical property of boosting, specifically, convergence to a distribution centered around the margin. We then present additional empirical analyses showing that this meta-level algorithm provides a general and effective framework that can be used in combination with a variety of learning classifiers. We perform extensive experiments to investigate the trade-off achieved between scalability and accuracy, and robustness to noise, on both synthetic and real-world data. These empirical results corroborate our theoretical analysis, and demonstrate the potential of PSBML in achieving scalability without sacrificing accuracy.
Fast Time and Space Parallel Algorithms for Solution of Parabolic Partial Differential Equations
Fijany, Amir
1993-01-01
In this paper, fast time- and space-parallel algorithms for the solution of linear parabolic PDEs are developed. It is shown that the seemingly strictly serial iterations of the time-stepping procedure for solution of the problem can be completely decoupled.
A hierarchical, automated target recognition algorithm for a parallel analog processor
Woodward, Gail; Padgett, Curtis
1997-01-01
A hierarchical approach is described for an automated target recognition (ATR) system, VIGILANTE, that uses a massively parallel, analog processor (3DANN). The 3DANN processor is capable of performing 64 concurrent inner products of size 1x4096 every 250 nanoseconds.
Ergül, Özgür
2011-11-01
Fast and accurate solutions of large-scale electromagnetics problems involving homogeneous dielectric objects are considered. Problems are formulated with the electric and magnetic current combined-field integral equation and discretized with the Rao-Wilton-Glisson functions. Solutions are performed iteratively by using the multilevel fast multipole algorithm (MLFMA). For the solution of large-scale problems discretized with millions of unknowns, MLFMA is parallelized on distributed-memory architectures using a rigorous technique, namely, the hierarchical partitioning strategy. Efficiency and accuracy of the developed implementation are demonstrated on very large problems involving as many as 100 million unknowns.
Osei-Kuffuor, Daniel; Fattebert, Jean-Luc
2014-01-31
We present the first truly scalable first-principles molecular dynamics algorithm with O(N) complexity and controllable accuracy, capable of simulating systems with finite band gaps of sizes that were previously impossible with this degree of accuracy. By avoiding global communications, we provide a practical computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic wave functions are confined, and a cutoff beyond which the components of the overlap matrix can be omitted when computing selected elements of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to 101,952 atoms on 23,328 processors, with a wall-clock time of the order of 1 min per molecular dynamics time step and numerical error on the forces of less than 7×10^{-4} Ha/Bohr.
Shifting Control Algorithm for a Single-Axle Parallel Plug-In Hybrid Electric Bus Equipped with EMT
Directory of Open Access Journals (Sweden)
Yunyun Yang
2014-01-01
Exploiting the fast response speed of electric motors, an electric-drive automated mechanical transmission (EMT) is proposed as a novel type of transmission in this paper. Replacing the friction synchronization shifting of the automated manual transmission (AMT) in HEVs, the EMT can achieve active synchronization of speed shifting. The dynamic model of a single-axle parallel PHEV equipped with the EMT is built, and the dynamic properties of the gearshift process are also described. In addition, a control algorithm is developed to improve the shifting quality of the PHEV equipped with the EMT in all its evaluation indexes. The key techniques of changing the driving force gradient in the preshifting and shifting compensation phases, as well as of predicting the meshing speed in the gear meshing phase, are also proposed. Results of simulation, bench test, and real road test demonstrate that the proposed control algorithm can noticeably reduce the gearshift jerk and the power interruption time.
Weber, Valéry; Laino, Teodoro; Pozdneev, Alexander; Fedulova, Irina; Curioni, Alessandro
2015-07-14
In this paper, we present a novel, highly efficient, and massively parallel implementation of the sparse matrix-matrix multiplication algorithm inspired by the midpoint method that is suitable for matrices with decay. Compared with the state of the art in sparse matrix-matrix multiplications, the new algorithm heavily exploits data locality, yielding better performance and scalability, approaching a perfect linear scaling up to a process box size equal to a characteristic length that is intrinsic to the matrices. Moreover, the method is able to scale linearly with system size reaching constant time with proportional resources, also regarding memory consumption. We demonstrate how the proposed method can be effectively used for the construction of the density matrix in electronic structure theory, such as Hartree-Fock, density functional theory, and semiempirical Hamiltonians. We present the details of the implementation together with a performance analysis up to 185,193 processes, employing a Hamiltonian matrix generated from a semiempirical NDDO scheme.
Energy Technology Data Exchange (ETDEWEB)
Osei-Kuffuor, Daniel [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Fattebert, Jean-Luc [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2014-01-01
We present the first truly scalable first-principles molecular dynamics algorithm with O(N) complexity and controllable accuracy, capable of simulating systems with finite band gaps of sizes that were previously impossible with this degree of accuracy. By avoiding global communications, we provide a practical computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic wave functions are confined, and a cutoff beyond which the components of the overlap matrix can be omitted when computing selected elements of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to 101 952 atoms on 23 328 processors, with a wall-clock time of the order of 1 min per molecular dynamics time step and numerical error on the forces of less than 7x10^{-4} Ha/Bohr.
Parallel DC3 Algorithm for Suffix Array Construction on Many-Core Accelerators
Liao, Gang
2015-05-01
In bioinformatics applications, suffix arrays are widely used for DNA sequence alignment in the initial exact-match phase of heuristic algorithms. With the exponential growth and availability of data, using many-core accelerators, like GPUs, to optimize existing algorithms is very common. We present a new implementation of suffix array construction on GPU. As a result, suffix array construction on GPU achieves around 10x speedup on standard large data sets, which contain more than 100 million characters. The idea is simple, fast and scalable, and can easily scale to multi-core processors and even heterogeneous architectures. © 2015 IEEE.
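To make the data structure concrete, the sketch below builds a suffix array and uses it for exact pattern matching. It deliberately uses a naive sort of all suffixes rather than the DC3 algorithm named in the title, and it runs serially on the CPU; the function names are hypothetical and chosen for this illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting suffix strings; adequate for short inputs
    only, and not the linear-time DC3 construction.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of exact matches of pattern in text."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])
```

Once the array exists, each exact-match query needs only two binary searches, which is why construction time tends to dominate for inputs of the size mentioned in the abstract.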
New Parallel Algorithms for Structural Analysis and Design of Aerospace Structures
Nguyen, Duc T.
1998-01-01
Subspace and Lanczos iterations have been developed, well documented, and widely accepted as efficient methods for obtaining p-lowest eigen-pair solutions of large-scale, practical engineering problems. The focus of this paper is to incorporate recent developments in vectorized sparse technologies in conjunction with Subspace and Lanczos iterative algorithms for computational enhancements. Numerical performance, in terms of accuracy and efficiency of the proposed sparse strategies for Subspace and Lanczos algorithm, is demonstrated by solving for the lowest frequencies and mode shapes of structural problems on the IBM-R6000/590 and SunSparc 20 workstations.
Cell-based hardware architecture for full-parallel generation algorithm of digital holograms.
Seo, Young-Ho; Choi, Hyun-Jun; Yoo, Ji-Sang; Kim, Dong-Wook
2011-04-25
This paper proposes a new hardware architecture to speed-up the digital hologram calculation by parallel computation. To realize it, we modify the computer-generated hologram (CGH) equation and propose a cell-based very large scale integrated circuit architecture. We induce a new equation to calculate the horizontal or vertical hologram pixel values in parallel, after finding the calculation regularity in the horizontal or vertical direction from the basic CGH equation. We also propose the architecture of the computer-generated hologram cell consisting of an initial parameter calculator and update-phase calculators based on the equation, and then implement them in hardware. Modifying the equation could simplify the hardware, and approximating the cosine function could optimize the hardware. In addition, we show the hardware architecture to parallelize the calculation in the horizontal direction by extending computer-generated holograms. In the experiments, we analyze hardware resource usage and the performance-capability characteristics of the look-up table used in the computer-generated hologram cell. These analyses make it possible to select the amount of hardware to the precision of the results. Here, we used the platform from our previous work for the computer-generated hologram kernel and the structure of the processor.
DEFF Research Database (Denmark)
Dollerup, Niels; Jepsen, Michael S.; Damkilde, Lars
2013-01-01
The article describes a robust and effective implementation of the interior point optimization algorithm. The adopted method includes a precalculation step, which reduces the number of variables by fulfilling the equilibrium equations a priori. This work presents an improved implementation of the ...
A Non-static Data Layout Enhancing Parallelism and Vectorization in Sparse Grid Algorithms
Buse, Gerrit
2012-06-01
The name sparse grids denotes a highly space-efficient, grid-based numerical technique to approximate high-dimensional functions. Although employed in a broad spectrum of applications from different fields, there have been only a few attempts to use it in real-time visualization (e.g. [1]), due to complex data structures and long algorithm runtime. In this work we present a novel approach inspired by principles of I/O-efficient algorithms. Locally applied coefficient permutations lead to improved cache performance and facilitate the use of vector registers for our sparse grid benchmark problem, hierarchization. Based on the compact data structure proposed for regular sparse grids in [2], we developed a new algorithm that outperforms existing implementations on modern multi-core systems by a factor of 37 for a grid size of 127 million points. For larger problems the speedup increases further, and with execution times below 1 s, sparse grids are well-suited for visualization applications. Furthermore, we point out how a broad class of sparse grid algorithms can benefit from our approach. © 2012 IEEE.
A fine-grained parallel algorithm for the cyclic flexible job shop problem
Directory of Open Access Journals (Sweden)
Bożejko Wojciech
2017-06-01
This paper considers a flexible job shop problem of operations scheduling. A new, very fast method for determining the cycle time is presented. The heuristic algorithm is designed around a neighborhood inspired by the game of golf, and a lower bound of the criterion function is used in searching the neighborhood.
Ergül, Özgür; Gürel, Levent
2013-03-01
Accurate electromagnetic modeling of complicated optical structures poses several challenges. Optical metamaterial and plasmonic structures are composed of multiple coexisting dielectric and/or conducting parts. Such composite structures may possess diverse values of conductivities and dielectric constants, including negative permittivity and permeability. Further challenges are the large sizes of the structures with respect to wavelength and the complexities of the geometries. In order to overcome these challenges and to achieve rigorous and efficient electromagnetic modeling of three-dimensional optical composite structures, we have developed a parallel implementation of the multilevel fast multipole algorithm (MLFMA). Precise formulation of composite structures is achieved with the so-called "electric and magnetic current combined-field integral equation." Surface integral equations are carefully discretized with piecewise linear basis functions, and the ensuing dense matrix equations are solved iteratively with parallel MLFMA. The hierarchical strategy is used for the efficient parallelization of MLFMA on distributed-memory architectures. In this paper, fast and accurate solutions of large-scale canonical and complicated real-life problems, such as optical metamaterials, discretized with tens of millions of unknowns are presented in order to demonstrate the capabilities of the proposed electromagnetic solver.
Al-Refaie, Ahmed F.; Tennyson, Jonathan
2017-12-01
Construction and diagonalization of the Hamiltonian matrix is the rate-limiting step in most low-energy electron - molecule collision calculations. Tennyson (1996) implemented a novel algorithm for Hamiltonian construction which took advantage of the structure of the wavefunction in such calculations. This algorithm is re-engineered to make use of modern computer architectures and the use of appropriate diagonalizers is considered. Test calculations demonstrate that significant speed-ups can be gained using multiple CPUs. This opens the way to calculations which consider higher collision energies, larger molecules and / or more target states. The methodology, which is implemented as part of the UK molecular R-matrix codes (UKRMol and UKRMol+) can also be used for studies of bound molecular Rydberg states, photoionization and positron-molecule collisions.
Chang, Weng-Long
2012-03-01
Assume that n is a positive integer. If there is an integer M such that M^2 ≡ C (mod n), i.e., the congruence has a solution, then C is said to be a quadratic congruence (mod n). If the congruence does not have a solution, then C is said to be a quadratic noncongruence (mod n). The task of solving the problem is central to many important applications, the most obvious being cryptography. In this article, we describe a DNA-based algorithm for solving quadratic congruence and factoring integers. In addition to this novel contribution, we also show the utility of our encoding scheme, and of the algorithm's submodules. We demonstrate how a variety of arithmetic, shift and comparison operations, namely bitwise and full addition, subtraction, left shift and comparison, can be performed using strands of DNA.
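The definition above, and its link to factoring, can be checked on small numbers with a conventional brute-force search. This sketch is a classical illustration of the problem statement only, not the DNA-based algorithm of the article; the function names are hypothetical.

```python
from math import gcd


def quadratic_solutions(c, n):
    """Return every M in [0, n) with M^2 congruent to C modulo n."""
    return [m for m in range(n) if (m * m - c) % n == 0]


def is_quadratic_congruence(c, n):
    """True if the congruence M^2 = C (mod n) has at least one solution."""
    return bool(quadratic_solutions(c, n))


def factor_from_roots(n, c):
    """Try to split n using two square roots of C that are not negatives.

    If a^2 = b^2 (mod n) with a != +/-b (mod n), then gcd(|a - b|, n) is a
    nontrivial factor of n. Returns None when no such pair exists.
    """
    roots = quadratic_solutions(c, n)
    for a in roots:
        for b in roots:
            if a != b and (a + b) % n != 0:
                g = gcd(abs(a - b), n)
                if 1 < g < n:
                    return g
    return None
```

For example, with n = 15 and C = 1 the roots are 1, 4, 11 and 14, and the pair (1, 4) yields the factor 3. The exhaustive search takes time proportional to n, which is exactly the cost that makes the problem hard at cryptographic sizes.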
2014-07-31
and multi-core configurations [4], [11], [12]. (Completed) c) Find out the limiting computational factors (length, memory, inter-processor ...11], [12]. (Completed) g) Increase the number of minority students that are involved and/or are aware of issues relating DSP algorithms and...our Hardware/Software Testbed Hardware (Cluster): Updated and made operational a 64-processor, 16-node Dell cluster purchased under a previous
Shoemaker, C. A.; Pang, M.; Akhtar, T.; Bindel, D.
2016-12-01
New parallel surrogate global optimization algorithms are developed and applied to objective functions that are expensive simulations (possibly with multiple local minima). The algorithms can be applied to most geophysical simulations, including those with nonlinear partial differential equations. The optimization does not require simulations be parallelized. Asynchronous (and synchronous) parallel execution is available in the optimization toolbox "pySOT". The parallel algorithms are modified from serial to eliminate fine grained parallelism. The optimization is computed with open source software pySOT, a Surrogate Global Optimization Toolbox that allows user to pick the type of surrogate (or ensembles), the search procedure on surrogate, and the type of parallelism (synchronous or asynchronous). pySOT also allows the user to develop new algorithms by modifying parts of the code. In the applications here, the objective function takes up to 30 minutes for one simulation, and serial optimization can take over 200 hours. Results from Yellowstone (NSF) and NCSS (Singapore) supercomputers are given for groundwater contaminant hydrology simulations with applications to model parameter estimation and decontamination management. All results are compared with alternatives. The first results are for optimization of pumping at many wells to reduce cost for decontamination of groundwater at a superfund site. The optimization runs with up to 128 processors. Superlinear speed up is obtained for up to 16 processors, and efficiency with 64 processors is over 80%. Each evaluation of the objective function requires the solution of nonlinear partial differential equations to describe the impact of spatially distributed pumping and model parameters on model predictions for the spatial and temporal distribution of groundwater contaminants. The second application uses an asynchronous parallel global optimization for groundwater quality model calibration. The time for a single objective
Trobec, Roman
2015-01-01
This book is concentrated on the synergy between computer science and numerical analysis. It is written to provide a firm understanding of the described approaches to computer scientists, engineers or other experts who have to solve real problems. The meshless solution approach is described in more detail, with a description of the required algorithms and the methods that are needed for the design of an efficient computer program. Most of the details are demonstrated on solutions of practical problems, from basic to more complicated ones. This book will be a useful tool for any reader interes
Pelletier, Mathew G
2008-02-08
One of the main hurdles standing in the way of optimal cleaning of cotton lint is the lack of sensing systems that can react fast enough to provide the control system with real-time information as to the level of trash contamination of the cotton lint. This research examines the use of programmable graphic processing units (GPU) as an alternative to the PC's traditional use of the central processing unit (CPU). The use of the GPU, as an alternative computation platform, allowed for the machine vision system to gain a significant improvement in processing time. By improving the processing time, this research seeks to address the lack of availability of rapid trash sensing systems and thus alleviate a situation in which the current systems view the cotton lint either well before, or after, the cotton is cleaned. This extended lag/lead time that is currently imposed on the cotton trash cleaning control systems, is what is responsible for system operators utilizing a very large dead-band safety buffer in order to ensure that the cotton lint is not under-cleaned. Unfortunately, the utilization of a large dead-band buffer results in the majority of the cotton lint being over-cleaned which in turn causes lint fiber-damage as well as significant losses of the valuable lint due to the excessive use of cleaning machinery. This research estimates that upwards of a 30% reduction in lint loss could be gained through the use of a tightly coupled trash sensor to the cleaning machinery control systems. This research seeks to improve processing times through the development of a new algorithm for cotton trash sensing that allows for implementation on a highly parallel architecture. Additionally, by moving the new parallel algorithm onto an alternative computing platform, the graphic processing unit "GPU", for processing of the cotton trash images, a speed up of over 6.5 times, over optimized code running on the PC's central processing unit "CPU", was gained. The new parallel algorithm operating on the
Directory of Open Access Journals (Sweden)
Mathew G. Pelletier
2008-02-01
Full Text Available One of the main hurdles standing in the way of optimal cleaning of cotton lint is the lack of sensing systems that can react fast enough to provide the control system with real-time information as to the level of trash contamination of the cotton lint. This research examines the use of programmable graphic processing units (GPU) as an alternative to the PC's traditional use of the central processing unit (CPU). The use of the GPU, as an alternative computation platform, allowed for the machine vision system to gain a significant improvement in processing time. By improving the processing time, this research seeks to address the lack of availability of rapid trash sensing systems and thus alleviate a situation in which the current systems view the cotton lint either well before, or after, the cotton is cleaned. This extended lag/lead time that is currently imposed on the cotton trash cleaning control systems, is what is responsible for system operators utilizing a very large dead-band safety buffer in order to ensure that the cotton lint is not under-cleaned. Unfortunately, the utilization of a large dead-band buffer results in the majority of the cotton lint being over-cleaned which in turn causes lint fiber-damage as well as significant losses of the valuable lint due to the excessive use of cleaning machinery. This research estimates that upwards of a 30% reduction in lint loss could be gained through the use of a tightly coupled trash sensor to the cleaning machinery control systems. This research seeks to improve processing times through the development of a new algorithm for cotton trash sensing that allows for implementation on a highly parallel architecture. Additionally, by moving the new parallel algorithm onto an alternative computing platform, the graphic processing unit "GPU", for processing of the cotton trash images, a speed up of over 6.5 times, over optimized code running on the PC's central processing
Development and evaluation of a scheduling algorithm for parallel hardware tests at CERN
Galetzka, Michael
This thesis aims at describing the problem of scheduling, evaluating different scheduling algorithms and comparing them with each other as well as with the current prototype solution. The implementation of the final solution will be delineated, as will the design considerations that led to it. The CERN Large Hadron Collider (LHC) has to deal with unprecedented stored energy, both in its particle beams and its superconducting magnet circuits. This energy could result in major equipment damage and downtime if it is not properly extracted from the machine. Before commissioning the machine with the particle beam, several thousands of tests have to be executed, analyzed and tracked to assess the proper functioning of the equipment and protection systems. These tests access the accelerator's equipment in order to verify the correct behavior of all systems, such as magnets, power converters and interlock controllers. A test could, for example, ramp the magnet to a certain energy level and then provoke an emergency...
Niederhauser, Thomas; Wyss-Balmer, Thomas; Haeberlin, Andreas; Marisa, Thanks; Wildhaber, Reto A; Goette, Josef; Jacomet, Marcel; Vogel, Rolf
2015-06-01
Long-term electrocardiogram (ECG) often suffers from relevant noise. Baseline wander in particular is pronounced in ECG recordings using dry or esophageal electrodes, which are dedicated for prolonged registration. While analog high-pass filters introduce phase distortions, reliable offline filtering of the baseline wander implies a computational burden that has to be put in relation to the increase in signal-to-baseline ratio (SBR). Here, we present a graphics processor unit (GPU)-based parallelization method to speed up offline baseline wander filter algorithms, namely the wavelet, finite, and infinite impulse response, moving mean, and moving median filter. Individual filter parameters were optimized with respect to the SBR increase based on ECGs from the Physionet database superimposed to autoregressive modeled, real baseline wander. A Monte-Carlo simulation showed that for low input SBR the moving median filter outperforms any other method but negatively affects ECG wave detection. In contrast, the infinite impulse response filter is preferred in case of high input SBR. However, the parallelized wavelet filter is processed 500 and four times faster than these two algorithms on the GPU, respectively, and offers superior baseline wander suppression in low SBR situations. Using a signal segment of 64 mega samples that is filtered as entire unit, wavelet filtering of a seven-day high-resolution ECG is computed within less than 3 s. Taking the high filtering speed into account, the GPU wavelet filter is the most efficient method to remove baseline wander present in long-term ECGs, with which computational burden can be strongly reduced.
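As an illustration of one of the filters named in this abstract, the following is a minimal sketch of moving-median baseline removal on a synthetic signal. It is not the authors' GPU implementation: the function names, the 250 Hz sampling rate, the 101-sample window, and the sine-wave drift are all assumptions chosen for the example.

```python
import numpy as np


def moving_median_baseline(signal, window):
    """Estimate the baseline as the median over a centered sliding window."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centered")
    half = window // 2
    # Repeat the edge samples so the output has the same length as the input.
    padded = np.pad(signal, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=1)


def remove_baseline(signal, window):
    """Subtract the moving-median baseline estimate from the signal."""
    return signal - moving_median_baseline(signal, window)


# Synthetic test signal (hypothetical): narrow spikes riding on a slow drift.
fs = 250                                    # assumed sampling rate in Hz
t = np.arange(2500) / fs                    # 10 seconds of samples
spikes = np.zeros_like(t)
spikes[fs // 2::fs] = 1.0                   # one single-sample spike per second
drift = 0.5 * np.sin(2 * np.pi * 0.1 * t)   # slow baseline wander
filtered = remove_baseline(spikes + drift, window=101)
```

Each output sample depends only on a fixed window of input samples, so the windows can be processed independently of one another, which is the property that makes this kind of filter a natural candidate for data-parallel hardware.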
Sachetto Oliveira, Rafael; Martins Rocha, Bernardo; Burgarelli, Denise; Meira, Wagner; Constantinides, Christakis; Weber Dos Santos, Rodrigo
2018-02-01
The use of computer models as a tool for the study and understanding of the complex phenomena of cardiac electrophysiology has attained increased importance nowadays. At the same time, the increased complexity of the biophysical processes translates into complex computational and mathematical models. To speed up cardiac simulations and to allow more precise and realistic uses, 2 different techniques have been traditionally exploited: parallel computing and sophisticated numerical methods. In this work, we combine a modern parallel computing technique based on multicore and graphics processing units (GPUs) and a sophisticated numerical method based on a new space-time adaptive algorithm. We evaluate each technique alone and in different combinations: multicore and GPU, multicore and GPU and space adaptivity, multicore and GPU and space adaptivity and time adaptivity. All the techniques and combinations were evaluated under different scenarios: 3D simulations on slabs, 3D simulations on a ventricular mouse mesh, ie, complex geometry, sinus-rhythm, and arrhythmic conditions. Our results suggest that multicore and GPU accelerate the simulations by an approximate factor of 33×, whereas the speedups attained by the space-time adaptive algorithms were approximately 48. Nevertheless, by combining all the techniques, we obtained speedups that ranged between 165 and 498. The tested methods were able to reduce the execution time of a simulation by more than 498× for a complex cellular model in a slab geometry and by 165× in a realistic heart geometry simulating spiral waves. The proposed methods will allow faster and more realistic simulations in a feasible time with no significant loss of accuracy. Copyright © 2017 John Wiley & Sons, Ltd.
1982-01-01
Parallel Computations focuses on parallel computation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallel computers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed. Comprised of 13 chapters, this volume begins by classifying parallel computers and describing techn
Directory of Open Access Journals (Sweden)
Liu Tian-Yuan
2016-01-01
Full Text Available The blade is one of the core components of turbine machinery, and its reliability is directly related to the normal operation of the plant unit. However, with the increase of blade length and flow rate, non-linear effects such as finite deformation must be considered in strength computation to guarantee sufficient accuracy. Parallel computation is adopted to improve the efficiency of the classical nonlinear finite element method and to shorten the blade design period, which is of great importance for engineering practice. In this paper, the dynamic partial differential equations and the finite element method forms for turbine blades under centrifugal load and flow load are first given. Then, according to the characteristics of the turbine blade model, the classical method is optimized based on central processing unit + graphics processing unit heterogeneous parallel computation. Finally, numerical experiment validations are performed. The computation speed of the algorithm proposed in this paper is compared with the speed of ANSYS. For the rectangle plate model with mesh number of 10 k to 4000 k, a maximum speed-up of 4.31 can be obtained. For the real blade-rim model with mesh number of 500 k, a speed-up of 4.54 times can be obtained.
Energy Technology Data Exchange (ETDEWEB)
Tavakkoli-Moghaddam, R. [Department of Industrial Engineering, Faculty of Engineering, University of Tehran, P.O. Box 11365/4563, Tehran (Iran, Islamic Republic of); Department of Mechanical Engineering, The University of British Columbia, Vancouver (Canada)], E-mail: tavakoli@ut.ac.ir; Safari, J. [Department of Industrial Engineering, Science and Research Branch, Islamic Azad University, Tehran (Iran, Islamic Republic of)], E-mail: jalalsafari@pideco.com; Sassani, F. [Department of Mechanical Engineering, The University of British Columbia, Vancouver (Canada)], E-mail: sassani@mech.ubc.ca
2008-04-15
This paper proposes a genetic algorithm (GA) for a redundancy allocation problem for the series-parallel system when the redundancy strategy can be chosen for individual subsystems. The majority of the solution methods for the general redundancy allocation problems assume that the redundancy strategy for each subsystem is predetermined and fixed. In general, active redundancy has received more attention in the past. However, in practice both active and cold-standby redundancies may be used within a particular system design and the choice of the redundancy strategy becomes an additional decision variable. Thus, the problem is to select the best redundancy strategy, component, and redundancy level for each subsystem in order to maximize the system reliability under system-level constraints. This belongs to the NP-hard class of problems. Due to its complexity, it is difficult to solve such a problem optimally by using traditional optimization tools. It is demonstrated in this paper that GA is an efficient method for solving this type of problem. Finally, computational results for a typical scenario are presented and the robustness of the proposed algorithm is discussed.
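To make the objective in this abstract concrete, here is a small sketch of the reliability of a series-parallel system under active redundancy only, together with an exhaustive search over redundancy levels for a toy instance. It is not the paper's GA and it ignores the cold-standby option and component choice; the function names, the single cost constraint, and the numbers are hypothetical.

```python
from itertools import product
from math import prod


def system_reliability(r, n):
    """Reliability of subsystems in series, each with n[i] identical components
    of reliability r[i] in active (parallel) redundancy."""
    return prod(1.0 - (1.0 - ri) ** ni for ri, ni in zip(r, n))


def best_allocation(r, cost, budget, max_level=4):
    """Brute-force search over redundancy levels under one cost constraint.

    Exhaustive enumeration is only feasible for tiny instances; it stands in
    here for a heuristic search such as a GA.
    """
    best_rel, best_n = 0.0, None
    for n in product(range(1, max_level + 1), repeat=len(r)):
        total_cost = sum(c * k for c, k in zip(cost, n))
        if total_cost <= budget:
            rel = system_reliability(r, n)
            if rel > best_rel:
                best_rel, best_n = rel, n
    return best_rel, best_n


# Toy instance: two subsystems, unit component cost, budget of three components.
rel, levels = best_allocation(r=[0.9, 0.8], cost=[1, 1], budget=3, max_level=2)
```

The search space grows as max_level raised to the number of subsystems, which is why enumeration stops being practical quickly and heuristic methods become attractive.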
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Yasuhiro MATSUNAGA; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-01-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP...
Torner, François M.; Karatas, Abdullah; Eifler, Matthias; Raid, Indek; Seewig, Jörg
2017-06-01
In today's industry, complex surfaces with material contrasts or surface coatings are present and represent a challenge for optical topography measuring instruments. The reason is that varying optical properties lead to phase jumps and to topography deviations when the surface height is evaluated. Thus, Ellipso-Height-Topometry as a measurement technique which can measure both topography and material properties of technical surfaces was proposed in order to achieve a correction of the occurring topographic artefacts. The height correction value can be obtained for the compensation of material-induced height deviations and the thickness of surface layers can be evaluated. Currently, it is possible to calculate the surface characteristics from ellipsometric parameters for at most two layers. However, the described height corrections are only possible when well-defined and realistic models of surface layers can be utilized, e.g. a given set of homogeneous oxide layers. Oxidation effects however describe statistical processes which can be predicted with underlying material distribution models. This leads to an uncertainty in ellipsometry, which is considered with a new approach that will be discussed in this publication. Therefore, an extended multi-layer approach which is capable of handling additional layers based on a parallelized algorithm using graphic processing units and the commonly known CUDA technology is proposed. This algorithm can also be used to consider material proportions which result from oxidation effects in z direction. The new approach for the Ellipso-Height-Topometry measurement technique is compared with the current procedures which often neglect the existence of an oxide layer for the basic material. To experimentally verify the approach and the corresponding algorithm, it is applied for the evaluation of actual surfaces with multiple plane layers and different materials. Test samples with different materials are used in order to evaluate the complex
Guermond, J. L.
2011-05-04
The purpose of this paper is to validate a new highly parallelizable direction splitting algorithm. The parallelization capabilities of this algorithm are illustrated by providing a highly accurate solution for the start-up flow in a three-dimensional impulsively started lid-driven cavity of aspect ratio 1×1×2 at Reynolds numbers 1000 and 5000. The computations are done in parallel (up to 1024 processors) on adapted grids of up to 2 billion nodes in three space dimensions. Velocity profiles are given at dimensionless times t=4, 8, and 12; at least four digits are expected to be correct at Re=1000. © 2011 John Wiley & Sons, Ltd.
Metcalfe, A. G.; Bodenheimer, R. E.
1976-01-01
A parallel algorithm for counting the number of logic-1 elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
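One common way to count the 1-valued elements of a binary image in parallel is a pairwise (tree) reduction, sketched below. This is a generic illustration and not the combinational Tse structure the abstract describes; the function name is hypothetical.

```python
import numpy as np


def count_ones_tree(bits):
    """Count 1-valued elements by repeatedly summing adjacent pairs.

    Each pass halves the number of partial sums, so n inputs need about
    log2(n) passes; all additions within a pass are independent of each other.
    """
    level = np.asarray(bits, dtype=np.int64).ravel()
    while level.size > 1:
        if level.size % 2:
            level = np.append(level, 0)      # pad odd-length levels with a zero
        level = level[0::2] + level[1::2]    # one pass of pairwise additions
    return int(level[0]) if level.size else 0


image = np.array([[1, 0, 1],
                  [1, 1, 0]])
total = count_ones_tree(image)
```

Because the additions inside each pass do not depend on one another, they could be assigned to separate processing elements, leaving only the number of passes as the sequential depth.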
Directory of Open Access Journals (Sweden)
Feng Gu
2012-01-01
Full Text Available The purpose of this paper is to establish a strong convergence of a new parallel iterative algorithm with mean errors to a common fixed point for two finite families of Ćirić quasi-contractive operators in normed spaces. The results presented in this paper generalize and improve the corresponding results of Berinde, Gu, Rafiq, Rhoades, and Zamfirescu.
Liu, Yang
2014-07-01
The computational complexity and memory requirements of classically formulated marching-on-in-time (MOT)-based surface integral equation (SIE) solvers scale as O(N_t N_s^2) and O(N_s^2), respectively; here N_t and N_s denote the number of temporal and spatial degrees of freedom of the current density. The multilevel plane wave time domain (PWTD) algorithm, viz., the time domain counterpart of the multilevel fast multipole method, reduces these costs to O(N_t N_s log^2 N_s) and O(N_s^1.5) (Ergin et al., IEEE Trans. Antennas Mag., 41, 39-52, 1999). Previously, PWTD-accelerated MOT-SIE solvers have been used to analyze transient scattering from perfect electrically conducting (PEC) and homogeneous dielectric objects discretized in terms of a million spatial unknowns (Shanker et al., IEEE Trans. Antennas Propag., 51, 628-641, 2003). More recently, an efficient parallelized solver that employs an advanced hierarchical and provably scalable spatial, angular, and temporal load partitioning strategy has been developed to analyze transient scattering problems that involve ten million spatial unknowns (Liu et al., in URSI Digest, 2013).
Cickovski, Trevor; Flor, Tiffany; Irving-Sachs, Galen; Novikov, Philip; Parda, James; Narasimhan, Giri
2015-01-01
In order to make multiple copies of a target sequence in the laboratory, the technique of Polymerase Chain Reaction (PCR) requires the design of "primers", which are short fragments of nucleotides complementary to the flanking regions of the target sequence. If the same primer is to amplify multiple closely related target sequences, then it is necessary to make the primers "degenerate", which would allow it to hybridize to target sequences with a limited amount of variability that may have been caused by mutations. However, the PCR technique can only allow a limited amount of degeneracy, and therefore the design of degenerate primers requires the identification of reasonably well-conserved regions in the input sequences. We take an existing algorithm for designing degenerate primers that is based on clustering and parallelize it in a web-accessible software package GPUDePiCt, using a shared memory model and the computing power of Graphics Processing Units (GPUs). We test our implementation on large sets of aligned sequences from the human genome and show a multi-fold speedup for clustering using our hybrid GPU/CPU implementation over a pure CPU approach for these sequences, which consist of more than 7,500 nucleotides. We also demonstrate that this speedup is consistent over larger numbers and longer lengths of aligned sequences.
Indian Academy of Sciences (India)
positive numbers. The word 'algorithm' was most often associated with this algorithm till 1950. It may however be pointed out that several non-trivial algorithms such as synthetic (polynomial) division have been found in Vedic Mathematics which are dated much before Euclid's algorithm. A programming language is used.
Tahara, Tatsuki; Ito, Kenichi; Kakue, Takashi; Fujii, Motofumi; Shimozato, Yuki; Awatsuji, Yasuhiro; Nishio, Kenzo; Ura, Shogo; Kubota, Toshihiro; Matoba, Osamu
2011-03-01
We propose an algorithm for compensating the phase-shift error of polarization-based parallel two-step phase-shifting digital holography, which is a technique for recording a spatial two-step phase-shifted hologram. Although a polarization-based system of the technique has been experimentally demonstrated, there had been the problem that the phase difference of two phase-shifted holograms had been changed by the extinction ratio of the micropolarizer array attached to the image sensor used in the system. To improve the performance of the system, we established and formulated an algorithm for compensating the phase-shift error. Accurate spatial phase-shifting interferometry in the system can be conducted by the algorithm regardless of phase-shift error due to the extinction ratio. By the numerical simulation, the proposed algorithm was capable of reducing the root mean square errors of the reconstructed image by 1/4 and 1/5 in amplitude and phase, respectively. Also, the algorithm was experimentally demonstrated, and the experimental results showed that the system employing the proposed algorithm suppressed the conjugate image, which slightly appeared in the image reconstructed by the system not employing the algorithm, even when the extinction ratio was 10:1. Thus, the effectiveness of the proposed algorithm was numerically and experimentally verified. © 2010 Optical Society of America
Cabria, I.; Queiruga, D.
2005-09-01
A parallel algorithm for molecular dynamics, MD, the Koradi point-centered decomposition algorithm, especially designed for inhomogeneous systems, is improved and applied to the organization and optimization of recycling costs of Waste Electrical and Electronic Equipment, WEEE, and also to systems of atoms. This organization requires the numbers and locations of storage centers and recycling plants of the WEEE that minimize the recycling cost. The Koradi algorithm finds these optimal numbers and locations, dealing very fast with large numbers of data, in contrast with other methods. The changes of the original algorithm (different ways of generating the initial centers and especially the requirement of location convergence) improve its performance for this economic problem and also for MD simulations.
Indian Academy of Sciences (India)
In the description of algorithms and programming languages, what is the role of control abstraction? • What are the inherent limitations of the algorithmic processes? In future articles in this series, we will show that these constructs are powerful and can be used to encode any algorithm. In the next article, we will discuss ...
Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
2017-05-01
Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-01-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310–323. doi: 10.1002/wcms.1220 PMID:26753008
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-07-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.
Díaz-Mojica, J. J.; Cruz-Atienza, V. M.; Madariaga, R.; Singh, S. K.; Iglesias, A.
2013-05-01
We introduce a novel approach for imaging earthquake dynamics from ground motion records based on a parallel genetic algorithm (GA). The method follows the elliptical dynamic-rupture-patch approach introduced by Di Carli et al. (2010) and has been carefully verified through different numerical tests (Díaz-Mojica et al., 2012). Apart from the five model parameters defining the patch geometry, our dynamic source description has four more parameters: the stress drop inside the nucleation and the elliptical patches; and two friction parameters, the slip weakening distance and the change of the friction coefficient. These parameters are constant within the rupture surface. The forward dynamic source problem, involved in the GA inverse method, uses a highly accurate computational solver for the problem, namely the staggered-grid split-node. The synthetic inversion presented here shows that the source model parameterization is suitable for the GA, and that short-scale source dynamic features are well resolved in spite of low-pass filtering of the data for periods comparable to the source duration. Since there is always uncertainty in the propagation medium as well as in the source location and the focal mechanisms, we have introduced a statistical approach to generate a set of solution models so that the envelope of the corresponding synthetic waveforms explains as much as possible the observed data. We applied the method to the 2012 Mw6.5 intraslab Zumpango, Mexico earthquake and determined several fundamental source parameters that are in accordance with different and completely independent estimates for Mexican and worldwide earthquakes. Our weighted-average final model satisfactorily explains eastward rupture directivity observed in the recorded data. Some parameters found for the Zumpango earthquake are: Δτ = 30.2 ± 6.2 MPa, Er = (0.68 ± 0.36) × 10^15 J, G = (1.74 ± 0.44) × 10^15 J, η = 0.27 ± 0.11, Vr/Vs = 0.52 ± 0.09 and Mw = 6.64 ± 0.07; for the stress drop
Indian Academy of Sciences (India)
, i is referred to as the loop-index, 'stat-body' is any sequence of ... while i ≤ N do stat-body; i := i + 1; endwhile. The algorithm for sorting the numbers is described in Table 1 and the algorithmic steps on a list of 4 numbers shown in Figure 1.
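The while construct quoted in this fragment can be written out as a short sketch. Two things are assumptions here because the surrounding text is elided: that the loop index starts at 1, and that the loop condition reads "i is at most N". The function name and the use of a callable for 'stat-body' are hypothetical.

```python
def run_counted_loop(n, stat_body):
    """Run stat_body once for each loop-index value i = 1, 2, ..., n."""
    i = 1                     # assumed initial value of the loop index
    while i <= n:             # assumed reading of the loop condition
        stat_body(i)          # 'stat-body': any sequence of statements
        i = i + 1             # explicit increment, as in the quoted construct
    return i                  # value of the index after the loop ends


visited = []
final_index = run_counted_loop(4, visited.append)
```

With n = 4 the body runs for the index values 1 through 4, and the index holds n + 1 when the loop ends.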
Sourbier, Florent; Operto, Stéphane; Virieux, Jean; Amestoy, Patrick; L'Excellent, Jean-Yves
2009-03-01
This is the first paper in a two-part series that describes a massively parallel code that performs 2D frequency-domain full-waveform inversion of wide-aperture seismic data for imaging complex structures. Full-waveform inversion methods, namely quantitative seismic imaging methods based on the resolution of the full wave equation, are computationally expensive. Therefore, designing efficient algorithms which take advantage of parallel computing facilities is critical for the appraisal of these approaches when applied to representative case studies and for further improvements. Full-waveform modelling requires the resolution of a large sparse system of linear equations which is performed with the massively parallel direct solver MUMPS for efficient multiple-shot simulations. Efficiency of the multiple-shot solution phase (forward/backward substitutions) is improved by using the BLAS3 library. The inverse problem relies on a classic local optimization approach implemented with a gradient method. The direct solver returns the multiple-shot wavefield solutions distributed over the processors according to a domain decomposition driven by the distribution of the LU factors. The domain decomposition of the wavefield solutions is used to compute in parallel the gradient of the objective function and the diagonal Hessian, this latter providing a suitable scaling of the gradient. The algorithm allows one to test different strategies for multiscale frequency inversion ranging from successive mono-frequency inversion to simultaneous multifrequency inversion. These different inversion strategies will be illustrated in the following companion paper. The parallel efficiency and the scalability of the code will also be quantified.
Narayanan, Kiran
2012-07-17
A hybrid parallelization method composed of a coarse-grained genetic algorithm (GA) and fine-grained objective function evaluations is implemented on a heterogeneous computational resource consisting of 16 IBM Blue Gene/P racks, a single x86 cluster node and a high-performance file system. The GA iterator is coupled with a finite-element (FE) analysis code developed in house to facilitate computational steering in order to calculate the optimal impact velocities of a projectile colliding with a polyurea/structural steel composite plate. The FE code is capable of capturing adiabatic shear bands and strain localization, which are typically observed in high-velocity impact applications, and it includes several constitutive models of plasticity, viscoelasticity and viscoplasticity for metals and soft materials, which allow simulation of ductile fracture by void growth. A strong scaling study of the FE code was conducted to determine the optimum number of processes run in parallel. The relative efficiency of the hybrid, multi-level parallelization method is studied in order to determine the parameters for the parallelization. Optimal impact velocities of the projectile, calculated using the proposed approach, are reported. © The Author(s) 2012.
Li, Chuan; Li, Lin; Zhang, Jie; Alexov, Emil
2012-09-15
The Gauss-Seidel (GS) method is a standard iterative numerical method widely used to solve a system of equations and, in general, is more efficient than other iterative methods, such as the Jacobi method. However, the standard implementation of the GS method restricts its use in parallel computing because it requires updated neighboring values (i.e., those of the current iteration) to be used as soon as they are available. Here, we report an efficient and exact (not requiring assumptions) method to parallelize iterations and to reduce the computational time as a linear/nearly linear function of the number of processes or computing units. In contrast to other existing solutions, our method does not require any assumptions and is equally applicable for solving linear and nonlinear equations. This approach is implemented in the DelPhi program, which is a finite difference Poisson-Boltzmann equation solver to model electrostatics in molecular biology. This development makes the iterative procedure for obtaining the electrostatic potential distribution in the parallelized DelPhi several-fold faster than that in the serial code. Further, we demonstrate the advantages of the new parallelized DelPhi by computing the electrostatic potential and the corresponding energies of large supramolecular structures. Copyright © 2012 Wiley Periodicals, Inc.
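The dependency on already-updated values mentioned in this abstract can be seen in a minimal sketch of textbook Gauss-Seidel iteration for a linear system Ax = b. This is the plain serial method, not the parallel scheme implemented in DelPhi (the abstract does not describe that scheme); the function name `gauss_seidel` and the fixed iteration count are assumptions made for illustration.

```python
def gauss_seidel(a, b, iterations=50):
    """Textbook serial Gauss-Seidel sweep for a x = b (a: list of rows)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # For j < i, x[j] already holds the value from THIS sweep.
            # A Jacobi sweep would read only the previous sweep's values,
            # which is why Jacobi parallelizes trivially and GS does not.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / a[i][i]
    return x


# Small diagonally dominant system: 4x + y = 1, 2x + 3y = 2.
solution = gauss_seidel([[4.0, 1.0], [2.0, 3.0]], [1.0, 2.0])
```

Because each `x[i]` is overwritten in place, the inner loop over `i` cannot simply be split across processes without changing the values the later rows read.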
Guermond, Jean-Luc
2012-01-01
We provide a convergence analysis for a new fractional timestepping technique for the incompressible Navier-Stokes equations based on direction splitting. This new technique is of linear complexity, unconditionally stable and convergent, and suitable for massive parallelization. © 2012 American Mathematical Society.
Ma, Haotong; Liu, Zejin; Xu, Xiaojun; Chen, Jinbao
2013-02-01
We propose and demonstrate the simultaneous adaptive control of a dual deformable mirror system for full-field beam shaping based on an improved stochastic parallel gradient descent (SPGD) algorithm and dual-phase-only liquid crystal spatial light modulators (LC-SLMs). One LC-SLM adaptively redistributes the intensity of the input beam and the other adaptively compensates the wavefront of the output beam. However, the intensity redistribution and wavefront compensation closed loops run simultaneously. In addition, the intensity redistribution and wavefront compensation closed loops adopt their respective metric functions independently. Experimental results show that the improved SPGD algorithm can not only be used for controlling dual deformable mirror configuration to adaptively generate near diffraction-limited flattop beams with desired intensity distributions, but also can greatly improve the control bandwidth.
Indian Academy of Sciences (India)
Algorithms. 3. Procedures and Recursion. R K Shyamasundar. In this article we introduce procedural abstraction and illustrate its uses. Further, we illustrate the notion of recursion which is one of the most useful features of procedural abstraction. Procedures. Let us consider a variation of the problem of summing the first M.
Indian Academy of Sciences (India)
number of elements. We shall illustrate the widely used matrix multiplication algorithm using the two dimensional arrays in the following. Consider two matrices A and B of integer type with dimensions m x n and n x p respectively. Then, multiplication of A by B, denoted A x B, is defined by matrix C of dimension m x p where.
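The excerpt is cut off before the definition of C. As a minimal sketch, assuming the standard definition C[i][j] = sum over k of A[i][k] * B[k][j], the product can be written with nested lists standing in for two-dimensional arrays. The function name `mat_mul` is illustrative and does not come from the article.

```python
def mat_mul(a, b):
    """Multiply an m x n matrix a by an n x p matrix b (lists of rows)."""
    m, n, p = len(a), len(b), len(b[0])
    if any(len(row) != n for row in a):
        raise ValueError("inner dimensions must agree")
    c = [[0] * p for _ in range(m)]
    for i in range(m):          # row of A and of C
        for j in range(p):      # column of B and of C
            for k in range(n):  # shared inner dimension
                c[i][j] += a[i][k] * b[k][j]
    return c


product = mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The three nested loops perform m * n * p multiplications in total.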
Parallel Atomistic Simulations
Energy Technology Data Exchange (ETDEWEB)
HEFFELFINGER,GRANT S.
2000-01-18
Algorithms developed to enable the use of atomistic molecular simulation methods with parallel computers are reviewed. Methods appropriate for bonded as well as non-bonded (and charged) interactions are included. While strategies for obtaining parallel molecular simulations have been developed for the full variety of atomistic simulation methods, molecular dynamics and Monte Carlo have received the most attention. Three main types of parallel molecular dynamics simulations have been developed, the replicated data decomposition, the spatial decomposition, and the force decomposition. For Monte Carlo simulations, parallel algorithms have been developed which can be divided into two categories, those which require a modified Markov chain and those which do not. Parallel algorithms developed for other simulation methods such as Gibbs ensemble Monte Carlo, grand canonical molecular dynamics, and Monte Carlo methods for protein structure determination are also reviewed and issues such as how to measure parallel efficiency, especially in the case of parallel Monte Carlo algorithms with modified Markov chains are discussed.
Somavarapu, Dhathri H.
This thesis proposes a new parallel computing genetic algorithm framework for designing fuel-optimal trajectories for interplanetary spacecraft missions. The framework can capture the deep search space of the problem with the use of a fixed chromosome structure and hidden-genes concept, can explore the diverse set of candidate solutions with the use of the adaptive and twin-space crowding techniques and, can execute on any high-performance computing (HPC) platform with the adoption of the portable message passing interface (MPI) standard. The algorithm is implemented in C++ with the use of the MPICH implementation of the MPI standard. The algorithm uses a patched-conic approach with two-body dynamics assumptions. New procedures are developed for determining trajectories in the Vinfinity-leveraging legs of the flight from the launch and non-launch planets and, deep-space maneuver legs of the flight from the launch and non-launch planets. The chromosome structure maintains the time of flight as a free parameter within certain boundaries. The fitness or the cost function of the algorithm uses only the mission Delta V, and does not include time of flight. The optimization is conducted with two variations for the minimum mission gravity-assist sequence, the 4-gravity-assist, and the 3-gravity-assist, with a maximum of 5 gravity-assists allowed in both the cases. The optimal trajectories discovered using the framework in both of the cases demonstrate the success of this framework.
Energy Technology Data Exchange (ETDEWEB)
Koehler, Antonio R. Sertich [Itaipu Binacional, Hernandarias (Paraguay). Superintendencia de Ingenieria]. E-mail: sertich@itaipu.gov.py; Campagnolo, Jorge Mario; Costa, Antonio J.A. Simoes; Rolim, Jaqueline Gisele [Universidade Federal de Santa Catarina, Florianopolis, SC (Brazil)
2001-07-01
This article presents a method to obtain robust adjustments for power system stabilizers by implementing genetic algorithms and eigenvalue calculations in a parallel computing architecture. The problem of adjusting the stabilizer parameters is converted into an optimization problem that can be solved with genetic algorithms, which use a fitness function based on the system eigenvalues.
Wang, Lui; Bayer, Steven E.
1991-01-01
Genetic algorithms are mathematical, highly parallel, adaptive search procedures (i.e., problem solving methods) based loosely on the processes of natural genetics and Darwinian survival of the fittest. Basic genetic algorithms concepts are introduced, genetic algorithm applications are introduced, and results are presented from a project to develop a software tool that will enable the widespread use of genetic algorithm technology.
Directory of Open Access Journals (Sweden)
2009-03-01
Full Text Available We introduce a new approximation scheme combining the viscosity method with parallel method for finding a common element of the set of solutions of a generalized equilibrium problem and the set of fixed points of a family of finitely strict pseudocontractions. We obtain a strong convergence theorem for the sequences generated by these processes in Hilbert spaces. Based on this result, we also get some new and interesting results. The results in this paper extend and improve some well-known results in the literature.
2014-05-01
Windows 95. It was a straightforward port of Word 6.0 and it introduced few new features, one of them being red-squiggle underlined spell - checking ...Architecture (Baseline) Implementation of the MIT Quicksynch Sparse Algorithm Development and Implementation of Parallel Circular Correlator...our own MATLAB code for the MIT Quicksync algorithm [2] and implemented this algorithm in a Simulink model that, at the moment, works only with
Soufan, Othman
2012-09-01
Feature selection is the first task of any learning approach that is applied in major fields of biomedical, bioinformatics, robotics, natural language processing and social networking. In feature subset selection problem, a search methodology with a proper criterion seeks to find the best subset of features describing data (relevance) and achieving better performance (optimality). Wrapper approaches are feature selection methods which are wrapped around a classification algorithm and use a performance measure to select the best subset of features. We analyze the proper design of the objective function for the wrapper approach and highlight an objective based on several classification algorithms. We compare the wrapper approaches to different feature selection methods based on distance and information based criteria. Significant improvement in performance, computational time, and selection of minimally sized feature subsets is achieved by combining different objectives for the wrapper model. In addition, considering various classification methods in the feature selection process could lead to a global solution of desirable characteristics.
Hung, Che-Lun; Lin, Yu-Shiang; Lin, Chun-Yuan; Chung, Yeh-Ching; Chung, Yi-Fang
2015-10-01
For biological applications, sequence alignment is an important strategy to analyze DNA and protein sequences. Multiple sequence alignment is an essential methodology for studying biological data, with uses such as homology modeling and phylogenetic reconstruction. However, multiple sequence alignment is an NP-hard problem. In the past decades, the progressive approach has been proposed to successfully align multiple sequences by adopting iterative pairwise alignments. Due to the rapid growth of next-generation sequencing technologies, a large number of sequences can be produced in a short period of time. When the problem instance is large, progressive alignment will be time consuming. Parallel computing is a suitable solution for such applications, and the GPU is one of the important architectures for contemporary parallel computing research. Therefore, we proposed a GPU version of ClustalW v2.0.11, called CUDA ClustalW v1.0, in this work. The experimental results show that CUDA ClustalW v1.0 can achieve more than 33× speedup in overall execution time compared with ClustalW v2.0.11. Copyright © 2015 Elsevier Ltd. All rights reserved.
Nicol, David; Fujimoto, Richard
1992-01-01
This paper surveys topics that presently define the state of the art in parallel simulation. Included in the tutorial are discussions on new protocols, mathematical performance analysis, time parallelism, hardware support for parallel simulation, load balancing algorithms, and dynamic memory management for optimistic synchronization.
Sanhueza, Claudio; Jimenez, Francia; Berretta, Regina; Moscato, Pablo
2017-01-01
Multi-Objective Optimization Problems (MOPs) have attracted growing attention during the last decades. Multi-Objective Evolutionary Algorithms (MOEAs) have been extensively used to address MOPs because they are able to approximate a set of non-dominated high-quality solutions. The Multi-Objective Quadratic Assignment Problem (mQAP) is a MOP. The mQAP is a generalization of the classical QAP, which has been extensively studied and used in several real-life applications. The mQAP is defined as havin...
Energy Technology Data Exchange (ETDEWEB)
Mehrotra, Sanjay [Northwestern Univ., Evanston, IL (United States)
2016-09-07
The support from this grant resulted in seven published papers and a technical report. Two papers are published in SIAM J. on Optimization [87, 88]; two papers are published in IEEE Transactions on Power Systems [77, 78]; one paper is published in Smart Grid [79]; one paper is published in Computational Optimization and Applications [44] and one in INFORMS J. on Computing [67]. The works in [44, 67, 87, 88] were funded primarily by this DOE grant. The applied papers in [77, 78, 79] were also supported through a subcontract from the Argonne National Lab. We start by presenting our main research results on the scenario generation problem in Sections 1–2. We present our algorithmic results on interior point methods for convex optimization problems in Section 3. We describe a new ‘central’ cutting surface algorithm developed for solving large scale convex programming problems (as is the case with our proposed research) with a semi-infinite number of constraints in Section 4. In Sections 5–6 we present our work on two application problems of interest to DOE.
Maldonado Puente, Bryan Patricio
2014-01-01
The inner detector of the ATLAS experiment has two types of silicon detectors used for tracking: Pixel Detector and SCT (semiconductor tracker). Once a proton-proton collision occurs, the resulting particles pass through these detectors and are recorded as hits on the detector surfaces. A medium to high energy particle passes through seven different surfaces of the two detectors, leaving seven hits, while lower energy particles can leave many more hits as they circle through the detector. For a typical event during the expected operational conditions, there are 30 000 hits on average recorded by the sensors. Only high energy particles are of interest for physics analysis and are taken into account for the path reconstruction; thus, a filtering process helps to discard the low energy particles produced in the collision. The following report presents a solution for increasing the speed of the filtering process in the path reconstruction algorithm.
Directory of Open Access Journals (Sweden)
Bailing Liu
2015-01-01
Full Text Available Facility location, inventory control, and vehicle route scheduling are three key issues to be settled in the design of a logistics system for e-commerce. Due to the online shopping features of e-commerce, customer returns are much more frequent than in traditional commerce. This paper studies a three-phase supply chain distribution system consisting of one supplier, a set of retailers, and a single type of product with a continuous review (Q, r) inventory policy. We formulate a stochastic location-inventory-routing problem (LIRP) model with returns that have no quality defects. To solve this NP-hard problem, a pseudo-parallel genetic algorithm integrating simulated annealing (PPGASA) is proposed. The computational results show that PPGASA outperforms GA on optimal solution, computing time, and computing stability.
Xie, Guodong; Ren, Yongxiong; Huang, Hao; Lavery, Martin P J; Ahmed, Nisar; Yan, Yan; Bao, Changjing; Li, Long; Zhao, Zhe; Cao, Yinwen; Willner, Moshe; Tur, Moshe; Dolinar, Samuel J; Boyd, Robert W; Shapiro, Jeffrey H; Willner, Alan E
2015-04-01
A stochastic-parallel-gradient-descent algorithm (SPGD) based on Zernike polynomials is proposed to generate the phase correction pattern for a distorted orbital angular momentum (OAM) beam. The Zernike-polynomial coefficients for the correction pattern are obtained by monitoring the intensity profile of the distorted OAM beam through an iteration-based feedback loop. We implement this scheme and experimentally show that the proposed approach improves the quality of the turbulence-distorted OAM beam. Moreover, we apply phase correction patterns derived from a probe OAM beam through emulated turbulence to correct other OAM beams transmitted through the same turbulence. Our experimental results show that the patterns derived this way simultaneously correct multiple OAM beams propagating through the same turbulence, and the crosstalk among these modes is reduced by more than 5 dB.
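The iteration-based feedback loop described in this abstract can be sketched with a generic two-sided SPGD update: all coefficients are perturbed at once by random ±delta steps, the metric is measured for both signs, and the coefficients move along the perturbation in proportion to the measured difference. This is a toy sketch under stated assumptions, not the authors' implementation: `toy_metric` is a made-up quadratic standing in for the measured beam-quality metric, `TARGET` is a hypothetical set of ideal Zernike coefficients, and the gain, perturbation size, and step count are arbitrary.

```python
import random

# Hypothetical "ideal" coefficients, used only by the toy metric below.
TARGET = [1.0, -0.5, 0.25]


def toy_metric(u):
    """Stand-in for a measured quality metric; maximal when u == TARGET."""
    return -sum((ui - ti) ** 2 for ui, ti in zip(u, TARGET))


def spgd_maximize(metric, u, gain=0.5, delta=0.05, steps=2000, seed=0):
    """Two-sided SPGD: perturb every coefficient in parallel, then update."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    u = list(u)
    for _ in range(steps):
        d = [delta * rng.choice((-1.0, 1.0)) for _ in u]
        j_plus = metric([ui + di for ui, di in zip(u, d)])
        j_minus = metric([ui - di for ui, di in zip(u, d)])
        dj = j_plus - j_minus
        # Step along the perturbation, scaled by the metric change.
        u = [ui + gain * dj * di for ui, di in zip(u, d)]
    return u


coefficients = spgd_maximize(toy_metric, [0.0, 0.0, 0.0])
```

Only two metric evaluations are needed per iteration regardless of the number of coefficients, which is what makes the approach attractive when the metric is a physical measurement.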
Directory of Open Access Journals (Sweden)
Rui Zhang
2013-01-01
Full Text Available We consider a parallel machine scheduling problem with random processing/setup times and adjustable production rates. The objective functions to be minimized consist of two parts; the first part is related to the due date performance (i.e., the tardiness of the jobs), while the second part is related to the setting of machine speeds. Therefore, the decision variables include both the production schedule (sequences of jobs) and the production rate of each machine. The optimization process, however, is significantly complicated by the stochastic factors in the manufacturing system. To address the difficulty, a simulation-based three-stage optimization framework is presented in this paper for high-quality robust solutions to the integrated scheduling problem. The first stage (crude optimization) is based on ordinal optimization theory, the second stage (finer optimization) is implemented with a metaheuristic called differential evolution, and the third stage (fine-tuning) is characterized by a perturbation-based local search. Finally, computational experiments are conducted to verify the effectiveness of the proposed approach. Sensitivity analysis and practical implications are also discussed.
Parallelism in matrix computations
Gallopoulos, Efstratios; Sameh, Ahmed H
2016-01-01
This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded, Vandermonde, Toeplitz, and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...
Parallel digital forensics infrastructure.
Energy Technology Data Exchange (ETDEWEB)
Liebrock, Lorie M. (New Mexico Tech, Socorro, NM); Duggan, David Patrick
2009-10-01
This report documents the architecture and implementation of a Parallel Digital Forensics infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital Forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexico Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets only expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics. This report documents the implementation of the parallel digital forensics (PDF) infrastructure architecture and implementation.
Parallel Algorithms for Computer Vision.
1989-01-01
uses a packet-switched along rows or columns of the NEWS grid quickly. message routing scheme to direct messages along the For example, grid-scani can be...pixel then gets Connection Machine supplies instructions so that many the result of the scan from the processor m in front ofK processors can read from...pixel above high through a only one match along the left or right lines of sight. If chain of pixels above low. All others are eliminated. there are no
Li, Tao; Mallick, Subhashis
2015-02-01
Consideration of azimuthal anisotropy, at least to an orthorhombic symmetry is important in exploring the naturally fractured and unconventional hydrocarbon reservoirs. Full waveform inversion of multicomponent seismic data can, in principle, provide more robust estimates of subsurface elastic parameters and density than the inversion of single component (P wave) seismic data. In addition, azimuthally dependent anisotropy can only be resolved by carefully studying the multicomponent seismic displacement data acquired and processed along different azimuths. Such an analysis needs an inversion algorithm capable of simultaneously optimizing multiple objectives, one for each data component along each azimuth. These multicomponent and multi-azimuthal seismic inversions are non-linear with non-unique solutions; it is therefore appropriate to treat the objectives as a vector and simultaneously optimize each of its components such that the optimal set of solutions could be obtained. The fast non-dominated sorting genetic algorithm (NSGA II) is a robust stochastic global search method capable of handling multiple objectives, but its computational expense increases with increasing number of objectives and the number of model parameters to be inverted for. In addition, an accurate extraction of subsurface azimuthal anisotropy requires multicomponent seismic data acquired at a fine spatial resolution along many source-to-receiver azimuths. Because routine acquisition of such data is prohibitively expensive, they are typically available along two or at most three azimuthal orientations at a spatial resolution where such an inversion could be applied. This paper proposes a novel multi-objective methodology using a parallelized version of NSGA II for waveform inversion of multicomponent seismic displacement data along two azimuths. By scaling the objectives prior to ranking, redefining the crowding distance as functions of the scaled objective and the model spaces, and varying
Energy Technology Data Exchange (ETDEWEB)
1994-02-02
This report consists of three separate but related reports. They are (1) Human Resource Development, (2) Carbon-based Structural Materials Research Cluster, and (3) Data Parallel Algorithms for Scientific Computing. To meet the objectives of the Human Resource Development plan, the plan includes K-12 enrichment activities, undergraduate research opportunities for students at the state's two Historically Black Colleges and Universities, graduate research through cluster assistantships and through a traineeship program targeted specifically to minorities, women and the disabled, and faculty development through participation in research clusters. One research cluster is the chemistry and physics of carbon-based materials. The objective of this cluster is to develop a self-sustaining group of researchers in carbon-based materials research within the institutions of higher education in the state of West Virginia. The projects will involve analysis of cokes, graphites and other carbons in order to understand the properties that provide desirable structural characteristics including resistance to oxidation, levels of anisotropy and structural characteristics of the carbons themselves. In the proposed cluster on parallel algorithms, the research topics pursued by four WVU faculty and three state liberal arts college faculty are: (1) modeling of self-organized critical systems by cellular automata; (2) multiprefix algorithms and fat-free embeddings; (3) offline and online partitioning of data computation; and (4) manipulating and rendering three dimensional objects. This cluster furthers the state Experimental Program to Stimulate Competitive Research plan by building on existing strengths at WVU in parallel algorithms.
Díaz-Mojica, John; Cruz-Atienza, Víctor M.; Madariaga, Raúl; Singh, Shri K.; Tago, Josué; Iglesias, Arturo
2014-10-01
We introduce a method for imaging the earthquake source dynamics from the inversion of ground motion records based on a parallel genetic algorithm. The source model follows an elliptical patch approach and uses the staggered-grid split-node method to simulate the earthquake dynamics. A statistical analysis is used to estimate errors in both inverted and derived source parameters. Synthetic inversion tests reveal that the average rupture speed (Vr), the rupture area, and the stress drop (Δτ) may be determined with formal errors of ~30%, ~12%, and ~10%, respectively. In contrast, derived parameters such as the radiated energy (Er), the radiation efficiency (ηr), and the fracture energy (G) have larger errors, around ~70%, ~40%, and ~25%, respectively. We applied the method to the Mw 6.5 intermediate-depth (62 km) normal-faulting earthquake of 11 December 2011 in Guerrero, Mexico. Inferred values of Δτ = 29.2 ± 6.2 MPa and ηr = 0.26 ± 0.1 are significantly higher and lower, respectively, than those of typical subduction thrust events. Fracture energy is large so that more than 73% of the available potential energy for the dynamic process of faulting was deposited in the focal region (i.e., G = (14.4 ± 3.5) × 10^14 J), producing a slow rupture process (Vr/VS = 0.47 ± 0.09) despite the relatively high energy radiation (Er = (0.54 ± 0.31) × 10^15 J) and energy-moment ratio (Er/M0 = 5.7 × 10^-5). It is interesting to point out that such a slow and inefficient rupture along with the large stress drop in a small focal region are features also observed in both the 1994 deep Bolivian earthquake and the seismicity of the intermediate-depth Bucaramanga nest.
Massively parallel mathematical sieves
Energy Technology Data Exchange (ETDEWEB)
Montry, G.R.
1989-01-01
The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
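One way to see why the sieve decomposes across processors is a segmented version: once the primes up to sqrt(n) are known, every block of the range can be sieved independently. The sketch below uses contiguous blocks and runs them one after another; it is not the scattered decomposition studied in the report, and the function names are illustrative assumptions.

```python
import math


def base_primes(limit):
    """Plain Sieve of Eratosthenes: all primes <= limit."""
    flags = [True] * (limit + 1)
    flags[0:2] = [False] * min(2, limit + 1)  # 0 and 1 are not prime
    for i in range(2, math.isqrt(limit) + 1):
        if flags[i]:
            for m in range(i * i, limit + 1, i):
                flags[m] = False
    return [i for i, f in enumerate(flags) if f]


def sieve_segment(lo, hi, primes):
    """Primes in [lo, hi), given every prime p with p * p < hi."""
    flags = [True] * (hi - lo)
    for p in primes:
        if p * p >= hi:
            break
        # First multiple of p inside the block, but never p itself.
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = False
    return [lo + i for i, f in enumerate(flags) if f and lo + i >= 2]


def segmented_primes(n, segments):
    """All primes <= n, computed block by block over `segments` blocks."""
    primes = base_primes(math.isqrt(n))
    size = -(-(n + 1) // segments)  # ceiling division
    result = []
    for s in range(segments):  # each block could go to its own processor
        lo, hi = s * size, min((s + 1) * size, n + 1)
        if lo < hi:
            result.extend(sieve_segment(lo, hi, primes))
    return result


primes_up_to_50 = segmented_primes(50, 4)
```

The blocks share only the read-only list of base primes, so no communication is needed between them until the results are gathered.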
The STAPL Parallel Graph Library
Harshvardhan,
2013-01-01
This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges. © Springer-Verlag Berlin Heidelberg 2013.
Energy Technology Data Exchange (ETDEWEB)
1991-10-23
An account of the Caltech Concurrent Computation Program (C³P), a five year project that focused on answering the question: "Can parallel computers be used to do large-scale scientific computations?" As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C³P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C³P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Energy Technology Data Exchange (ETDEWEB)
Fijany, A. [Jet Propulsion Lab., Pasadena, CA (United States); Coley, T.R. [Virtual Chemistry, Inc., San Diego, CA (United States); Cagin, T.; Goddard, W.A. III [California Institute of Technology, Pasadena, CA (United States)
1997-12-31
Successful molecular dynamics (MD) simulation of large systems (> million atoms) for long times (> nanoseconds) requires the integration of constrained equations of motion (CEOM). Constraints are used to eliminate high frequency degrees of freedom (DOF) and to allow the use of rigid bodies. Solving the CEOM allows for larger integration time-steps and helps focus the simulation on the important collective dynamics of chemical, biological, and materials systems. We explore advances in multibody dynamics which have resulted in O(N) algorithms for propagating the CEOM. However, because of their strictly sequential nature, the computational time required by these algorithms does not scale down with increased numbers of processors. We then present the new constraint force algorithm for solving the CEOM and show that this algorithm is fully parallelizable, leading to a computational cost of O(N/P + log P) for N DOF on P processors.
Compilation Techniques for Embedded Data Parallel Languages
Catanzaro, Bryan Christopher
2011-01-01
Contemporary parallel microprocessors exploit Chip Multiprocessing along with Single Instruction, Multiple Data parallelism to deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level efficiency languages such as C++ is often a difficult task, since the programmer is burdened with mapping data parallelism from an application onto the ha...
Parallel computing: numerics, applications, and trends
National Research Council Canada - National Science Library
Trobec, Roman; Vajteršic, Marián; Zinterhof, Peter
2009-01-01
... and/or distributed systems. The contributions to this book are focused on topics most concerned in the trends of today's parallel computing. These range from parallel algorithmics, programming, tools, network computing to future parallel computing. Particular attention is paid to parallel numerics: linear algebra, differential equations, numerica...
Parallel hierarchical radiosity rendering
Energy Technology Data Exchange (ETDEWEB)
Carter, Michael [Iowa State Univ., Ames, IA (United States)
1993-07-01
In this dissertation, the step-by-step development of a scalable parallel hierarchical radiosity renderer is documented. First, a new look is taken at the traditional radiosity equation, and a new form is presented in which the matrix of linear system coefficients is transformed into a symmetric matrix, thereby simplifying the problem and enabling a new solution technique to be applied. Next, the state-of-the-art hierarchical radiosity methods are examined for their suitability to parallel implementation, and scalability. Significant enhancements are also discovered which both improve their theoretical foundations and improve the images they generate. The resultant hierarchical radiosity algorithm is then examined for sources of parallelism, and for an architectural mapping. Several architectural mappings are discussed. A few key algorithmic changes are suggested during the process of making the algorithm parallel. Next, the performance, efficiency, and scalability of the algorithm are analyzed. The dissertation closes with a discussion of several ideas which have the potential to further enhance the hierarchical radiosity method, or provide an entirely new forum for the application of hierarchical methods.
Massively Parallel Finite Element Programming
Heister, Timo
2010-01-01
Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundred cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
van der Vegt, Steven; Laarman, Alfons; Vojnar, Tomas
2011-01-01
We present the first parallel compact hash table algorithm. It delivers high performance and scalability due to its dynamic region-based locking scheme with only a fraction of the memory requirements of a regular hash table.
Ultrascalable petaflop parallel supercomputer
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY
2010-07-20
A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.
Energy Technology Data Exchange (ETDEWEB)
Little, J.J.; Poggio, T.; Gamble, E.B. Jr.
1988-01-01
Computer algorithms have been developed for early vision processes that give separate cues to the distance from the viewer of three-dimensional surfaces, their shape, and their material properties. The MIT Vision Machine is a computer system that integrates several early vision modules to achieve high-performance recognition and navigation in unstructured environments. It is also an experimental environment for theoretical progress in early vision algorithms, their parallel implementation, and their integration. The Vision Machine consists of a movable, two-camera Eye-Head input device and an 8K Connection Machine. The authors have developed and implemented several parallel early vision algorithms that compute edge detection, stereopsis, motion, texture, and surface color in close to real time. The integration stage, based on coupled Markov random field models, leads to a cartoon-like map of the discontinuities in the scene, with partial labeling of the brightness edges in terms of their physical origin.
Fuzzy Clustering in Parallel Universes
Wiswedel, Bernd; Berthold, Michael R.
2005-01-01
We propose a modified fuzzy c-means algorithm that operates on different feature spaces, so-called parallel universes, simultaneously. The method assigns membership values of patterns to different universes, which are then adopted throughout the training. This leads to better clustering results since patterns not contributing to clustering in a universe are (completely or partially) ignored. The outcome of the algorithm are clusters distributed over different parallel universes, each modeling...
Directory of Open Access Journals (Sweden)
P. Hanappe
2011-09-01
Full Text Available We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations.
The modified algorithm runs more than 50 times faster on the CELL's Synergistic Processing Element than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60 % of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.
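The scheduling idea in this abstract (a task queue feeding a thread pool, with columns grouped four at a time) can be sketched roughly as follows. This is only an illustration under stated assumptions: `radiate_column`, `radiate_pack` and `run_columns` are hypothetical names, the per-column arithmetic is a placeholder rather than the FAMOUS radiation code, and grouping four columns here only mirrors the packing step without performing any SIMD computation.

```python
from concurrent.futures import ThreadPoolExecutor


def radiate_column(column):
    """Hypothetical stand-in for the radiation computation on one air column."""
    return sum(level * 0.5 for level in column)


def radiate_pack(pack):
    """Process a pack of columns together (the original packs four for SIMD)."""
    return [radiate_column(column) for column in pack]


def run_columns(columns, workers=4, pack_size=4):
    """Schedule packs of columns on a thread pool instead of a fixed domain split."""
    packs = [columns[i:i + pack_size] for i in range(0, len(columns), pack_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The executor keeps an internal task queue; map() returns results
        # in submission order, so output order matches input order.
        results = pool.map(radiate_pack, packs)
        return [flux for pack in results for flux in pack]
```

Because each column is independent, idle workers simply take the next pack from the queue, which is what lets this scheme balance load without a static decomposition.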
Hanappe, P.; Beurivé, A.; Laguzet, F.; Steels, L.; Bellouin, N.; Boucher, O.; Yamazaki, Y. H.; Aina, T.; Allen, M.
2011-09-01
We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations. The modified algorithm runs more than 50 times faster on the CELL's Synergistic Processing Element than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60 % of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.
Parallel QR Decomposition for Electromagnetic Scattering Problems
National Research Council Canada - National Science Library
Boleng, Jeff
1997-01-01
This report introduces a new parallel QR decomposition algorithm. Test results are presented for several problem sizes, numbers of processors, and data from the electromagnetic scattering problem domain...
Evaluating parallel optimization on transputers
Directory of Open Access Journals (Sweden)
A.G. Chalmers
2003-12-01
Full Text Available The faster processing power of modern computers and the development of efficient algorithms have made it possible for operations researchers to tackle a much wider range of problems than ever before. Further improvements in processing speed can be achieved by utilising relatively inexpensive transputers to process components of an algorithm in parallel. The Davidon-Fletcher-Powell method is one of the most successful and widely used optimisation algorithms for unconstrained problems. This paper examines the algorithm and identifies the components that can be processed in parallel. The results of some experiments with these components are presented, which indicate under what conditions parallel processing with an inexpensive configuration is likely to be faster than the traditional sequential implementations. The performance of the whole algorithm with its parallel components is then compared with the original sequential algorithm. The implementation serves to illustrate the practicalities of speeding up typical OR algorithms in terms of difficulty, effort and cost. The results give an indication of the savings in time a given parallel implementation can be expected to yield.
Xu, Zuwei; Zhao, Haibo; Zheng, Chuguang
2015-01-01
This paper proposes a comprehensive framework for accelerating population balance-Monte Carlo (PBMC) simulation of particle coagulation dynamics. By combining Markov jump model, weighted majorant kernel and GPU (graphics processing unit) parallel computing, a significant gain in computational efficiency is achieved. The Markov jump model constructs a coagulation-rule matrix of differentially-weighted simulation particles, so as to capture the time evolution of particle size distribution with low statistical noise over the full size range and as far as possible to reduce the number of time loopings. Here three coagulation rules are highlighted and it is found that constructing appropriate coagulation rule provides a route to attain the compromise between accuracy and cost of PBMC methods. Further, in order to avoid double looping over all simulation particles when considering the two-particle events (typically, particle coagulation), the weighted majorant kernel is introduced to estimate the maximum coagulation rates being used for acceptance-rejection processes by single-looping over all particles, and meanwhile the mean time-step of coagulation event is estimated by summing the coagulation kernels of rejected and accepted particle pairs. The computational load of these fast differentially-weighted PBMC simulations (based on the Markov jump model) is reduced greatly to be proportional to the number of simulation particles in a zero-dimensional system (single cell). Finally, for a spatially inhomogeneous multi-dimensional (multi-cell) simulation, the proposed fast PBMC is performed in each cell, and multiple cells are parallel processed by multi-cores on a GPU that can implement the massively threaded data-parallel tasks to obtain remarkable speedup ratio (comparing with CPU computation, the speedup ratio of GPU parallel computing is as high as 200 in a case of 100 cells with 10 000 simulation particles per cell). These accelerating approaches of PBMC are
Energy Technology Data Exchange (ETDEWEB)
2017-04-04
A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
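For readers unfamiliar with the seeding step being parallelized, a minimal sequential sketch of k-means++ seed selection is given below. It is not the authors' C++ code; the function names `sq_dist` and `kmeanspp_seeds` and the fixed random seed are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Choose k initial centers by D^2-weighted sampling (k-means++ seeding)."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform choice.
            new_center = rng.choice(points)
        else:
            # Sample an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            new_center = points[idx]
        centers.append(new_center)
        # This per-point update is independent across points, which is the
        # natural place to apply data parallelism.
        d2 = [min(old, sq_dist(p, new_center)) for old, p in zip(d2, points)]
    return centers
```

Each round needs one pass over all points to refresh the distances, so the cost is dominated by work that is independent per point.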
DEFF Research Database (Denmark)
The following topics are dealt with: parallel scientific computing; numerical algorithms; parallel nonnumerical algorithms; cloud computing; evolutionary computing; metaheuristics; applied mathematics; GPU computing; multicore systems; hybrid architectures; hierarchical parallelism; HPC systems...
DEFF Research Database (Denmark)
The following topics are dealt with: parallel scientific computing; numerical algorithms; parallel nonnumerical algorithms; cloud computing; evolutionary computing; metaheuristics; applied mathematics; GPU computing; multicore systems; hybrid architectures; hierarchical parallelism; HPC systems; power monitoring; energy monitoring; and distributed computing.
Parallel thermal radiation transport in two dimensions
Energy Technology Data Exchange (ETDEWEB)
Smedley-Stevenson, R.P.; Ball, S.R. [AWE Aldermaston (United Kingdom)
2003-07-01
This paper describes the distributed memory parallel implementation of a deterministic thermal radiation transport algorithm in a 2-dimensional ALE hydrodynamics code. The parallel algorithm consists of a variety of components which are combined in order to produce a state of the art computational capability, capable of solving large thermal radiation transport problems using Blue-Oak, the 3 Tera-Flop MPP (massive parallel processors) computing facility at AWE (United Kingdom). Particular aspects of the parallel algorithm are described together with examples of the performance on some challenging applications. (author)
Parallel hierarchical global illumination
Energy Technology Data Exchange (ETDEWEB)
Snell, Quinn O. [Iowa State Univ., Ames, IA (United States)
1997-10-08
Solving the global illumination problem is equivalent to determining the intensity of every wavelength of light in all directions at every point in a given scene. The complexity of the problem has led researchers to use approximation methods for solving the problem on serial computers. Rather than using an approximation method, such as backward ray tracing or radiosity, the authors have chosen to solve the Rendering Equation by direct simulation of light transport from the light sources. This paper presents an algorithm that solves the Rendering Equation to any desired accuracy, and can be run in parallel on distributed memory or shared memory computer systems with excellent scaling properties. It appears superior in both speed and physical correctness to recent published methods involving bidirectional ray tracing or hybrid treatments of diffuse and specular surfaces. Like progressive radiosity methods, it dynamically refines the geometry decomposition where required, but does so without the excessive storage requirements for ray histories. The algorithm, called Photon, produces a scene which converges to the global illumination solution. This amounts to a huge task for a 1997-vintage serial computer, but using the power of a parallel supercomputer significantly reduces the time required to generate a solution. Currently, Photon can be run on most parallel environments from a shared memory multiprocessor to a parallel supercomputer, as well as on clusters of heterogeneous workstations.
Parallel processing from applications to systems
Moldovan, Dan I
1993-01-01
This text provides one of the broadest presentations of parallel processing available, including the structure of parallel processors and parallel algorithms. The emphasis is on mapping algorithms to highly parallel computers, with extensive coverage of array and multiprocessor architectures. Early chapters provide insightful coverage on the analysis of parallel algorithms and program transformations, effectively integrating a variety of material previously scattered throughout the literature. Theory and practice are well balanced across diverse topics in this concise presentation. For exceptional cla
Parallel integer sorting with medium and fine-scale parallelism
Dagum, Leonardo
1993-01-01
Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.
Parallel and distributed Gröbner bases computation in JAS
Kredel, Heinz
2010-01-01
This paper considers parallel Gröbner bases algorithms on distributed memory parallel computers with multi-core compute nodes. We summarize three different Gröbner bases implementations: shared memory parallel, pure distributed memory parallel and distributed memory combined with shared memory parallelism. The last algorithm, called distributed hybrid, uses only one control communication channel between the master node and the worker nodes and keeps polynomials in shared memory on a node....
Parallel machine architecture and compiler design facilities
Kuck, David J.; Yew, Pen-Chung; Padua, David; Sameh, Ahmed; Veidenbaum, Alex
1990-01-01
The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of the Delta project (whose objective is to provide a facility for rapid prototyping of parallelizing compilers that can target different machine architectures) is summarized. Included are surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role.
On parameter synthesis by parallel model checking.
Barnat, Jirí; Brim, Lubos; Krejcí, Adam; Streck, Adam; Safránek, David; Vejnár, Martin; Vejpustek, Tomás
2012-01-01
An important problem in current computational systems biology is to analyze models of biological systems dynamics under parameter uncertainty. This paper presents a novel algorithm for parameter synthesis based on parallel model checking. The algorithm is conceptually universal with respect to the modeling approach employed. We introduce the algorithm, show its scalability, and examine its applicability on several biological models.
Algorithms and Algorithmic Languages.
Veselov, V. M.; Koprov, V. M.
This paper is intended as an introduction to a number of problems connected with the description of algorithms and algorithmic languages, particularly the syntaxes and semantics of algorithmic languages. The terms "letter, word, alphabet" are defined and described. The concept of the algorithm is defined and the relation between the algorithm and…
A parallel approach to the stable marriage problem
DEFF Research Database (Denmark)
Larsen, Jesper
1997-01-01
This paper describes two parallel algorithms for the stable marriage problem implemented on a MIMD parallel computer. The algorithms are tested against sequential algorithms on randomly generated and worst-case instances. The results clearly show that the combination of a very simple problem and a commercial MIMD system results in parallel algorithms which are not competitive with sequential algorithms with respect to practical performance.
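As background for the sequential baseline, the classic Gale-Shapley proposal algorithm for the stable marriage problem can be sketched as below. The abstract does not say which sequential algorithms were used in the comparison, so treat this only as a representative example; the function name and the list-of-lists preference format are assumptions.

```python
from collections import deque


def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching as a dict {proposer: receiver}.

    Both arguments are lists of preference lists over indices 0..n-1,
    most preferred first.
    """
    n = len(proposer_prefs)
    # rank[r][p] is the position of proposer p in receiver r's list.
    rank = [{p: pos for pos, p in enumerate(prefs)} for prefs in receiver_prefs]
    next_choice = [0] * n       # next receiver each proposer will try
    engaged_to = [None] * n     # receiver -> current proposer
    free = deque(range(n))
    while free:
        p = free.popleft()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to[r]
        if current is None:
            engaged_to[r] = p
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p       # r trades up; the old partner is free again
            free.append(current)
        else:
            free.append(p)          # rejected; p will try the next choice
    return {engaged_to[r]: r for r in range(n)}
```

The inner step is only a few table lookups per proposal, which is consistent with the abstract's remark that the problem is very simple and therefore hard to speed up once communication costs are added.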
GPU-Accelerated Apriori Algorithm
Directory of Open Access Journals (Sweden)
Jiang Hao
2017-01-01
Full Text Available This paper proposes a parallel Apriori algorithm based on GPU (GPUApriori) for frequent itemset mining, and designs a storage structure using a bit table (BIT) matrix to replace the traditional storage mode. In addition, a parallel computing scheme on GPU is discussed. The experimental results show that the GPUApriori algorithm can effectively improve the efficiency of frequent itemset mining.
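The bit-table idea can be illustrated with a small CPU-only sketch: each item gets one bit per transaction, and the support of an itemset is the number of set bits in the bitwise AND of its items' rows. The layout and the names `build_bit_table`, `support` and `frequent_itemsets` are assumptions for illustration, not the paper's GPU data structure.

```python
from itertools import combinations


def build_bit_table(transactions):
    """Map each item to an integer whose bit t is set if transaction t has it."""
    table = {}
    for tid, items in enumerate(transactions):
        for item in items:
            table[item] = table.get(item, 0) | (1 << tid)
    return table


def support(itemset, table):
    """Count transactions containing every item: popcount of the ANDed rows."""
    items = list(itemset)
    bits = table.get(items[0], 0)
    for item in items[1:]:
        bits &= table.get(item, 0)
    return bin(bits).count("1")


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search using the bit table for support counting."""
    table = build_bit_table(transactions)
    level = [frozenset([i]) for i in sorted(table)
             if support([i], table) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, table)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent, then check support.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c, table) >= min_support]
    return frequent
```

Support counting reduces to AND and popcount operations on fixed-width words, which is the kind of regular, independent work that maps well to GPU threads.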
McCallum, Ethan
2011-01-01
It's tough to argue with R as a high-quality, cross-platform, open source statistical software product-unless you're in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You'll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don't. With these packages, you can overcome R's single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R's memory barrier.
Graphical Representation of Parallel Algorithmic Processes
1990-12-01
other research. It accepts performance data from PICL (described below) and displays it in many ways. It provides node activity and node CPU ... translation involves performing byte-order swaps; this is required because the Intel 80386 processor stores numbers low byte first, while the Sun SPARC ... basic information regarding message passing between processes, overall communications load, communications statistics for each node, and CPU
A Topological Model for Parallel Algorithm Design
1991-09-01
New York, 1984. ACM Press. Held at Salt Lake City, Utah, on January 15-18, 1984. 132. G. Frege. The Basic Laws of Arithmetic. University of California ... In J. van Heijenoort, editor, From Frege to Gödel, pages 199-215. Harvard University Press, Cambridge, MA, 1967. This article is the English
Parallelization of game theoretic centrality algorithms
Indian Academy of Sciences (India)
application such as viral marketing where the initial set of influencers determines the success (Bass 1969; Brown & Reingen 1987; Domingos & Richardson 2001; Richardson & Domingos 2002), identification of critical nodes in power systems where the failure of a critical node may cause a cascading failure leading to ...
Parallelization of game theoretic centrality algorithms
Indian Academy of Sciences (India)
Communication has become a lot easier with the advent of easy and cheap means of reaching people across the globe. This has allowed the development of large networked communities and, with the technology available to track them, has opened up the study of social networks at unprecedented scales. This has ...
Binary Trees and Parallel Scheduling Algorithms.
1980-09-01
child of N) and t(N) - t(right child of N). For our example, the binary computation tree together with time intervals is shown in Figure 2.2. A ... Opérationnelle, 10.5, Supp. 7-33, 1976. 22. Lenstra, J. K., "Sequencing by enumerative methods," Mathematical Centre Tract 69, Mathematisch Centrum, Amsterdam ... Theory of Scheduling and its Applications, Lecture Notes in Economics and Mathematical Systems, 86, Springer, Berlin, pp. 393-398, 1973. 35. Smith, W
DEFF Research Database (Denmark)
Bilardi, Gianfranco; Pietracaprina, Andrea; Pucci, Geppino
2016-01-01
A framework is proposed for the design and analysis of network-oblivious algorithms, namely algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network...... in the latter model implies optimality in the decomposable bulk synchronous parallel model, which is known to effectively describe a wide and significant class of parallel platforms. The proposed framework can be regarded as an attempt to port the notion of obliviousness, well established in the context...
Automatic Parallelization Tool: Classification of Program Code for Parallel Computing
Directory of Open Access Journals (Sweden)
Mustafa Basthikodi
2016-04-01
Full Text Available Performance growth of single-core processors came to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks along with Graphical Processing Units have broadly enhanced parallelism. Several compilers have been updated to address the emerging challenges of synchronization and threading. Appropriate program and algorithm classifications will greatly help software engineers find opportunities for effective parallelization. In the present work we investigate current species for the classification of algorithms; related work on classification is discussed along with a comparison of the issues that challenge classification. A set of algorithms is chosen whose structure matches different issues and which perform a given task. We have tested these algorithms utilizing existing automatic species extraction tools along with the Bones compiler. We have added functionality to the existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user-defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by the original species of algorithms. We implemented the new theories in the tool, enabling automatic characterization of program code.
Massively parallel quantum computer simulator
De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.
2007-01-01
We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray
Parallel Computational Protein Design.
Zhou, Yichao; Donald, Bruce R; Zeng, Jianyang
2017-01-01
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that is guaranteed to find the global minimum energy conformation (GMEC) combines dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computational bottleneck of a large-scale computational protein design process. To address this issue, we extend and add a new module to the OSPREY program that was previously developed in the Donald lab (Gainza et al., Methods Enzymol 523:87, 2013) to implement a GPU-based massively parallel A* algorithm for improving the protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedups in large protein design cases with a small memory overhead compared to the traditional A* search algorithm implementation, while still guaranteeing optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle problems in which the conformation space is too large and the global optimal solution could not previously be computed. Furthermore, the GPU-based A* algorithm implemented in the gOSPREY program can be combined with state-of-the-art rotamer pruning algorithms such as iMinDEE (Gainza et al., PLoS Comput Biol 8:e1002335, 2012) and DEEPer (Hallen et al., Proteins 81:18-39, 2013) to also consider continuous backbone and side-chain flexibility.
Writing parallel programs that work
CERN. Geneva
2012-01-01
Serial algorithms typically run inefficiently on parallel machines. This may sound like an obvious statement, but it is the root cause of why parallel programming is considered to be difficult. The current state of the computer industry is still that almost all programs in existence are serial. This talk will describe the techniques used in the Intel Parallel Studio to provide a developer with the tools necessary to understand the behaviors and limitations of the existing serial programs. Once the limitations are known the developer can refactor the algorithms and reanalyze the resulting programs with the tools in the Intel Parallel Studio to create parallel programs that work. About the speaker Paul Petersen is a Sr. Principal Engineer in the Software and Solutions Group (SSG) at Intel. He received a Ph.D. degree in Computer Science from the University of Illinois in 1993. After UIUC, he was employed at Kuck and Associates, Inc. (KAI) working on auto-parallelizing compiler (KAP), and was involved in th...
Template based parallel checkpointing in a massively parallel computer system
Archer, Charles Jens [Rochester, MN; Inglett, Todd Alan [Rochester, MN
2009-01-13
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
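The core comparison described here, storing only the blocks whose checksums differ from a template checkpoint, can be sketched in a few lines. This is a simplified single-node illustration with fixed block offsets; it does not reproduce rsync's rolling checksums, the broadcast step, or the patented method, and the names `block_checksums`, `delta_against_template`, `restore` and the tiny block size are assumptions.

```python
import hashlib

BLOCK = 4  # deliberately tiny block size, for illustration only


def block_checksums(data, block=BLOCK):
    """Checksum every fixed-size block of a checkpoint image."""
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]


def delta_against_template(node_data, template_sums, block=BLOCK):
    """Keep only the blocks whose checksum differs from the template's."""
    delta = {}
    for idx, start in enumerate(range(0, len(node_data), block)):
        chunk = node_data[start:start + block]
        digest = hashlib.sha256(chunk).hexdigest()
        if idx >= len(template_sums) or template_sums[idx] != digest:
            delta[idx] = chunk
    return delta


def restore(template, delta, length, block=BLOCK):
    """Rebuild a node's checkpoint from the template plus its stored blocks."""
    image = bytearray(template[:length].ljust(length, b"\0"))
    for idx, chunk in delta.items():
        image[idx * block: idx * block + len(chunk)] = chunk
    return bytes(image)
```

When most nodes hold nearly identical state, the per-node delta is small, which is the source of the reduced transfer and storage volume the abstract describes.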
Directory of Open Access Journals (Sweden)
James G. Worner
2017-05-01
Full Text Available James Worner is an Australian-based writer and scholar currently pursuing a PhD at the University of Technology Sydney. His research seeks to expose masculinities lost in the shadow of Australia’s Anzac hegemony while exploring new opportunities for contemporary historiography. He is the recipient of the Doctoral Scholarship in Historical Consciousness at the university’s Australian Centre of Public History and will be hosted by the University of Bologna during 2017 on a doctoral research writing scholarship. ‘Parallel Lines’ is one of a collection of stories, The Shapes of Us, exploring liminal spaces of modern life: class, gender, sexuality, race, religion and education. It looks at lives, like lines, that do not meet but which travel in proximity, simultaneously attracted and repelled. James’ short stories have been published in various journals and anthologies.
Fast parallel event reconstruction
CERN. Geneva
2010-01-01
On-line processing of the large data volumes produced in modern HEP experiments requires using the maximum capabilities of modern and future many-core CPU and GPU architectures. One such powerful feature is the SIMD instruction set, which allows several data items to be packed into one register and operated on together, thus achieving more operations per clock cycle. Motivated by the idea of using the SIMD unit of modern processors, the KF based track fit has been adapted for parallelism, including memory optimization, numerical analysis, vectorization with inline operator overloading, and optimization using SDKs. The speed of the algorithm has been increased by a factor of 120000, reaching 0.1 ms/track, running in parallel on 16 SPEs of a Cell Blade computer. Running on a Nehalem CPU with 8 cores, it shows a processing speed of 52 ns/track using the Intel Threading Building Blocks. The same KF algorithm running on an Nvidia GTX 280 in the CUDA framework provi...
The Modeling of the ERP Systems within Parallel Calculus
Directory of Open Access Journals (Sweden)
Loredana MOCEAN
2011-01-01
Full Text Available As has been known for some years, the basic characteristics of ERP systems are: modular design, a central common database, integration of the modules, automatic data transfer between modules, complex systems, and flexible configuration. Because of this, a parallel approach is an obvious way to design and implement them, using parallel algorithms, parallel calculus and distributed databases. This paper aims to support these assertions and to provide a summary model of what an ERP system based on parallel computing and algorithms could be.
Design considerations for parallel graphics libraries
Crockett, Thomas W.
1994-01-01
Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.
Contact-impact simulations on massively parallel SIMD supercomputers
Energy Technology Data Exchange (ETDEWEB)
Plaskacz, E. J. [Argonne National Lab., IL (United States)]; Belytschko, T.; Chiang, H. Y. [Northwestern Univ., Evanston, IL (United States)]
1992-01-01
The implementation of explicit finite element methods with contact-impact on massively parallel SIMD computers is described. The basic parallel finite element algorithm employs an exchange process which minimizes interprocessor communication at the expense of redundant computations and storage. The contact-impact algorithm is based on the pinball method in which compatibility is enforced by preventing interpenetration on spheres embedded in elements adjacent to surfaces. The enhancements to the pinball algorithm include a parallel assembled surface normal algorithm and a parallel detection of interpenetrating pairs. Some timings with and without contact-impact are given.
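The detection of interpenetrating pairs in a pinball-style contact check reduces to a sphere overlap test. The sketch below is a serial, all-pairs illustration under simple assumptions (each sphere is given as a centre and a radius; the function name `interpenetrating_pairs` is hypothetical). It does not reproduce the SIMD implementation or the exchange process described in the abstract.

```python
import math
from itertools import combinations


def interpenetrating_pairs(spheres):
    """Return index pairs (i, j) of spheres that overlap.

    spheres: list of (center, radius), where center is an (x, y, z) tuple.
    Two spheres interpenetrate when the distance between their centres
    is smaller than the sum of their radii.
    """
    pairs = []
    for (i, (ci, ri)), (j, (cj, rj)) in combinations(enumerate(spheres), 2):
        if math.dist(ci, cj) < ri + rj:
            pairs.append((i, j))
    return pairs
```

Each pair test is independent of the others, which is what makes this kind of detection a natural candidate for data-parallel execution.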
Parallel approach in RDF query processing
Vajgl, Marek; Parenica, Jan
2017-07-01
Parallel processing is nowadays a very cheap way to increase computational power, thanks to the availability of multithreaded computational units. Such hardware has become a typical part of today's personal computers and notebooks and is widespread. This contribution reports experiments on how the evaluation of a computationally complex inference algorithm over RDF data can be parallelized on graphics cards to decrease computation time.
Pattern-Driven Automatic Parallelization
Directory of Open Access Journals (Sweden)
Christoph W. Kessler
1996-01-01
Full Text Available This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.
Cellular automata a parallel model
Mazoyer, J
1999-01-01
Cellular automata can be viewed both as computational models and modelling systems of real processes. This volume emphasises the first aspect. In articles written by leading researchers, sophisticated massive parallel algorithms (firing squad, life, Fischer's primes recognition) are treated. Their computational power and the specific complexity classes they determine are surveyed, while some recent results in relation to chaos from a new dynamic systems point of view are also presented. Audience: This book will be of interest to specialists of theoretical computer science and the parallelism challenge.
Experiments with the auction algorithm for the shortest path problem
DEFF Research Database (Denmark)
Larsen, Jesper; Pedersen, Ib
1999-01-01
The auction approach for the shortest path problem (SPP) as introduced by Bertsekas is tested experimentally. Parallel algorithms using the auction approach are developed and tested. Both the sequential and parallel auction algorithms perform significantly worse than a state-of-the-art Dijkstra-like reference algorithm. Experiments are run on a distributed-memory MIMD class Meiko parallel computer.
Unitary Quantum Lattice Algorithms for Turbulence
2016-05-23
and is ideally parallelized. The algorithm is benchmarked against exact one dimensional vector inelastic soliton collision solutions. Three...release. 10 Parallelization of QLG algorithms The beauty of the QLG algorithm will not only run on a quantum computer when they become...timings : in strong scaling one fixes the grid and increases the number of cores. For ideal parallelization, the wallclock time will decrease by a
Advances in randomized parallel computing
Rajasekaran, Sanguthevar
1999-01-01
The technique of randomization has been employed to solve numerous problems of computing both sequentially and in parallel. Examples of randomized algorithms that are asymptotically better than their deterministic counterparts in solving various fundamental problems abound. Randomized algorithms have the advantages of simplicity and better performance both in theory and often in practice. This book is a collection of articles written by renowned experts in the area of randomized parallel computing. A brief introduction to randomized algorithms: In the analysis of algorithms, at least three different measures of performance can be used: the best case, the worst case, and the average case. Often, the average case run time of an algorithm is much smaller than the worst case. For instance, the worst case run time of Hoare's quicksort is $O(n^2)$, whereas its average case run time is only $O(n \log n)$. The average case analysis is conducted with an assumption on the input space. The assumption made to arrive at t...
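The gap between the worst case and the average case of quicksort can be observed by counting comparisons. The following is a minimal sketch, not taken from the book: it uses a simple list-based quicksort, and the helper names `first_pivot` and `make_random_pivot` are hypothetical. On an already sorted input, always choosing the first element as pivot costs n(n-1)/2 comparisons, while a random pivot stays far below that.

```python
import random


def quicksort_comparisons(items, choose_pivot):
    """Sort a copy of items; return (sorted_list, number_of_comparisons).

    choose_pivot(length) must return a pivot index in range(length).
    """
    comparisons = 0

    def sort(seq):
        nonlocal comparisons
        if len(seq) <= 1:
            return seq
        pivot_index = choose_pivot(len(seq))
        pivot = seq[pivot_index]
        rest = seq[:pivot_index] + seq[pivot_index + 1:]
        comparisons += len(rest)  # every remaining element is compared with the pivot once
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        return sort(smaller) + [pivot] + sort(larger)

    return sort(list(items)), comparisons


def first_pivot(length):
    """Deterministic rule: always pick the first element."""
    return 0


def make_random_pivot(seed=0):
    """Return a randomized pivot rule backed by its own seeded generator."""
    rng = random.Random(seed)
    return lambda length: rng.randrange(length)


if __name__ == "__main__":
    data = list(range(200))  # sorted input: worst case for the first-element rule
    _, worst = quicksort_comparisons(data, first_pivot)
    _, typical = quicksort_comparisons(data, make_random_pivot(seed=1))
    print(worst, typical)
```

The randomized variant does not depend on any assumption about the input distribution, which is the usual argument for preferring it over an average-case analysis of the deterministic rule.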
Parallel Algebraic Multigrid Methods - High Performance Preconditioners
Energy Technology Data Exchange (ETDEWEB)
Yang, U M
2004-11-11
The development of high performance, massively parallel computers and the increasing demands of computationally challenging applications have necessitated the development of scalable solvers and preconditioners. One of the most effective ways to achieve scalability is the use of multigrid or multilevel techniques. Algebraic multigrid (AMG) is a very efficient algorithm for solving large problems on unstructured grids. While much of it can be parallelized in a straightforward way, some components of the classical algorithm, particularly the coarsening process and some of the most efficient smoothers, are highly sequential, and require new parallel approaches. This chapter presents the basic principles of AMG and gives an overview of various parallel implementations of AMG, including descriptions of parallel coarsening schemes and smoothers, some numerical results as well as references to existing software packages.
Productive Parallel Programming: The PCN Approach
Directory of Open Access Journals (Sweden)
Ian Foster
1992-01-01
Full Text Available We describe the PCN programming system, focusing on those features designed to improve the productivity of scientists and engineers using parallel supercomputers. These features include a simple notation for the concise specification of concurrent algorithms, the ability to incorporate existing Fortran and C code into parallel applications, facilities for reusing parallel program components, a portable toolkit that allows applications to be developed on a workstation or small parallel computer and run unchanged on supercomputers, and integrated debugging and performance analysis tools. We survey representative scientific applications and identify problem classes for which PCN has proved particularly useful.
Parallel auto-correlative statistics with VTK.
Energy Technology Data Exchange (ETDEWEB)
Pebay, Philippe Pierre; Bennett, Janine Camille
2013-08-01
This report summarizes existing statistical engines in VTK and presents both the serial and parallel auto-correlative statistics engines. It is a sequel to [PT08, BPRT09b, PT09, BPT09, PT10] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, contingency, k-means, and order statistics engines. The ease of use of the new parallel auto-correlative statistics engine is illustrated by the means of C++ code snippets and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the autocorrelative statistics engine.
Structured Parallel Programming Patterns for Efficient Computation
McCool, Michael; Robison, Arch
2012-01-01
Programming is now parallel programming. Much as structured programming revolutionized traditional serial programming decades ago, a new kind of structured programming, based on patterns, is relevant to parallel programming today. Parallel computing experts and industry insiders Michael McCool, Arch Robison, and James Reinders describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give detailed concrete examples using multiple programming models. Examples are primarily given using two of th
Foundations of genetic algorithms 1991
1991-01-01
Foundations of Genetic Algorithms 1991 (FOGA 1) discusses the theoretical foundations of genetic algorithms (GA) and classifier systems. This book compiles research papers on selection and convergence, coding and representation, problem hardness, deception, classifier system design, variation and recombination, parallelization, and population divergence. Other topics include the non-uniform Walsh-schema transform; spurious correlations and premature convergence in genetic algorithms; and variable default hierarchy separation in a classifier system. The grammar-based genetic algorithm; condition
Fuzzy clustering in parallel universes
Wiswedel, Bernd; Berthold, Michael R.
2007-01-01
We present an extension of the fuzzy c-Means algorithm which operates simultaneously on different feature spaces, so-called parallel universes, and also incorporates noise detection. The method assigns membership values of patterns to different universes, which are then adopted throughout the training. This leads to better clustering results, since patterns not contributing to clustering in a universe are (completely or partially) ignored. The method also uses an auxiliary universe to capt...
Northeast Parallel Architectures Center (NPAC)
1992-07-01
networks using parallel algorithms for detection of lineaments in remotely sensed LandSat data of the Canadian Shield and for detection of abnormalities in...University Ashok K. Joshi Ashok K. Joshi, Syracuse University 211 Link Hall Syracuse NY 13244 Abstract of Research Remotely sensed data from satellite ...are not readily visible in the imageries. Photo-interpretation of these satellite images is the more commonly used technique and not much emphasis
Fast data parallel polygon rendering
Energy Technology Data Exchange (ETDEWEB)
Ortega, F.A.; Hansen, C.D.
1993-09-01
This paper describes a parallel method for polygonal rendering on a massively parallel SIMD machine. This method, based on a simple shading model, is targeted at applications which require very fast polygon rendering for extremely large sets of polygons, such as are found in many scientific visualization applications. The algorithms described in this paper are incorporated into a library of 3D graphics routines written for the Connection Machine. The routines are implemented on both the CM-200 and the CM-5. This library enables scientists to display 3D shaded polygons directly from a parallel machine without the need to transmit huge amounts of data to a post-processing rendering system.
DEFF Research Database (Denmark)
Mahnke, Martina; Uprichard, Emma
2014-01-01
changes: it’s not the ocean, it’s the internet we’re talking about, and it’s not a TV show producer, but algorithms that constitute a sort of invisible wall. Building on this assumption, most research is trying to ‘tame the algorithmic tiger’. While this is a valuable and often inspiring approach, we would like to emphasize another side to the algorithmic everyday life. We argue that algorithms can instigate and facilitate imagination, creativity, and frivolity, while saying something that is simultaneously old and new, always almost repeating what was before but never quite returning. We show this by threading together stimulating quotes and screenshots from Google’s autocomplete algorithms. In doing so, we invite the reader to re-explore Google’s autocomplete algorithms in a creative, playful, and reflexive way, thereby rendering more visible some of the excitement and frivolity that comes from being...
SWAMP+: multiple subsequence alignment using associative massive parallelism
Energy Technology Data Exchange (ETDEWEB)
Steinfadt, Shannon Irene [Los Alamos National Laboratory; Baker, Johnnie W [KENT STATE UNIV.
2010-10-18
A new parallel algorithm SWAMP+ incorporates the Smith-Waterman sequence alignment on an associative parallel model known as ASC. It is a highly sensitive parallel approach that expands traditional pairwise sequence alignment. This is the first parallel algorithm to provide multiple non-overlapping, non-intersecting subsequence alignments with the accuracy of Smith-Waterman. The efficient algorithm provides multiple alignments similar to BLAST while creating a better workflow for the end users. The parallel portions of the code run in O(m+n) time using m processors. When m = n, the algorithmic analysis becomes O(n) with a coefficient of two, yielding a linear speedup. Implementation of the algorithm on the SIMD ClearSpeed CSX620 confirms this theoretical linear speedup with real timings.
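The O(m+n) bound for the parallel portion follows from the data dependencies of the Smith-Waterman recurrence: all cells on one anti-diagonal depend only on the two previous anti-diagonals, so they can be computed at the same time. The sketch below is a serial illustration of that wavefront order, not the SWAMP+ or ASC implementation; the linear gap penalty and the scoring values are assumptions chosen for the example.

```python
def smith_waterman_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    The score matrix is filled one anti-diagonal at a time. Cells on the
    same anti-diagonal (i + j == d) are mutually independent, so with one
    processor per row each anti-diagonal could be computed in a single
    parallel step, giving m + n - 1 steps in total.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):
        # Valid rows on this anti-diagonal: 1 <= i <= m and 1 <= d - i <= n.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,        # gap in b
                H[i][j - 1] + gap,        # gap in a
            )
            best = max(best, H[i][j])
    return best
```

The inner loop is the part that an associative or SIMD machine would execute in parallel; the outer loop over anti-diagonals remains sequential.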
A Coupling Tool for Parallel Molecular Dynamics-Continuum Simulations
Neumann, Philipp
2012-06-01
We present a tool for coupling Molecular Dynamics and continuum solvers. It is written in C++ and is meant to support the developers of hybrid molecular-continuum simulations in both the realisation of the respective coupling algorithm and the parallel execution of the hybrid simulation. We describe the implementational concept of the tool and its parallel extensions. We particularly focus on the parallel execution of particle insertions into dense molecular systems and propose a respective parallel algorithm. Our implementations are validated for serial and parallel setups in two and three dimensions. © 2012 IEEE.
Distributed memory parallel computers and computational fluid dynamics
Roose, D.; Vandriessche, R.
A tutorial on aspects of parallel computing that are important for the development of efficient parallel algorithms and software for Computational Fluid Dynamics (CFD) is presented. Some important concepts concerning distributed memory parallel computers and parallel algorithms and the parallelization of CFD algorithms on structured grids are given. Many techniques used in CFD are shown to be suited for parallelization. The minimization of work load imbalance and communication and the expectation of a general high speedup and parallel efficiency are outlined. Some methods for partitioning and mapping unstructured grids are described. These methods range from simple heuristics to global optimization methods. Methods that show a good tradeoff between execution time and quality of the result, such as inertial recursive bisection and eigenvector recursive bisection, are considered.
Scalable Parallel Density-based Clustering and Applications
Patwary, Mostofa Ali
2014-04-01
Recently, density-based clustering algorithms (DBSCAN and OPTICS) have received significant attention from the scientific community due to their unique capability of discovering arbitrarily shaped clusters and eliminating noise data. These algorithms have several applications which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms is extremely challenging, as they exhibit an inherently sequential data access order and an unbalanced workload, resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
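The link between DBSCAN and connected components can be sketched with a union-find (disjoint-set) structure: core points that lie within eps of each other are merged, and border points join the cluster of a neighbouring core point. This is a small serial illustration under simplifying assumptions (brute-force neighbour search, hypothetical function name `dbscan_union_find`), not the authors' parallel algorithm.

```python
import math


def dbscan_union_find(points, eps, min_pts):
    """Cluster points DBSCAN-style; return one label per point (None means noise).

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps. Labels are the indices of cluster representatives.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[max(rx, ry)] = min(rx, ry)

    # Core points within eps of each other belong to the same connected component.
    for i in range(n):
        if is_core[i]:
            for j in neighbors[i]:
                if is_core[j]:
                    union(i, j)

    labels = [None] * n
    for i in range(n):
        if is_core[i]:
            labels[i] = find(i)
        else:
            # Border point: adopt the cluster of any core neighbour; otherwise noise.
            core_neighbors = [j for j in neighbors[i] if is_core[j]]
            if core_neighbors:
                labels[i] = find(core_neighbors[0])
    return labels
```

Because union operations can be applied in any order, the merging step does not need the sequential expansion order of classical DBSCAN, which is the property a parallel formulation can exploit.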
Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts
Energy Technology Data Exchange (ETDEWEB)
NONE
1997-12-31
This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.
Algorithms Introduction to Algorithms
Indian Academy of Sciences (India)
R K Shyamasundar. Series Article. Resonance – Journal of Science Education, Volume 1, Issue 1, January 1996, pp 20-27. Permanent link: http://www.ias.ac.in/article/fulltext/reso/001/01/0020-0027
Data communications in a parallel active messaging interface of a parallel computer
Davis, Kristan D; Faraj, Daniel A
2013-07-09
Algorithm selection for data communications in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
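The selection step described above, where each algorithm is associated with its own range of message sizes, can be pictured as a sorted lookup table. The sketch below is only an illustration of that idea: the class `AlgorithmSelector`, the size bounds and the algorithm labels are hypothetical and are not part of the PAMI interface.

```python
import bisect


class AlgorithmSelector:
    """Map non-overlapping message-size ranges to algorithm names."""

    def __init__(self, ranges):
        """ranges: list of (inclusive_upper_bound_in_bytes, algorithm_name).

        Each algorithm handles sizes above the previous bound and up to
        and including its own bound.
        """
        self._ranges = sorted(ranges)
        self._bounds = [bound for bound, _ in self._ranges]

    def select(self, message_size):
        """Return the algorithm whose range contains message_size."""
        index = bisect.bisect_left(self._bounds, message_size)
        if index == len(self._ranges):
            raise ValueError("no algorithm registered for size %d" % message_size)
        return self._ranges[index][1]


# Hypothetical table: the labels and thresholds are made up for illustration.
selector = AlgorithmSelector([
    (128, "short"),
    (4096, "eager"),
    (float("inf"), "rendezvous"),
])
```

A call such as `selector.select(129)` picks the second entry, since 129 bytes exceeds the first bound; the binary search keeps selection cheap even with many ranges.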
A review on quantum search algorithms
Giri, Pulak Ranjan; Korepin, Vladimir E.
2017-12-01
The use of superposition of states in quantum computation, known as quantum parallelism, has a significant advantage in terms of speed over classical computation. This is evident from early quantum algorithms such as Deutsch's algorithm, the Deutsch-Jozsa algorithm and its variation, the Bernstein-Vazirani algorithm, the Simon algorithm, Shor's algorithms, etc. Quantum parallelism also significantly speeds up the database search algorithm, which is important in computer science because it comes as a subroutine in many important algorithms. Grover's quantum database search achieves the task of finding the target element in an unsorted database in a time quadratically faster than a classical computer. We review Grover's quantum search algorithms for single and multiple target elements in a database. The partial search algorithm of Grover and Radhakrishnan and its optimization by Korepin, called the GRK algorithm, are also discussed.
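The quadratic speedup of Grover's search for a single target can be checked with a small classical simulation of the state vector. This is a sketch for illustration only (the function name `grover_search` is hypothetical, and simulating N amplitudes classically of course gives no speedup): each iteration applies the oracle, which flips the sign of the target amplitude, followed by inversion about the mean, and roughly (π/4)·√N iterations suffice.

```python
import math


def grover_search(n_items, target):
    """Classically simulate Grover's algorithm for one marked item.

    Returns (iterations, success_probability), where success_probability
    is the squared amplitude of the target after the final iteration.
    """
    amplitudes = [1 / math.sqrt(n_items)] * n_items  # uniform superposition
    iterations = math.floor(math.pi / 4 * math.sqrt(n_items))
    for _ in range(iterations):
        amplitudes[target] = -amplitudes[target]           # oracle: mark the target
        mean = sum(amplitudes) / n_items
        amplitudes = [2 * mean - a for a in amplitudes]    # inversion about the mean
    return iterations, amplitudes[target] ** 2
```

For 64 items the simulation uses 6 iterations, compared with an expected 32 or so classical lookups, and the target is then measured with probability above 0.99.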
Parallel Programming with Intel Parallel Studio XE
Blair-Chappell, Stephen
2012-01-01
Optimize code for multi-core processors with Intel's Parallel Studio. Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the
PARALLEL SOLUTION METHODS OF PARTIAL DIFFERENTIAL EQUATIONS
Directory of Open Access Journals (Sweden)
Korhan KARABULUT
1998-03-01
Full Text Available Partial differential equations arise in almost all fields of science and engineering. The computer time spent solving partial differential equations is much greater than that spent on any other problem class. For this reason, partial differential equations are well suited to parallel computers, which offer great computational power. In this study, the parallel solution of partial differential equations with the Jacobi, Gauss-Seidel, SOR (Successive Over-Relaxation) and SSOR (Symmetric SOR) algorithms is studied.
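Of the methods listed, Jacobi iteration is the simplest to parallelize, because every new value depends only on values from the previous sweep. The sketch below applies it to a one-dimensional Laplace problem with fixed boundary values; the problem, the function name `jacobi_laplace_1d` and the tolerance are assumptions chosen for illustration and are not taken from the study.

```python
def jacobi_laplace_1d(n_interior, left, right, tol=1e-10, max_iters=100_000):
    """Solve u'' = 0 on a uniform grid with u(0) = left and u(1) = right.

    Returns (grid_values, iterations_used). Each sweep replaces every
    interior value by the average of its two neighbours from the
    previous sweep, so all updates within a sweep are independent.
    """
    u = [left] + [0.0] * n_interior + [right]
    for iteration in range(1, max_iters + 1):
        new_u = u[:]
        for i in range(1, n_interior + 1):
            new_u[i] = 0.5 * (u[i - 1] + u[i + 1])
        change = max(abs(a - b) for a, b in zip(new_u, u))
        u = new_u
        if change < tol:
            return u, iteration
    return u, max_iters
```

Gauss-Seidel and SOR reuse values updated earlier in the same sweep, which converges faster but introduces dependencies; parallel versions typically reorder the unknowns (for example with red-black colouring) to recover independent updates.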
HOPSPACK: Hybrid Optimization Parallel Search Package.
Energy Technology Data Exchange (ETDEWEB)
Gray, Genetha Anne; Kolda, Tamara G.; Griffin, Joshua; Taddy, Matt; Martinez-Canales, Monica L.
2008-12-01
In this paper, we describe the technical details of HOPSPACK (Hybrid Optimization Parallel Search Package), a new software platform which facilitates combining multiple optimization routines into a single, tightly-coupled, hybrid algorithm that supports parallel function evaluations. The framework is designed such that existing optimization source code can be easily incorporated with minimal code modification. By maintaining the integrity of each individual solver, the strengths and code sophistication of the original optimization package are retained and exploited.
Massively parallel evolutionary computation on GPGPUs
Tsutsui, Shigeyoshi
2013-01-01
Evolutionary algorithms (EAs) are metaheuristics that learn from natural collective behavior and are applied to solve optimization problems in domains such as scheduling, engineering, bioinformatics, and finance. Such applications demand acceptable solutions with high-speed execution using finite computational resources. Therefore, there have been many attempts to develop platforms for running parallel EAs using multicore machines, massively parallel cluster machines, or grid computing environments. Recent advances in general-purpose computing on graphics processing units (GPGPU) have opened u
On parallel Branch and Bound frameworks for Global Optimization
Herrera, Juan F.R.; Salmerón, José M.G.; Hendrix, Eligius M.T.; Asenjo, Rafael; Casado, Leocadio G.
2017-01-01
Branch and Bound (B&B) algorithms are known to exhibit an irregularity of the search tree. Therefore, developing a parallel approach for this kind of algorithms is a challenge. The efficiency of a B&B algorithm depends on the chosen Branching, Bounding, Selection, Rejection, and Termination
Frequent Pairs in Data Streams: Exploiting Parallelism and Skew
DEFF Research Database (Denmark)
Campagna, Andrea; Kutzkow, Konstantin; Pagh, Rasmus
2011-01-01
, confirmed for several real-world datasets. Additionally, the algorithm parallelizes easily, which opens up for real-time processing of large transactions. Unlike previous algorithms we make no assumptions on the order of arrival of transactions and pairs. Our algorithm builds upon approaches for frequent...
A Massively Parallel Face Recognition System
Directory of Open Access Journals (Sweden)
Ari Paasio
2006-12-01
Full Text Available We present methods for processing LBPs (local binary patterns) with massively parallel hardware, especially with the CNN-UM (cellular nonlinear network-universal machine). In particular, we present a framework for implementing a massively parallel face recognition system, including a dedicated highly accurate algorithm suitable for various types of platforms (e.g., CNN-UM and digital FPGA). We study in detail a dedicated mixed-mode implementation of the algorithm and estimate its implementation cost in view of its performance and accuracy restrictions.
A Massively Parallel Face Recognition System
Directory of Open Access Journals (Sweden)
Lahdenoja Olli
2007-01-01
Full Text Available We present methods for processing LBPs (local binary patterns) with massively parallel hardware, especially with the CNN-UM (cellular nonlinear network-universal machine). In particular, we present a framework for implementing a massively parallel face recognition system, including a dedicated highly accurate algorithm suitable for various types of platforms (e.g., CNN-UM and digital FPGA). We study in detail a dedicated mixed-mode implementation of the algorithm and estimate its implementation cost in view of its performance and accuracy restrictions.
Parallel RANSAC for Point Cloud Registration
Directory of Open Access Journals (Sweden)
Koguciuk Daniel
2017-09-01
Full Text Available In this paper, the design and implementation of a parallel RANSAC algorithm in the CUDA architecture for point cloud registration are presented. First, a serial state-of-the-art method is introduced, with several heuristic improvements from the literature over basic RANSAC. Subsequently, its algorithmic parallelization and CUDA implementation details are discussed. A comparative test has shown a significant acceleration of program execution. The result is the local coordinate system of the object in the scene, found in near real-time conditions. The source code is shared on the Internet as a part of the Heuros system.
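The structure of basic RANSAC, which is what makes it attractive for parallel hardware, can be shown on a much simpler model than point cloud registration. The sketch below fits a 2D line instead of a rigid transformation and runs serially in Python rather than in CUDA; the function name `ransac_line`, the threshold and the iteration count are assumptions for illustration and have nothing to do with the Heuros code.

```python
import random


def ransac_line(points, iterations=200, threshold=0.1, seed=0):
    """Fit y = slope * x + intercept to points containing outliers.

    Each iteration draws a minimal sample (two points), builds a candidate
    line, and counts the points whose vertical residual is within threshold.
    Returns ((slope, intercept), inliers) for the best candidate found.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical line cannot be written as y = slope * x + intercept
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [
            (x, y) for x, y in points
            if abs(y - (slope * x + intercept)) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers
```

Every hypothesis is generated and scored independently of the others, so the iterations can be distributed over many threads, with only the final selection of the best candidate requiring a reduction.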
Parallel sparse direct solver for integrated circuit simulation
Chen, Xiaoming; Yang, Huazhong
2017-01-01
This book describes algorithmic methods and parallelization techniques to design a parallel sparse direct solver which is specifically targeted at integrated circuit simulation problems. The authors describe a complete flow and detailed parallel algorithms of the sparse direct solver. They also show how to improve the performance by simple but effective numerical techniques. The sparse direct solver techniques described can be applied to any SPICE-like integrated circuit simulator and have been proven to be high-performance in actual circuit simulation. Readers will benefit from the state-of-the-art parallel integrated circuit simulation techniques described in this book, especially the latest parallel sparse matrix solution techniques. · Introduces complicated algorithms of sparse linear solvers, using concise principles and simple examples, without complex theory or lengthy derivations; · Describes a parallel sparse direct solver that can be adopted to accelerate any SPICE-like integrated circuit simulato...
Parallel programming with Easy Java Simulations
Esquembre, F.; Christian, W.; Belloni, M.
2018-01-01
Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we lower the barrier to parallel programming by using a Java-based programming environment to treat problems in the usual undergraduate curriculum. We use the Easy Java Simulations programming and authoring tool to create the program's graphical user interface together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.
Parallel distributed computing using Python
Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro
2011-09-01
This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated to PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.
Morse, H Stephen
1994-01-01
Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment. Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi
A parallel Fast Fourier transform
Morante, S; Salina, G
1999-01-01
In this paper we discuss the general problem of implementing the multidimensional Fast Fourier Transform algorithm on parallel computers. We show that, on a machine with P processors and fully parallel node communications, the optimal asymptotic scaling behavior of the total computational time with the number of data points, N, given in d dimensions by the formula $aN/P \log(N/P) + bN/P^{(d-1)/d}$, can actually be achieved on realistic platforms. As a concrete realization of our strategy, we have produced codes efficiently running on machines of the APE family and on Cray T3E. On the former, for asymptotic values of N, our codes attain the above optimal result. (16 refs).
Semi-coarsening multigrid methods for parallel computing
Energy Technology Data Exchange (ETDEWEB)
Jones, J.E.
1996-12-31
Standard multigrid methods are not well suited for problems with anisotropic coefficients which can occur, for example, on grids that are stretched to resolve a boundary layer. There are several different modifications of the standard multigrid algorithm that yield efficient methods for anisotropic problems. In the paper, we investigate the parallel performance of these multigrid algorithms. Multigrid algorithms which work well for anisotropic problems are based on line relaxation and/or semi-coarsening. In semi-coarsening multigrid algorithms a grid is coarsened in only one of the coordinate directions unlike standard or full-coarsening multigrid algorithms where a grid is coarsened in each of the coordinate directions. When both semi-coarsening and line relaxation are used, the resulting multigrid algorithm is robust and automatic in that it requires no knowledge of the nature of the anisotropy. This is the basic multigrid algorithm whose parallel performance we investigate in the paper. The algorithm is currently being implemented on an IBM SP2 and its performance is being analyzed. In addition to looking at the parallel performance of the basic semi-coarsening algorithm, we present algorithmic modifications with potentially better parallel efficiency. One modification reduces the amount of computational work done in relaxation at the expense of using multiple coarse grids. This modification is also being implemented with the aim of comparing its performance to that of the basic semi-coarsening algorithm.
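The difference between full coarsening and semi-coarsening described above can be illustrated with a minimal Python sketch. The function names, the coarsening factor of 2, and the 9 x 9 grid are illustrative assumptions and are not taken from the paper.

```python
def full_coarsen(nx, ny):
    """Full coarsening: keep every second point in both coordinate directions."""
    return [(i, j) for i in range(0, nx, 2) for j in range(0, ny, 2)]


def semi_coarsen(nx, ny, axis=0):
    """Semi-coarsening: keep every second point along one axis only."""
    step_x, step_y = (2, 1) if axis == 0 else (1, 2)
    return [(i, j) for i in range(0, nx, step_x) for j in range(0, ny, step_y)]


# Hypothetical 9 x 9 fine grid, coarsened both ways.
full = full_coarsen(9, 9)
semi = semi_coarsen(9, 9, axis=0)
print(len(full), len(semi))
```

Because semi-coarsening removes points in one direction only, its coarse grids stay larger than those of full coarsening, which is one reason the parallel cost of the two approaches differs.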
Faraj, Daniel A
2013-07-16
Algorithm selection for data communications in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Scalable Parallel Algebraic Multigrid Solvers
Energy Technology Data Exchange (ETDEWEB)
Bank, R; Lu, S; Tong, C; Vassilevski, P
2005-03-23
The authors propose a parallel algebraic multilevel algorithm (AMG), which has the novel feature that the subproblem residing in each processor is defined over the entire partition domain, although the vast majority of unknowns for each subproblem are associated with the partition owned by the corresponding processor. This feature ensures that a global coarse description of the problem is contained within each of the subproblems. The advantages of this approach are that interprocessor communication is minimized in the solution process while an optimal order of convergence rate is preserved; and the speed of local subproblem solvers can be maximized using the best existing sequential algebraic solvers.
Learning in Parallel Universes
Berthold, Michael R.; Wiswedel, Bernd
2007-01-01
This abstract summarizes a brief, preliminary formalization of learning in parallel universes. It also attempts to highlight a few neighboring learning paradigms to illustrate how parallel learning fits into the greater picture.
Parallel alternating direction preconditioner for isogeometric simulations of explicit dynamics
Łoś, Marcin
2015-04-27
In this paper we present a parallel implementation of the alternating direction preconditioner for isogeometric simulations of explicit dynamics. The Alternating Direction Implicit (ADI) algorithm, which belongs to the category of matrix-splitting iterative methods, was proposed almost six decades ago for solving parabolic and elliptic partial differential equations, see [1–4]. A new version of this algorithm has recently been developed for isogeometric simulations of two-dimensional explicit dynamics [5] and steady-state diffusion equations with orthotropic heterogeneous coefficients [6]. In this paper we present a parallel version of the alternating direction implicit algorithm for three-dimensional simulations. The algorithm has been incorporated as a part of PetIGA, an isogeometric framework [7] built on top of PETSc [8]. We show the scalability of the parallel algorithm on the STAMPEDE Linux cluster up to 10,000 processors, as well as the convergence rate of the PCG solver with the ADI algorithm as preconditioner.
Fox, Geoffrey C; Messina, Guiseppe C
2014-01-01
A clear illustration of how parallel computers can be successfully applied to large-scale scientific computations. This book demonstrates how a variety of applications in physics, biology, mathematics and other sciences were implemented on real parallel computers to produce new scientific results. It investigates issues of fine-grained parallelism relevant for future supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configure different massively parallel machines, design and implement basic system software, and develop
Energy Technology Data Exchange (ETDEWEB)
Fan, W.C.; Halbleib, J.A. Sr.
1996-09-01
This report provides a users' guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.
Fahnestock, Jeanne
2003-01-01
This study investigates the practice of presenting multiple supporting examples in parallel form. The elements of parallelism and its use in argument were first illustrated by Aristotle. Although real texts may depart from the ideal form for presenting multiple examples, rhetorical theory offers a rationale for minimal, parallel presentation. The…
Scheduling Parallel Jobs Using Migration and Consolidation in the Cloud
Directory of Open Access Journals (Sweden)
Xiaocheng Liu
2012-01-01
An increasing number of high performance computing parallel applications leverage the power of the cloud for parallel processing. How to schedule parallel applications to improve the quality of service is the key to successfully hosting them in the cloud. The large scale of the cloud makes parallel job scheduling more complicated, as even a simple parallel job scheduling problem is NP-complete. In this paper, we propose a parallel job scheduling algorithm named MEASY. MEASY adopts migration and consolidation to enhance the most popular EASY scheduling algorithm. Our extensive experiments on well-known workloads show that our algorithm takes very good care of the quality of service. For two common parallel job scheduling objectives, our algorithm produces an improvement of up to 41.1% (23.1% on average) in the average response time and of up to 82.9% (69.3% on average) in the average slowdown. Our algorithm is robust in that it tolerates inaccurate CPU usage estimation and high migration cost. Our approach involves only a trivial modification of EASY and requires no additional technique; it is practical and effective in the cloud environment.
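The two objectives named above can be made concrete with a short Python sketch. It assumes the conventional definitions (response time = waiting time + run time, slowdown = response time / run time); the paper may use a bounded variant. The `Job` fields and the sample workload are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    submit: float   # time the job entered the queue
    start: float    # time the scheduler started it
    runtime: float  # actual execution time


def response_time(job):
    """Waiting time plus run time (assumed definition)."""
    return (job.start - job.submit) + job.runtime


def slowdown(job):
    """Response time normalized by run time (assumed, unbounded variant)."""
    return response_time(job) / job.runtime


def averages(jobs):
    """Return (average response time, average slowdown) over a workload."""
    n = len(jobs)
    return (sum(response_time(j) for j in jobs) / n,
            sum(slowdown(j) for j in jobs) / n)


# Hypothetical three-job workload.
jobs = [Job(0, 0, 10), Job(1, 10, 5), Job(2, 15, 20)]
avg_response, avg_slowdown = averages(jobs)
print(avg_response, avg_slowdown)
```

Short jobs that wait a long time dominate the average slowdown, which is why a scheduler can improve that metric much more than the average response time.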
Parallel Computing Methods For Particle Accelerator Design
Popescu, Diana Andreea; Hersch, Roger
We present methods for parallelizing the transport map construction for multi-core processors and for Graphics Processing Units (GPUs). We provide an efficient implementation of the transport map construction. We describe a method for multi-core processors using the OpenMP framework which brings performance improvement over the serial version of the map construction. We developed a novel and efficient algorithm for multivariate polynomial multiplication for GPUs and we implemented it using the CUDA framework. We show the benefits of using the multivariate polynomial multiplication algorithm for GPUs in the map composition operation for high orders. Finally, we present an algorithm for map composition for GPUs.
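As a point of reference for the multivariate polynomial multiplication mentioned above, here is a minimal serial sketch in Python. It is not the GPU algorithm from the paper: the dictionary representation (exponent tuple to coefficient) and the optional truncation at a maximum total order are assumptions made for illustration.

```python
from collections import defaultdict


def poly_mul(p, q, max_order=None):
    """Multiply two multivariate polynomials stored as {exponents: coefficient}.

    Terms whose total degree exceeds max_order are dropped (assumed truncation).
    """
    result = defaultdict(float)
    for exp_p, coef_p in p.items():
        for exp_q, coef_q in q.items():
            exp = tuple(a + b for a, b in zip(exp_p, exp_q))
            if max_order is not None and sum(exp) > max_order:
                continue
            result[exp] += coef_p * coef_q
    return dict(result)


# (1 + x) * (1 + y) = 1 + x + y + x*y
p = {(0, 0): 1.0, (1, 0): 1.0}
q = {(0, 0): 1.0, (0, 1): 1.0}
product = poly_mul(p, q)
truncated = poly_mul(p, q, max_order=1)
print(product, truncated)
```

Every pair of input terms contributes independently to one output term, which is the kind of structure a data-parallel implementation can exploit.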
Asynchronous parallel search in global optimization problems
Energy Technology Data Exchange (ETDEWEB)
Archetti, F.; Schoen, F.
1982-01-01
A class of asynchronous parallel search methods is proposed in order to solve the global optimization problem on a multiprocessor system, consisting of several processors which can communicate through a set of global variables contained in a memory shared by all processors. The speed-up ratio and memory contention effects are experimentally analyzed for some algorithms of this class. 6 references.
Parallel Monte Carlo simulation of aerosol dynamics
Zhou, K.
2014-01-01
A highly efficient Monte Carlo (MC) algorithm is developed for the numerical simulation of aerosol dynamics, that is, nucleation, surface growth, and coagulation. Nucleation and surface growth are handled with deterministic means, while coagulation is simulated with a stochastic method (Marcus-Lushnikov stochastic process). Operator splitting techniques are used to synthesize the deterministic and stochastic parts in the algorithm. The algorithm is parallelized using the Message Passing Interface (MPI). The parallel computing efficiency is investigated through numerical examples. Nearly 60% parallel efficiency is achieved for the largest test case, with 3.7 million MC particles running on 93 parallel computing nodes. The algorithm is verified by simulating various test cases and comparing the simulation results with available analytical and/or other numerical solutions. Generally, it is found that only a small number (hundreds or thousands) of MC particles is necessary to accurately predict the aerosol particle number density, volume fraction, and so forth, that is, low order moments of the Particle Size Distribution (PSD) function. Accurately predicting the high order moments of the PSD requires dramatically increasing the number of MC particles.
Heuristic framework for parallel sorting computations | Nwanze ...
African Journals Online (AJOL)
The decreasing cost of these processors will probably, in the future, make the solutions derived from them more appealing. Efficient algorithms for sorting schemes that are encountered in a number of operations are considered for multi-user machines. A heuristic framework for exploiting parallelism inherent in ...
Lock-free parallel garbage collection
H. Gao; J.F. Groote (Jan Friso); W.H. Hesselink (Wim)
2005-01-01
This paper presents a lock-free parallel algorithm for mark&sweep garbage collection (GC) in a realistic model using synchronization primitives compare-and-swap (CAS) and load-linked/store-conditional (LL/SC) offered by machine architectures. Mutators and collectors can simultaneously
Lock-free parallel garbage collection
Gao, H.; Groote, J.F.; Hesselink, W.H.; Pan, Y; Chen, D; Guo, M; Cao, JN; Dongarra, J
2005-01-01
This paper presents a lock-free parallel algorithm for garbage collection in a realistic model using synchronization primitives offered by machine architectures. Mutators and collectors can simultaneously operate on the data structure. In particular no strict alternation between usage and cleaning
Experience with a clustered parallel reduction machine
Beemster, M.; Hartel, Pieter H.; Hertzberger, L.O.; Hofman, R.F.H.; Langendoen, K.G.; Li, L.L.; Milikowski, R.; Vree, W.G.; Barendregt, H.P.; Mulder, J.C.
A clustered architecture has been designed to exploit divide and conquer parallelism in functional programs. The programming methodology developed for the machine is based on explicit annotations and program transformations. It has been successfully applied to a number of algorithms resulting in a
DEFF Research Database (Denmark)
Sitchinava, Nodar; Zeh, Norbert
2012-01-01
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries...
Parallel Monte Carlo Search for Hough Transform
Lopes, Raul H. C.; Franqueira, Virginia N. L.; Reid, Ivan D.; Hobson, Peter R.
2017-10-01
We investigate the problem of line detection in digital image processing and, in particular, how state-of-the-art algorithms behave in the presence of noise and whether CPU efficiency can be improved by the combination of a Monte Carlo Tree Search, hierarchical space decomposition, and parallel computing. The starting point of the investigation is the method introduced in 1962 by Paul Hough for detecting lines in binary images. Extended in the 1970s to the detection of space forms, what came to be known as the Hough Transform (HT) has been proposed, for example, in the context of track fitting in the LHC ATLAS and CMS projects. The Hough Transform transfers the problem of line detection, for example, into one of optimization of the peak in a vote counting process for cells which contain the possible points of candidate lines. The detection algorithm can be computationally expensive both in the demands made upon the processor and on memory. Additionally, it can have a reduced effectiveness in detection in the presence of noise. Our first contribution consists of an evaluation of the use of a variation of the Radon Transform as a way of improving the effectiveness of line detection in the presence of noise. Then, parallel algorithms for variations of the Hough Transform and the Radon Transform for line detection are introduced. An algorithm for Parallel Monte Carlo Search applied to line detection is also introduced. Their algorithmic complexities are discussed. Finally, implementations on multi-GPU and multicore architectures are discussed.
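The vote counting process described above can be sketched in a few lines of Python. This is the textbook serial Hough Transform in the (rho, theta) line parameterization, not the Monte Carlo or parallel variants the paper introduces; the grid resolution, `rho_max`, and the sample points are illustrative assumptions.

```python
import math


def hough_lines(points, n_theta=180, rho_step=1.0, rho_max=100.0):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta)."""
    n_rho = int(2 * rho_max / rho_step) + 1
    acc = [[0] * n_theta for _ in range(n_rho)]
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int(round((rho + rho_max) / rho_step))
            if 0 <= r < n_rho:
                acc[r][t] += 1  # one vote per (point, angle) pair
    return acc


def best_line(acc, n_theta=180, rho_step=1.0, rho_max=100.0):
    """Return (votes, rho, theta) of the accumulator cell with the most votes."""
    votes, r, t = max((acc[r][t], r, t)
                      for r in range(len(acc)) for t in range(n_theta))
    return votes, r * rho_step - rho_max, math.pi * t / n_theta


# Twenty points on the horizontal line y = 5.
points = [(x, 5) for x in range(0, 100, 5)]
acc = hough_lines(points)
votes, rho, theta = best_line(acc)
print(votes, rho, theta)
```

The cost is one accumulator update per point and per angle, and the accumulator itself grows with the parameter resolution, which illustrates the processor and memory demands mentioned in the abstract.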
Scaling up machine learning: parallel and distributed approaches
National Research Council Canada - National Science Library
Bekkerman, Ron; Bilenko, Mikhail; Langford, John
2012-01-01
.... Demand for parallelizing learning algorithms is highly task-specific: in some settings it is driven by the enormous dataset sizes, in others by model complexity or by real-time performance requirements...
Parallel Adaptive Mesh Refinement
Energy Technology Data Exchange (ETDEWEB)
Diachin, L; Hornung, R; Plassmann, P; WIssink, A
2005-03-04
As large-scale, parallel computers have become more widely available and numerical models and algorithms have advanced, the range of physical phenomena that can be simulated has expanded dramatically. Many important science and engineering problems exhibit solutions with localized behavior where highly-detailed salient features or large gradients appear in certain regions which are separated by much larger regions where the solution is smooth. Examples include chemically-reacting flows with radiative heat transfer, high Reynolds number flows interacting with solid objects, and combustion problems where the flame front is essentially a two-dimensional sheet occupying a small part of a three-dimensional domain. Modeling such problems numerically requires approximating the governing partial differential equations on a discrete domain, or grid. Grid spacing is an important factor in determining the accuracy and cost of a computation. A fine grid may be needed to resolve key local features while a much coarser grid may suffice elsewhere. Employing a fine grid everywhere may be inefficient at best and, at worst, may make an adequately resolved simulation impractical. Moreover, the location and resolution of the fine grid required for an accurate solution is a dynamic property of a problem's transient features and may not be known a priori. Adaptive mesh refinement (AMR) is a technique that can be used with both structured and unstructured meshes to adjust local grid spacing dynamically to capture solution features with an appropriate degree of resolution. Thus, computational resources can be focused where and when they are needed most to efficiently achieve an accurate solution without incurring the cost of a globally-fine grid. Figure 1.1 shows two example computations using AMR; on the left is a structured mesh calculation of an impulsively-sheared contact surface and on the right is the fuselage and volume discretization of an RAH-66 Comanche helicopter [35]. Note the
Kleinberg, Jon
2006-01-01
Algorithm Design introduces algorithms by looking at the real-world problems that motivate them. The book teaches students a range of design and analysis techniques for problems that arise in computing applications. The text encourages an understanding of the algorithm design process and an appreciation of the role of algorithms in the broader field of computer science.
Parallel circuit simulation on supercomputers
Energy Technology Data Exchange (ETDEWEB)
Saleh, R.A.; Gallivan, K.A. (Illinois Univ., Urbana, IL (USA). Center for Supercomputing Research and Development); Chang, M.C. (Texas Instruments, Inc., Dallas, TX (USA)); Hajj, I.N.; Trick, T.N. (Illinois Univ., Urbana, IL (USA). Coordinated Science Lab.); Smart, D. (Semiconductor Div., Analog Devices, Wilmington, MA (US))
1989-12-01
Circuit simulation is a very time-consuming and numerically intensive application, especially when the problem size is large as in the case of VLSI circuits. To improve the performance of circuit simulators without sacrificing accuracy, a variety of parallel processing algorithms have been investigated due to the recent availability of a number of commercial multiprocessor machines. In this paper, research in the field of parallel circuit simulation is surveyed and the ongoing research in this area at the University of Illinois is described. Both standard and relaxation-based approaches are considered. In particular, the forms of parallelism available within the direct method approach, used in programs such as SPICE2 and SLATE, and within the relaxation-based approaches, such as waveform relaxation, iterated timing analysis, and waveform-relaxation-Newton, are described. The specific implementation issues addressed here are primarily related to general-purpose multiprocessors with a shared-memory architecture having a limited number of processors, although many of the comments also apply to a number of other architectures.
Algorithms and programming tools for image processing on the MPP, introduction. Thesis
1985-01-01
The programming tools and parallel algorithms created for the Massively Parallel Processor (MPP) located at the NASA Goddard Space Center are discussed. A user-friendly environment for high level language parallel algorithm development was developed. The issues involved in implementing certain algorithms on the MPP were researched. The expected results were compared with the actual results.
A Multi-Gigabit Parallel Demodulator and Its FPGA Implementation
Lin, Changxing; Zhang, Jian; Shao, Beibei
This letter presents the architecture of a multi-gigabit parallel demodulator suitable for demodulating high-order QAM signals and easy to implement on an FPGA platform. The parallel architecture is based on a frequency-domain implementation of the matched filter and timing phase correction. A parallel FIFO-based delete-keep algorithm is proposed for timing synchronization, while a parallel decision-feedback PLL based on a reduced-constellation phase-frequency detector is designed for carrier synchronization. A fully pipelined parallel adaptive blind equalization algorithm is also proposed. Their parallel implementation structures suitable for the FPGA platform are investigated. In a demonstration of a 2 Gbps demodulator for 16QAM modulation, the architecture is implemented and validated on a Xilinx V6 FPGA platform with a performance loss of less than 2 dB.
Cluster algorithms and computational complexity
Li, Xuenan
Cluster algorithms for the 2D Ising model with a staggered field have been studied and a new cluster algorithm for path sampling has been worked out. The complexity properties of the Bak-Sneppen model and the Growing Network model have been studied using computational complexity theory. The dynamic critical behavior of the two-replica cluster algorithm is studied. Several versions of the algorithm are applied to the two-dimensional, square lattice Ising model with a staggered field. The dynamic exponent for the full algorithm is found to be less than 0.5. It is found that odd translations of one replica with respect to the other together with global flips are essential for obtaining a small value of the dynamic exponent. The path sampling problem for the 1D Ising model is studied using both a local algorithm and a novel cluster algorithm. The local algorithm is extremely inefficient at low temperature, where the integrated autocorrelation time is found to be proportional to the fourth power of the correlation length. The dynamic exponent of the cluster algorithm is found to be zero, so it is much more efficient than the local algorithm. The parallel computational complexity of the Bak-Sneppen evolution model is studied. It is shown that Bak-Sneppen histories can be generated by a massively parallel computer in a time that is polylog in the length of the history, which means that the logical depth of producing a Bak-Sneppen history is exponentially less than the length of the history. The parallel dynamics for generating Bak-Sneppen histories is contrasted with standard Bak-Sneppen dynamics. The parallel computational complexity of the Growing Network model is studied. The growth of the network with linear kernels is shown to be not complex and an algorithm with polylog parallel running time is found. The growth of the network with gamma ≥ 2 super-linear kernels can be realized by a randomized parallel algorithm with polylog expected running time.
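For readers unfamiliar with the standard Bak-Sneppen dynamics that the parallel dynamics is contrasted with, a minimal sequential sketch in Python follows. It implements the usual rule (replace the least-fit site and its two ring neighbors with fresh random values) and records a history of selected sites; the function name, parameters, and seed handling are assumptions, and the polylog-time parallel algorithm of the thesis is not shown.

```python
import random


def bak_sneppen(n_sites, n_steps, seed=0):
    """Run standard sequential Bak-Sneppen dynamics on a ring of n_sites."""
    rng = random.Random(seed)
    fitness = [rng.random() for _ in range(n_sites)]
    history = []
    for _ in range(n_steps):
        # Select the site with the minimum fitness.
        i = min(range(n_sites), key=fitness.__getitem__)
        history.append((i, fitness[i]))
        # Replace it and both ring neighbors with new random fitness values.
        for j in (i - 1, i, (i + 1) % n_sites):
            fitness[j] = rng.random()
    return fitness, history


fitness, history = bak_sneppen(n_sites=50, n_steps=200, seed=1)
print(len(history), min(fitness))
```

Each step depends on the state left by the previous one, so the sequential run takes time linear in the history length; this is the baseline against which a polylog-depth parallel construction is notable.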
Massively Parallel Computing: A Sandia Perspective
Energy Technology Data Exchange (ETDEWEB)
Dosanjh, Sudip S.; Greenberg, David S.; Hendrickson, Bruce; Heroux, Michael A.; Plimpton, Steve J.; Tomkins, James L.; Womble, David E.
1999-05-06
The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is beginning to realize its potential for enabling significant breakthroughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.
Binary image segmentation based on optimized parallel K-means
Qiu, Xiao-bing; Zhou, Yong; Lin, Li
2015-07-01
K-means is a classic unsupervised clustering algorithm. In theory, it can work well for image segmentation. Compared with other segmentation algorithms, however, it needs much more computation, and segmentation is slow, which limits its application. With the emergence of general-purpose computing on the GPU and the release of CUDA, some researchers have implemented the K-means algorithm in parallel on the GPU and applied it to image segmentation. They have achieved some results, but their approach is not completely parallel and does not take full advantage of the GPU's computing power. The K-means algorithm has two core steps, label and update; in current parallel implementations of K-means, only the labeling is parallel, while the update operation is still serial. In this paper, both steps are parallelized to increase the degree of parallelism and accelerate the algorithm. Experimental results show that this improvement is much faster than the previous work.
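The two core steps named above, label and update, can be written out as a minimal serial Python sketch. This is plain K-means, not the GPU implementation from the paper; the function names, the toy pixel intensities, and the initial centroids are illustrative assumptions.

```python
def label(points, centroids):
    """Label step: assign each point to its nearest centroid.

    Each point is handled independently of the others.
    """
    def nearest(p):
        return min(range(len(centroids)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
    return [nearest(p) for p in points]


def update(points, labels, centroids):
    """Update step: recompute each centroid as the mean of its assigned points."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, k in zip(points, labels):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    # Keep the old centroid if a cluster received no points.
    return [[s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
            for k in range(len(centroids))]


def kmeans(points, centroids, n_iter=10):
    """Alternate label and update for a fixed number of iterations."""
    for _ in range(n_iter):
        labels = label(points, centroids)
        centroids = update(points, labels, centroids)
    return labels, centroids


# Binary segmentation of a toy grayscale "image" (intensities as 1-D points).
pixels = [[10.0], [12.0], [11.0], [200.0], [210.0], [205.0]]
labels, centroids = kmeans(pixels, [[0.0], [255.0]])
print(labels, centroids)
```

The label step is independent per point, while the update step is a per-cluster reduction (sums and counts), which is why it is the harder of the two to parallelize.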
Parallel object-oriented term rewriting : the booleans
Rodenburg, P.H.; Vrancken, J.L.M.
As a first case study in parallel object-oriented term rewriting, we give two implementations of term rewriting algorithms for boolean terms, using the parallel object-oriented features of the language Pool-T. The term rewriting systems are specified in the specification formalism
Minimisation of total tardiness for identical parallel machine ...
Indian Academy of Sciences (India)
In recent years research on parallel machine scheduling has received an increased attention. This paper considers minimisation of total tardiness for scheduling of n jobs on a set of m parallel machines. A spread-sheet-based genetic algorithm (GA) approach is proposed for the problem. The proposed approach is a ...
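The objective above, total tardiness on identical parallel machines, can be evaluated with a short Python sketch. It assumes the standard definition (tardiness of a job = max(0, completion time - due date)) and a schedule given as one ordered job list per machine; the job data are hypothetical, and the genetic algorithm itself is not shown.

```python
def total_tardiness(schedule, proc_time, due_date):
    """Sum of max(0, completion - due date) over all jobs.

    schedule: one list of job ids per machine, in processing order.
    """
    total = 0
    for machine_jobs in schedule:
        clock = 0  # each machine starts at time zero
        for job in machine_jobs:
            clock += proc_time[job]
            total += max(0, clock - due_date[job])
    return total


# Hypothetical instance: n = 4 jobs on m = 2 machines.
proc_time = {"A": 3, "B": 2, "C": 4, "D": 1}
due_date = {"A": 3, "B": 4, "C": 5, "D": 2}
schedule = [["A", "C"], ["D", "B"]]
tardiness = total_tardiness(schedule, proc_time, due_date)
print(tardiness)
```

A function of this kind could serve as the fitness evaluation in a genetic algorithm, where each candidate solution encodes an assignment and ordering of jobs.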
Restricted Parallelism in Object-Oriented Lexical Parsing
Neuhaus, P; Neuhaus, Peter; Hahn, Udo
1996-01-01
We present an approach to parallel natural language parsing which is based on a concurrent, object-oriented model of computation. A depth-first, yet incomplete parsing algorithm for a dependency grammar is specified and several restrictions on the degree of its parallelization are discussed.
Match and Move, an Approach to Data Parallel Computing
1992-10-01
Machines Incorporated, 1990. [DNS81] Eliezer Dekel, David Nassimi, and Sartaj Sahni. Parallel matrix and graph algorithms. SIAM J. Comput., 10(4):657...series in artificial intelligence. MIT Press, Cambridge, Mass., 1985. [HJ88] R. W. Hockney and C. R. Jesshope. Parallel Computers 2: Architecture
CERN. Geneva
2016-01-01
The traditionally used and well established parallel programming models OpenMP and MPI are both targeting lower level parallelism and are meant to be as language agnostic as possible. For a long time, those models were the only widely available portable options for developing parallel C++ applications beyond using plain threads. This has strongly limited the optimization capabilities of compilers, has inhibited extensibility and genericity, and has restricted the use of those models together with other, modern higher level abstractions introduced by the C++11 and C++14 standards. The recent revival of interest in the industry and wider community for the C++ language has also spurred a remarkable amount of standardization proposals and technical specifications being developed. Those efforts however have so far failed to build a vision on how to seamlessly integrate various types of parallelism, such as iterative parallel execution, task-based parallelism, asynchronous many-task execution flows, continuation s...
Parallel phase model : a programming model for high-end parallel machines with manycores.
Energy Technology Data Exchange (ETDEWEB)
Wu, Junfeng (Syracuse University, Syracuse, NY); Wen, Zhaofang; Heroux, Michael Allen; Brightwell, Ronald Brian
2009-04-01
This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementation of parallel algorithms to exploit both the parallelism of the many cores and the parallelism at the cluster level. The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism. It includes a few high-level parallel programming language constructs that can be added as an extension to an existing (sequential or parallel) programming language such as C; and the implementation of PPM also includes a light-weight runtime library that runs on top of an existing network communication software layer (e.g. MPI). Design philosophy of PPM and details of the programming abstraction are also presented. Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.
An implementation of ray tracing algorithm for the multiprocessor machines
Directory of Open Access Journals (Sweden)
Samardžić Aleksandar B.
2006-01-01
Ray tracing is an algorithm for generating photo-realistic pictures of 3D scenes, given a scene description, lighting conditions, and viewing parameters as inputs. The algorithm is inherently well suited to parallelization, and the simplest parallelization scheme is for shared-memory parallel machines (multiprocessors). This paper presents two implementations of the algorithm developed by the authors for such machines, one using the POSIX threads API and another using the OpenMP API. The paper also presents results of rendering some test scenes using these implementations and discusses the efficiency of our parallel version of the algorithm.
New Combustion CFD Algorithms Designed for Rapid GPU Computations Project
National Aeronautics and Space Administration — We propose development of new algorithms specifically designed to exploit the highly parallel structure of graphics processing units (GPUs) for performing the...
Kalman Filter Tracking on Parallel Architectures
Directory of Open Access Journals (Sweden)
Cerati Giuseppe
2016-01-01
Power density constraints are limiting the performance improvements of modern CPUs. To address this, we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
Joux, Antoine
2009-01-01
Illustrating the power of algorithms, Algorithmic Cryptanalysis describes algorithmic methods with cryptographically relevant examples. Focusing on both private- and public-key cryptographic algorithms, it presents each algorithm either as a textual description, in pseudo-code, or in a C code program. Divided into three parts, the book begins with a short introduction to cryptography and a background chapter on elementary number theory and algebra. It then moves on to algorithms, with each chapter in this section dedicated to a single topic and often illustrated with simple cryptographic applic
Parallel discrete event simulation
Overeinder, B.J.; Hertzberger, L.O.; Sloot, P.M.A.; Withagen, W.J.
1991-01-01
In simulating applications for execution on specific computing systems, the simulation performance figures must be known in a short period of time. One basic approach to the problem of reducing the required simulation time is the exploitation of parallelism. However, in parallelizing the simulation
Patterns For Parallel Programming
Mattson, Timothy G; Massingill, Berna L
2005-01-01
From grids and clusters to next-generation game consoles, parallel computing is going mainstream. Innovations such as Hyper-Threading Technology, HyperTransport Technology, and multicore microprocessors from IBM, Intel, and Sun are accelerating the movement's growth. Only one thing is missing: programmers with the skills to meet the soaring demand for parallel software.
CALTRANS: A parallel, deterministic, 3D neutronics code
Energy Technology Data Exchange (ETDEWEB)
Carson, L.; Ferguson, J.; Rogers, J.
1994-04-01
Our efforts to parallelize the deterministic solution of the neutron transport equation have culminated in a new neutronics code, CALTRANS, which has full 3D capability. In this article, we describe the layout and algorithms of CALTRANS and present performance measurements of the code on a variety of platforms. Explicit implementations of the parallel algorithms of CALTRANS using both the function calls of the Parallel Virtual Machine software package (PVM 3.2) and the Meiko CS-2 tagged message passing library (based on the Intel NX/2 interface) are provided in appendices.
Iterative Schemes for Time Parallelization with Application to Reservoir Simulation
Energy Technology Data Exchange (ETDEWEB)
Garrido, I; Fladmark, G E; Espedal, M S; Lee, B
2005-04-18
Parallel methods are usually not applied to the time domain because of the inherent sequentialness of time evolution. But for many evolutionary problems, computer simulation can benefit substantially from time parallelization methods. In this paper, they present several such algorithms that actually exploit the sequential nature of time evolution through a predictor-corrector procedure. This sequentialness ensures convergence of a parallel predictor-corrector scheme within a fixed number of iterations. The performance of these novel algorithms, which are derived from the classical alternating Schwarz method, is illustrated through several numerical examples using the reservoir simulator Athena.
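To make the idea of a time-parallel predictor-corrector iteration concrete, here is a minimal sketch of parareal, a different and widely known time-parallel scheme, swapped in because the abstract does not give the details of the Schwarz-derived algorithms. The test equation dy/dt = lam*y, the explicit Euler propagators and all names are assumptions for illustration. It shows the property the abstract alludes to: with N time slices, the iteration reproduces the sequential fine solution after at most N corrections.

```python
def euler(y, lam, length, steps):
    """Advance dy/dt = lam * y over an interval of the given length."""
    h = length / steps
    for _ in range(steps):
        y = y + h * lam * y
    return y


def sequential_fine(y0, lam, t_end, n_slices, fine_steps=50):
    """Reference solution: the fine propagator applied slice after slice."""
    dt = t_end / n_slices
    out = [y0]
    for _ in range(n_slices):
        out.append(euler(out[-1], lam, dt, fine_steps))
    return out


def parareal(y0, lam, t_end, n_slices, n_iter, fine_steps=50):
    """Parareal iteration with a one-step coarse and a many-step fine solver."""
    dt = t_end / n_slices

    def coarse(y):
        return euler(y, lam, dt, 1)

    def fine(y):
        return euler(y, lam, dt, fine_steps)

    # Predictor: a cheap sequential sweep with the coarse propagator.
    u = [y0]
    for n in range(n_slices):
        u.append(coarse(u[n]))

    for _ in range(n_iter):
        # These fine solves are independent per slice, hence parallelizable.
        f = [fine(u[n]) for n in range(n_slices)]
        g_old = [coarse(u[n]) for n in range(n_slices)]
        # Corrector: a sequential sweep that only uses the cheap coarse solver.
        new = [y0]
        for n in range(n_slices):
            new.append(f[n] + (coarse(new[n]) - g_old[n]))
        u = new
    return u


if __name__ == "__main__":
    ref = sequential_fine(1.0, -1.0, 2.0, 8)
    for k in (0, 2, 8):
        approx = parareal(1.0, -1.0, 2.0, 8, k)
        print(k, max(abs(a - b) for a, b in zip(approx, ref)))
```

In practice the scheme is only worthwhile when far fewer than N iterations give acceptable accuracy, since N iterations cost more than the plain sequential solve.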
CERN. Geneva; PUNZI, Giovanni
2015-01-01
Charged-particle reconstruction is one of the most demanding computational tasks found in HEP, and it becomes increasingly important to perform it in real time. We envision that HEP would greatly benefit from achieving a long-term goal of making track reconstruction happen transparently as part of the detector readout ("detector-embedded tracking"). We describe here a track-reconstruction approach based on a massively parallel pattern-recognition algorithm, inspired by studies of the processing of visual images by the brain as it happens in nature ("RETINA algorithm"). It turns out that high-quality tracking in large HEP detectors is possible with very small latencies when this algorithm is implemented in specialized processors based on current state-of-the-art, high-speed/high-bandwidth digital devices.
Hougardy, Stefan
2016-01-01
Algorithms play an increasingly important role in nearly all fields of mathematics. This book allows readers to develop basic mathematical abilities, in particular those concerning the design and analysis of algorithms as well as their implementation. It presents not only fundamental algorithms like the sieve of Eratosthenes, the Euclidean algorithm, sorting algorithms, algorithms on graphs, and Gaussian elimination, but also discusses elementary data structures, basic graph theory, and numerical questions. In addition, it provides an introduction to programming and demonstrates in detail how to implement algorithms in C++. This textbook is suitable for students who are new to the subject and covers a basic mathematical lecture course, complementing traditional courses on analysis and linear algebra. Both authors have given this "Algorithmic Mathematics" course at the University of Bonn several times in recent years.
Parallel programming practical aspects, models and current limitations
Tarkov, Mikhail S
2014-01-01
Parallel programming is designed for the use of parallel computer systems for solving time-consuming problems that cannot be solved on a sequential computer in a reasonable time. These problems can be divided into two classes: (1) processing large data arrays (including processing images and signals in real time) and (2) simulation of complex physical processes and chemical reactions. For each of these classes, prospective methods are designed for solving problems. For data processing, one of the most promising technologies is the use of artificial neural networks. The particle-in-cell method and cellular automata are very useful for simulation. Problems of scalability of parallel algorithms and the transfer of existing parallel programs to future parallel computers are very acute now. An important task is to optimize the use of the equipment (including the CPU cache) of parallel computers. Along with parallelizing information processing, it is essential to ensure the processing reliability by the relevant organization ...
Implementation of a Parallel Protein Structure Alignment Service on Cloud
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. PMID:23671842
Implementation of a Parallel Protein Structure Alignment Service on Cloud
Directory of Open Access Journals (Sweden)
Che-Lun Hung
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform.
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains
Directory of Open Access Journals (Sweden)
Ashwani Jha
2016-01-01
Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors.
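The "align against every circular permutation of itself" idea can be shown with a deliberately simplified toy. This is not SymD: it compares letters of a plain string position by position instead of performing a structural alignment, and the function names are hypothetical. What it does illustrate is that every shift is scored independently of the others, which is why the search parallelizes well across processors.

```python
def self_match_score(seq, shift):
    """Count positions where seq agrees with its copy rotated by `shift`."""
    n = len(seq)
    return sum(1 for i in range(n) if seq[i] == seq[(i + shift) % n])


def best_circular_shift(seq):
    """Score every non-trivial rotation and return (best shift, best score).

    Assumes len(seq) >= 2. Each rotation is an independent task, so the
    dictionary below could be filled by separate workers.
    """
    n = len(seq)
    scores = {k: self_match_score(seq, k) for k in range(1, n)}
    best = max(scores, key=scores.get)   # first maximum wins on ties
    return best, scores[best]


if __name__ == "__main__":
    print(best_circular_shift("ABCABCABC"))
```

For the figure quoted in the abstract, a speedup of 63 on 100 processors corresponds to a parallel efficiency of 63/100 = 0.63.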
Tel, G.
We define the notion of total algorithms for networks of processes. A total algorithm enforces that a "decision" is taken by a subset of the processes, and that participation of all processes is required to reach this decision. Total algorithms are an important building block in the design of
Energy Technology Data Exchange (ETDEWEB)
Dubois, J.
2011-10-13
In science, simulation is a key process for research or validation. Modern computer technology allows faster numerical experiments, which are cheaper than real models. In the field of neutron simulation, the calculation of eigenvalues is one of the key challenges. The complexity of these problems is such that a lot of computing power may be necessary. The work of this thesis first concerns the evaluation of new computing hardware, such as graphics cards or massively multi-core chips, and their application to eigenvalue problems for neutron simulation. Then, in order to address the massive parallelism of national supercomputers, we also study the use of asynchronous hybrid methods for solving eigenvalue problems with this very high level of parallelism. We then test the results of this research on several national supercomputers such as the Titane hybrid machine of the Computing Center, Research and Technology (CCRT), the Curie machine of the Very Large Computing Centre (TGCC), currently being installed, and the Hopper machine at the Lawrence Berkeley National Laboratory (LBNL). We also run our experiments on local workstations to illustrate the value of this research in everyday use with local computing resources. (author)
Efficient Partitioning of Algorithms for Long Convolutions and their Mapping onto Architectures
Bierens, L.; Deprettere, E.
1998-01-01
We present an efficient approach for the partitioning of algorithms implementing long convolutions. The dependence graph (DG) of a convolution algorithm is locally sequential globally parallel (LSGP) partitioned into smaller, less complex convolution algorithms. The LSGP partitioned DG is mapped
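As a rough illustration of partitioning a long convolution into smaller ones, the sketch below splits the input into blocks, convolves each block with the kernel, and adds the overlapping tails (overlap-add). This is an assumed, generic partitioning and not the dependence-graph procedure of the paper, whose abstract is truncated here; the function names are hypothetical. In LSGP terms, each block could be handled sequentially by one processor while different blocks run in parallel.

```python
def convolve(x, h):
    """Direct linear convolution of two sequences."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def partitioned_convolve(x, h, block):
    """Convolve x with h as a sum of independent block-sized convolutions."""
    y = [0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        # Each small convolution depends only on its own block of x.
        part = convolve(x[start:start + block], h)
        # Overlap-add: partial results are shifted to the block's position.
        for k, value in enumerate(part):
            y[start + k] += value
    return y


if __name__ == "__main__":
    signal = list(range(1, 11))
    kernel = [1, -2, 3]
    print(partitioned_convolve(signal, kernel, 4) == convolve(signal, kernel))
```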
A privacy-preserving parallel and homomorphic encryption scheme
Min, Zhaoe; Yang, Geng; Shi, Jingqi
2017-04-01
In order to protect data privacy whilst allowing efficient access to data in multi-node cloud environments, a parallel homomorphic encryption (PHE) scheme is proposed based on the additive homomorphism of the Paillier encryption algorithm. In this paper we propose a PHE algorithm, in which plaintext is divided into several blocks and the blocks are encrypted in parallel. Experimental results demonstrate that the encryption algorithm can reach a speed-up ratio of about 7.1 in the MapReduce environment with 16 cores and 4 nodes.
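The two ingredients named in this abstract, Paillier's additive homomorphism and independent encryption of plaintext blocks, can be sketched as follows. This is a toy under stated assumptions: the primes are far too small for any real use, the generator g = n + 1 is one common textbook choice, the names are hypothetical, and a thread pool stands in for the MapReduce environment of the paper. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

P, Q = 1009, 1013                  # toy primes, insecure by design
N = P * Q
N2 = N * N
G = N + 1                          # common textbook choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(p - 1, q - 1)


def _l(u):
    """Paillier's L function: L(u) = (u - 1) / n."""
    return (u - 1) // N


MU = pow(_l(pow(G, LAM, N2)), -1, N)   # modular inverse (Python 3.8+)


def encrypt(m):
    """Encrypt one block 0 <= m < N with fresh randomness r coprime to N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    """Recover the plaintext block from ciphertext c."""
    return (_l(pow(c, LAM, N2)) * MU) % N


def encrypt_blocks(blocks, workers=4):
    """Encrypt blocks independently; each block is a separate task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encrypt, blocks))


if __name__ == "__main__":
    blocks = [12, 345, 6789]
    cts = encrypt_blocks(blocks)
    print([decrypt(c) for c in cts])
    # Multiplying ciphertexts adds the underlying plaintexts modulo N.
    print(decrypt(cts[0] * cts[1] % N2))
```

Because every block uses its own randomness and shares only the public key, the blocks can be distributed over workers without coordination, which is what makes the block-parallel mode possible.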
A privacy-preserving parallel and homomorphic encryption scheme
Directory of Open Access Journals (Sweden)
Min Zhaoe
2017-04-01
In order to protect data privacy whilst allowing efficient access to data in multi-node cloud environments, a parallel homomorphic encryption (PHE) scheme is proposed based on the additive homomorphism of the Paillier encryption algorithm. In this paper we propose a PHE algorithm, in which plaintext is divided into several blocks and the blocks are encrypted in parallel. Experimental results demonstrate that the encryption algorithm can reach a speed-up ratio of about 7.1 in the MapReduce environment with 16 cores and 4 nodes.
Scotland, Robert W
2011-01-01
Although parallel and convergent evolution are discussed extensively in technical articles and textbooks, their meaning can be overlapping, imprecise, and contradictory. The meaning of parallel evolution in much of the evolutionary literature grapples with two separate hypotheses in relation to phenotype and genotype, but often these two hypotheses have been inferred from only one hypothesis, and a number of subsidiary but problematic criteria, in relation to the phenotype. However, examples of parallel evolution of genetic traits that underpin or are at least associated with convergent phenotypes are now emerging. Four criteria for distinguishing parallelism from convergence are reviewed. All are found to be incompatible with any single proposition of homoplasy. Therefore, all homoplasy is equivalent to a broad view of convergence. Based on this concept, all phenotypic homoplasy can be described as convergence and all genotypic homoplasy as parallelism, which can be viewed as the equivalent concept of convergence for molecular data. Parallel changes of molecular traits may or may not be associated with convergent phenotypes but if so describe homoplasy at two biological levels: genotype and phenotype. Parallelism is not an alternative to convergence, but rather it entails homoplastic genetics that can be associated with and potentially explain, at the molecular level, how convergent phenotypes evolve. © 2011 Wiley Periodicals, Inc.
Compositional C++: Compositional Parallel Programming
Chandy, K. Mani; Kesselman, Carl
1992-01-01
A compositional parallel program is a program constructed by composing component programs in parallel, where the composed program inherits properties of its components. In this paper, we describe a small extension of C++ called Compositional C++ or CC++ which is an object-oriented notation that supports compositional parallel programming. CC++ integrates different paradigms of parallel programming: data-parallel, task-parallel and object-parallel paradigms; imperative and declarative programm...
The Xyce Parallel Electronic Simulator - An Overview
Energy Technology Data Exchange (ETDEWEB)
HUTCHINSON,SCOTT A.; KEITER,ERIC R.; HOEKSTRA,ROBERT J.; WATTS,HERMAN A.; WATERS,ARLON J.; SCHELLS,REGINA L.; WIX,STEVEN D.
2000-12-08
The Xyce™ Parallel Electronic Simulator has been written to support the simulation needs of the Sandia National Laboratories electrical designers. As such, the development has focused on providing the capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). In addition, they are providing improved performance for numerical kernels using state-of-the-art algorithms, support for modeling circuit phenomena at a variety of abstraction levels, and object-oriented, modern coding practices that ensure the code will be maintainable and extensible far into the future. The code is a parallel code in the most general sense of the phrase--a message passing parallel implementation--which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Furthermore, careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved even as the number of processors grows.
Fitting equation of state parameters in parallel computers
Directory of Open Access Journals (Sweden)
M. Castier
2014-12-01
This work compares two strategies to fit parameters of equations of state in parallel computers, emphasizing solutions that require few changes to existing sequential programs. One strategy uses the conventional Nelder-Mead algorithm coupled with parallel objective function evaluation (SSPO). The other strategy uses a parallel Nelder-Mead algorithm coupled with sequential objective function evaluation (PSSO). The PSSO strategy, which executes parallel one-dimensional searches during each iteration, is simpler to implement and converged to parameter sets with objective functions smaller than those obtained by the SSPO strategy. The SSPO strategy produced speedups consistent with the number of processes used and is more suitable when many processors are available. Both strategies are potentially useful and choosing between them is a matter of convenience, depending on the problem at hand. With parallel computers increasingly available, the easy implementation and convenience of these two strategies should appeal to developers and users of thermodynamic models.
Parallel Rayleigh Quotient Optimization with FSAI-Based Preconditioning
Directory of Open Access Journals (Sweden)
Luca Bergamaschi
2012-01-01
The present paper describes a parallel preconditioned algorithm for the solution of partial eigenvalue problems for large sparse symmetric matrices on parallel computers. Namely, we consider the Deflation-Accelerated Conjugate Gradient (DACG) algorithm accelerated by factorized-sparse-approximate-inverse (FSAI) type preconditioners. We present an enhanced parallel implementation of the FSAI preconditioner and make use of the recently developed Block FSAI-IC preconditioner, which combines the FSAI and the Block Jacobi-IC preconditioners. Results on matrices of large size arising from finite element discretization of geomechanical models reveal that DACG accelerated by this type of preconditioner is competitive with respect to the publicly available parallel hypre package, especially in the computation of a few of the leftmost eigenpairs. The parallel DACG code accelerated by FSAI is written in MPI-Fortran 90 and exhibits good scalability up to one thousand processors.
Parallelized seeded region growing using CUDA.
Park, Seongjin; Lee, Jeongjin; Lee, Hyunna; Shin, Juneseuk; Seo, Jinwook; Lee, Kyoung Ho; Shin, Yeong-Gil; Kim, Bohyoung
2014-01-01
This paper presents a novel method for parallelizing the seeded region growing (SRG) algorithm using Compute Unified Device Architecture (CUDA) technology, with the intention of overcoming a theoretical weakness of the SRG algorithm: its computation time is directly proportional to the size of a segmented region. The segmentation performance of the proposed CUDA-based SRG is compared with SRG implementations on single-core CPUs, quad-core CPUs, and shader language programming, using synthetic datasets and 20 body CT scans. Based on the experimental results, the CUDA-based SRG outperforms the other three implementations, indicating that it can substantially assist segmentation during massive CT screening tests.
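The weakness mentioned in this abstract is easy to see in a sequential baseline. The sketch below is a plain queue-based seeded region growing on a 2D list, not the CUDA method of the paper; the 4-connectivity, the fixed tolerance around the seed intensity and the function name are assumptions. Every accepted pixel is enqueued and visited exactly once, so the running time grows with the size of the segmented region.

```python
from collections import deque


def region_grow(image, seed, tol):
    """Return the set of pixels 4-connected to `seed` within `tol` of its value."""
    rows, cols = len(image), len(image[0])
    ref = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()          # one queue visit per region pixel
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - ref) <= tol):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


if __name__ == "__main__":
    img = [[10, 11, 50],
           [12, 10, 52],
           [51, 53, 10]]
    print(sorted(region_grow(img, (0, 0), 3)))
```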
A temperature predictor for parallel tempering simulations.
Patriksson, Alexandra; van der Spoel, David
2008-04-21
An algorithm is proposed that generates a set of temperatures for use in parallel tempering simulations (also known as temperature-replica exchange molecular dynamics simulations) of proteins to obtain a desired exchange probability Pdes. The input consists of the number of protein atoms and water molecules in the system, information about the use of constraints and virtual sites and the lower temperature limits. The temperatures generated yield probabilities which are very close to Pdes (correlation 97%), independent of force field and over a wide temperature range. To facilitate its use, the algorithm has been implemented as a web server at .
IMPLEMENTATION OF SERIAL AND PARALLEL BUBBLE SORT ON FPGA
Directory of Open Access Journals (Sweden)
Dwi Marhaendro Jati Purnomo
2016-06-01
Sorting is a common process in the computational world. It is used in many fields, from research to industry. Many sorting algorithms exist today. One of the simplest yet powerful is bubble sort. In this study, bubble sort is implemented on an FPGA. The implementation uses both serial and parallel approaches. Serial and parallel bubble sort are then compared in terms of memory, execution time, and utility, which comprises slices and LUTs. The experiments show that serial bubble sort required less memory and utility than parallel bubble sort. Meanwhile, parallel bubble sort performed faster than serial bubble sort.
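The abstract does not say which parallel formulation was used on the FPGA. A common one is odd-even transposition sort, sketched here next to serial bubble sort as an assumption rather than the authors' design; the function names are hypothetical. In each phase the compared pairs are disjoint, so hardware can perform all compare-exchange operations of a phase at once, and n phases suffice for n elements.

```python
def bubble_sort(values):
    """Serial bubble sort: one compare-exchange at a time."""
    a = list(values)
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


def odd_even_transposition_sort(values):
    """Parallel-friendly variant: each phase touches only disjoint pairs."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Even phases compare (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        # The pairs within a phase are independent of one another.
        for i in range(phase % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


if __name__ == "__main__":
    data = [5, 1, 4, 2, 8, 0, 3]
    print(bubble_sort(data), odd_even_transposition_sort(data))
```

The trade-off reported in the abstract matches this structure: the parallel form needs one comparator per pair (more area), but finishes in a number of phases linear in n instead of a quadratic number of sequential steps.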
The implementation of bit-parallelism for DNA sequence alignment
Setyorini; Kuspriyanto; Widyantoro, D. H.; Pancoro, A.
2017-05-01
Dynamic Programming (DP) remains the central algorithm of biological sequence alignment. Matching score computation is the most time-consuming process. Bit-parallelism is an approximate string matching technique that transforms DP matrix processing from cell units into word units (groups of cells). Bit-parallelism computes the scores column-wise. Adopting word-level processing from computer systems, this technique promises to reduce the time of score computation in the DP matrix. In this paper, we implement the bit-parallelism technique for DNA sequence alignment. Our bit-parallelism implementation takes less time for score computation but still needs improvement in the reconstruction process.
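A minimal sketch of the cell-versus-word contrast follows. It uses the Myers bit-vector algorithm in the formulation usually attributed to Hyyrö for unit-cost edit distance, which is an assumption: the abstract does not state which bit-parallel scheme or scoring model was implemented. Each DP column is encoded as two bit vectors of vertical differences (+1 and -1), so a whole column is updated with a handful of word operations. Python integers are unbounded, so a mask emulates a fixed word of m bits.

```python
def dp_edit_distance(a, b):
    """Classic cell-by-cell DP, kept as the reference."""
    prev = list(range(len(a) + 1))
    for j, cb in enumerate(b, 1):
        cur = [j]
        for i, ca in enumerate(a, 1):
            cur.append(min(prev[i] + 1, cur[i - 1] + 1, prev[i - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def bitparallel_edit_distance(a, b):
    """Bit-vector edit distance: one column of the DP matrix per iteration."""
    m = len(a)
    if m == 0:
        return len(b)
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)   # match mask per character
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    pv, mv, score = mask, 0, m                # first column: all deltas +1
    for ch in b:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask         # horizontal +1 deltas
        mh = pv & xh                          # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask           # the top row grows by 1 per column
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


if __name__ == "__main__":
    print(dp_edit_distance("kitten", "sitting"),
          bitparallel_edit_distance("kitten", "sitting"))
```

This sketch computes only the score. Recovering the alignment itself needs extra bookkeeping, which is the part the abstract reports as still needing improvement.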
Capacity Bounds for Parallel Optical Wireless Channels
Chaaban, Anas
2016-01-01
A system consisting of parallel optical wireless channels with a total average intensity constraint is studied. Capacity upper and lower bounds for this system are derived. Under perfect channel-state information at the transmitter (CSIT), the bounds have to be optimized with respect to the power allocation over the parallel channels. The optimization of the lower bound is non-convex; however, the KKT conditions can be used to find a list of possible solutions, one of which is optimal. The optimal solution can then be found by an exhaustive search algorithm, which is computationally expensive. To overcome this, we propose low-complexity power allocation algorithms which are nearly optimal. The optimized capacity lower bound nearly coincides with the capacity at high SNR. Without CSIT, our capacity bounds lead to upper and lower bounds on the outage probability. The outage probability bounds meet at high SNR. The system with average and peak intensity constraints is also discussed.
Parallel Jacobi EVD Methods on Integrated Circuits
Directory of Open Access Journals (Sweden)
Chi-Chia Sun
2014-01-01
Design strategies for parallel iterative algorithms are presented. In order to further study different tradeoff strategies in design criteria for integrated circuits, a 10 × 10 Jacobi Brent-Luk-EVD array with the simplified μ-CORDIC processor is used as an example. The experimental results show that using the μ-CORDIC processor is beneficial for the design criteria, as it yields a smaller area, faster overall computation time, and less energy consumption than the regular CORDIC processor. It is worth noting that the proposed parallel EVD method can be applied to real-time and low-power array signal processing algorithms performing beamforming or DOA estimation.
Hydrologic Terrain Processing Using Parallel Computing
Tarboton, D. G.; Watson, D. W.; Wallace, R. M.; Schreuders, K.; Tesfa, T. K.
2009-12-01
Topography in the form of Digital Elevation Models (DEMs) is widely used to derive information for the modeling of hydrologic processes. Hydrologic terrain analysis augments the information content of digital elevation data by removing spurious pits, deriving a structured flow field, and calculating surfaces of hydrologic information derived from the flow field. The increasing availability of high-resolution terrain datasets for large areas poses a challenge for existing algorithms that process terrain data to extract this hydrologic information. This paper will describe parallel algorithms that have been developed to enhance hydrologic terrain pre-processing so that larger datasets can be more efficiently computed. Message Passing Interface (MPI) parallel implementations have been developed for pit removal, flow direction, and generalized flow accumulation methods within the Terrain Analysis Using Digital Elevation Models (TauDEM) package. The parallel algorithm works by decomposing the domain into striped or tiled data partitions where each tile is processed by a separate processor. This method also reduces the memory requirements of each processor so that larger grids can be processed. The parallel pit removal algorithm is adapted from the method of Planchon and Darboux that starts from a high elevation, then progressively scans the grid, lowering each grid cell to the maximum of the original elevation or the lowest neighbor. The MPI implementation reconciles elevations along process domain edges after each scan. Generalized flow accumulation extends flow accumulation approaches commonly available in GIS through the integration of multiple inputs and a broad class of algebraic rules into the calculation of flow related quantities. It is based on establishing a flow field through DEM grid cells, which is then used to evaluate any mathematical function that incorporates dependence on values of the quantity being evaluated at upslope (or downslope) grid cells
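The pit removal rule quoted in this abstract (start from a high surface, then repeatedly lower each cell to the maximum of its original elevation and its lowest neighbor) can be sketched sequentially as follows. This is a simplified single-process illustration, not TauDEM code: the function name, the use of eight neighbors, the zero slope increment and the choice to pin edge cells to their terrain elevation are assumptions. In the MPI version described above, each process would run such scans on its own tile and exchange edge rows after every scan.

```python
import math


def fill_pits(dem):
    """Raise pits in a rectangular grid so every cell can drain to the edge."""
    rows, cols = len(dem), len(dem[0])
    # Start with edge cells at terrain height and interior cells "flooded".
    w = [[dem[r][c] if r in (0, rows - 1) or c in (0, cols - 1) else math.inf
          for c in range(cols)] for r in range(rows)]
    changed = True
    while changed:                       # repeat scans until nothing moves
        changed = False
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if w[r][c] > dem[r][c]:
                    lowest = min(w[r + dr][c + dc]
                                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                                 if (dr, dc) != (0, 0))
                    new = max(dem[r][c], lowest)
                    if new < w[r][c]:
                        w[r][c] = new
                        changed = True
    return w


if __name__ == "__main__":
    terrain = [[3, 3, 3],
               [3, 1, 3],
               [3, 3, 2]]
    for row in fill_pits(terrain):
        print(row)
```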
Energy Technology Data Exchange (ETDEWEB)
Foster, I.; Tuecke, S.
1991-09-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, a set of tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory at info.mcs.anl.gov.
Energy Technology Data Exchange (ETDEWEB)
Foster, I.; Tuecke, S.
1991-12-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A).
Muhammad, Naeem; Boucké, Nelis; Berbers, Yolande
2010-01-01
The use of parallelism enhances the performance of a software system. However, its excessive use can degrade the system performance. In this report we propose a parallelism viewpoint to optimize the use of parallelism by eliminating unnecessary parallelism in legacy systems. The parallelism viewpoint describes parallelism of the system in order to analyze multiple overheads associated with its threads. We use the proposed viewpoint to find parallelism-specific performance overheads of ...
Scalable parallel communications
Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.
1992-01-01
Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCPs running in parallel provide high bandwidth