Massively parallel Fokker-Planck code ALLAp
International Nuclear Information System (INIS)
Batishcheva, A.A.; Krasheninnikov, S.I.; Craddock, G.G.; Djordjevic, V.
1996-01-01
The recently developed for workstations Fokker-Planck code ALLA simulates the temporal evolution of 1V, 2V and 1D2V collisional edge plasmas. In this work we present the results of code parallelization on the CRI T3D massively parallel platform (ALLAp version). Simultaneously we benchmark the 1D2V parallel vesion against an analytic self-similar solution of the collisional kinetic equation. This test is not trivial as it demands a very strong spatial temperature and density variation within the simulation domain. (orig.)
Biferale, L.; Mantovani, F.; Pivanti, M.; Pozzati, F.; Sbragaglia, M.; Schifano, S.F.; Toschi, F.; Tripiccione, R.
2011-01-01
We develop a Lattice Boltzmann code for computational fluid-dynamics and optimize it for massively parallel systems based on multi-core processors. Our code describes 2D multi-phase compressible flows. We analyze the performance bottlenecks that we find as we gradually expose a larger fraction of
The language parallel Pascal and other aspects of the massively parallel processor
Reeves, A. P.; Bruner, J. D.
1982-01-01
A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.
CHOLLA: A NEW MASSIVELY PARALLEL HYDRODYNAMICS CODE FOR ASTROPHYSICAL SIMULATION
International Nuclear Information System (INIS)
Schneider, Evan E.; Robertson, Brant E.
2015-01-01
We present Computational Hydrodynamics On ParaLLel Architectures (Cholla ), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256 3 ) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density
CHOLLA: A NEW MASSIVELY PARALLEL HYDRODYNAMICS CODE FOR ASTROPHYSICAL SIMULATION
Energy Technology Data Exchange (ETDEWEB)
Schneider, Evan E.; Robertson, Brant E. [Steward Observatory, University of Arizona, 933 North Cherry Avenue, Tucson, AZ 85721 (United States)
2015-04-15
We present Computational Hydrodynamics On ParaLLel Architectures (Cholla ), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256{sup 3}) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.
User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code
International Nuclear Information System (INIS)
Earth Sciences Division; Zhang, Keni; Zhang, Keni; Wu, Yu-Shu; Pruess, Karsten
2008-01-01
TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one, two, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator is to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on the TOUGH2 Version 1.4 with EOS3, EOS9, and T2R3D modules, a software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4. This report provides a quick starting guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) version TOUGH2 code. The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, as well as mathematical and numerical methods used
International Nuclear Information System (INIS)
Idomura, Yasuhiro; Jolliet, Sebastien
2010-01-01
A gyrokinetic toroidal five dimensional Eulerian code GT5D is ported on six advanced massively parallel platforms and comprehensive benchmark tests are performed. A parallelisation technique based on physical properties of the gyrokinetic equation is presented. By extending the parallelisation technique with a hybrid parallel model, the scalability of the code is improved on platforms with multi-core processors. In the benchmark tests, a good salability is confirmed up to several thousands cores on every platforms, and the maximum sustained performance of ∼18.6 Tflops is achieved using 16384 cores of BX900. (author)
First massively parallel algorithm to be implemented in Apollo-II code
International Nuclear Information System (INIS)
Stankovski, Z.
1994-01-01
The collision probability (CP) method in neutron transport, as applied to arbitrary 2D XY geometries, like the TDT module in APOLLO-II, is very time consuming. Consequently RZ or 3D extensions became prohibitive. Fortunately, this method is very suitable for parallelization. Massively parallel computer architectures, especially MIMD machines, bring a new breath to this method. In this paper we present a CM5 implementation of the CP method. Parallelization is applied to the energy groups, using the CMMD message passing library. In our case we use 32 processors for the standard 99-group APOLLIB-II library. The real advantage of this algorithm will appear in the calculation of the future fine multigroup library (about 8000 groups) of the SAPHYR project with a massively parallel computer (to the order of hundreds of processors). (author). 3 tabs., 4 figs., 4 refs
First massively parallel algorithm to be implemented in APOLLO-II code
International Nuclear Information System (INIS)
Stankovski, Z.
1994-01-01
The collision probability method in neutron transport, as applied to arbitrary 2-dimensional geometries, like the two dimensional transport module in APOLLO-II is very time consuming. Consequently 3-dimensional extension became prohibitive. Fortunately, this method is very suitable for parallelization. Massively parallel computer architectures, especially MIMD machines, bring a new breath to this method. In this paper we present a CM5 implementation of the collision probability method. Parallelization is applied to the energy groups, using the CMMD massage passing library. In our case we used 32 processors for the standard 99-group APOLLIB-II library. The real advantage of this algorithm will appear in the calculation of the future multigroup library (about 8000 groups) of the SAPHYR project with a massively parallel computer (to the order of hundreds of processors). (author). 4 refs., 4 figs., 3 tabs
Computational fluid dynamics on a massively parallel computer
Jespersen, Dennis C.; Levit, Creon
1989-01-01
A finite difference code was implemented for the compressible Navier-Stokes equations on the Connection Machine, a massively parallel computer. The code is based on the ARC2D/ARC3D program and uses the implicit factored algorithm of Beam and Warming. The codes uses odd-even elimination to solve linear systems. Timings and computation rates are given for the code, and a comparison is made with a Cray XMP.
A Massively Parallel Code for Polarization Calculations
Akiyama, Shizuka; Höflich, Peter
2001-03-01
We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.
Parallel computing by Monte Carlo codes MVP/GMVP
International Nuclear Information System (INIS)
Nagaya, Yasunobu; Nakagawa, Masayuki; Mori, Takamasa
2001-01-01
General-purpose Monte Carlo codes MVP/GMVP are well-vectorized and thus enable us to perform high-speed Monte Carlo calculations. In order to achieve more speedups, we parallelized the codes on the different types of parallel computing platforms or by using a standard parallelization library MPI. The platforms used for benchmark calculations are a distributed-memory vector-parallel computer Fujitsu VPP500, a distributed-memory massively parallel computer Intel paragon and a distributed-memory scalar-parallel computer Hitachi SR2201, IBM SP2. As mentioned generally, linear speedup could be obtained for large-scale problems but parallelization efficiency decreased as the batch size per a processing element(PE) was smaller. It was also found that the statistical uncertainty for assembly powers was less than 0.1% by the PWR full-core calculation with more than 10 million histories and it took about 1.5 hours by massively parallel computing. (author)
Parallelization of quantum molecular dynamics simulation code
International Nuclear Information System (INIS)
Kato, Kaori; Kunugi, Tomoaki; Shibahara, Masahiko; Kotake, Susumu
1998-02-01
A quantum molecular dynamics simulation code has been developed for the analysis of the thermalization of photon energies in the molecule or materials in Kansai Research Establishment. The simulation code is parallelized for both Scalar massively parallel computer (Intel Paragon XP/S75) and Vector parallel computer (Fujitsu VPP300/12). Scalable speed-up has been obtained with a distribution to processor units by division of particle group in both parallel computers. As a result of distribution to processor units not only by particle group but also by the particles calculation that is constructed with fine calculations, highly parallelization performance is achieved in Intel Paragon XP/S75. (author)
International Nuclear Information System (INIS)
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; Davidson, Gregory G.; Hamilton, Steven P.; Godfrey, Andrew T.
2015-01-01
This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Some specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000 ® problems. These benchmark and scaling studies show promising results
Massively parallel multicanonical simulations
Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard
2018-03-01
Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 104 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.
Engineering-Based Thermal CFD Simulations on Massive Parallel Systems
Frisch, Jérôme
2015-05-22
The development of parallel Computational Fluid Dynamics (CFD) codes is a challenging task that entails efficient parallelization concepts and strategies in order to achieve good scalability values when running those codes on modern supercomputers with several thousands to millions of cores. In this paper, we present a hierarchical data structure for massive parallel computations that supports the coupling of a Navier–Stokes-based fluid flow code with the Boussinesq approximation in order to address complex thermal scenarios for energy-related assessments. The newly designed data structure is specifically designed with the idea of interactive data exploration and visualization during runtime of the simulation code; a major shortcoming of traditional high-performance computing (HPC) simulation codes. We further show and discuss speed-up values obtained on one of Germany’s top-ranked supercomputers with up to 140,000 processes and present simulation results for different engineering-based thermal problems.
International Nuclear Information System (INIS)
Stankovski, Z.
1995-01-01
The collision probability method in neutron transport, as applied to 2D geometries, consume a great amount of computer time, for a typical 2D assembly calculation evaluations. Consequently RZ or 3D calculations became prohibitive. In this paper we present a simple but efficient parallel algorithm based on the message passing host/node programing model. Parallelization was applied to the energy group treatment. Such approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, witch is a necessary condition for a industrial code. Sequential performances are also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SP1 and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SP1. Because of heterogeneity of the workstation network, we did ask high performances for this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors. (author). 5 refs., 6 figs., 2 tabs
International Nuclear Information System (INIS)
Stankovski, Z.
1995-01-01
The collision probability method in neutron transport, as applied to 2D geometries, consume a great amount of computer time, for a typical 2D assembly calculation about 90% of the computing time is consumed in the collision probability evaluations. Consequently RZ or 3D calculations became prohibitive. In this paper the author presents a simple but efficient parallel algorithm based on the message passing host/node programmation model. Parallelization was applied to the energy group treatment. Such approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, which is a necessary condition for a industrial code. Sequential performances are also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SPI and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SPI. Because of heterogeneity of the workstation network, the author did not ask high performances for this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors
Energy Technology Data Exchange (ETDEWEB)
Gentzsch, W.; Ferstl, F.; Paap, H.G.; Riedel, E.
1998-03-20
In the ARTS project, system software has been developed to support smog and fluid dynamic applications on massively parallel systems. The aim is to implement and test specific software structures within an adaptive run-time system to separate the parallel core algorithms of the applications from the platform independent runtime aspects. Only slight modifications is existing Fortran and C code are necessary to integrate the application code into the new object oriented parallel integrated ARTS framework. The OO-design offers easy control, re-use and adaptation of the system services, resulting in a dramatic decrease in development time of the application and in ease of maintainability of the application software in the future. (orig.) [Deutsch] Im Projekt ARTS wird Basissoftware zur Unterstuetzung von Anwendungen aus den Bereichen Smoganalyse und Stroemungsmechanik auf massiv parallelen Systemen entwickelt und optimiert. Im Vordergrund steht die Erprobung geeigneter Strukturen, um systemnahe Funktionalitaeten in einer Laufzeitumgebung anzusiedeln und dadurch die parallelen Kernalgorithmen der Anwendungsprogramme von den plattformunabhaengigen Laufzeitaspekten zu trennen. Es handelt sich dabei um herkoemmlich strukturierten Fortran-Code, der unter minimalen Aenderungen auch weiterhin nutzbar sein muss, sowie um objektbasiert entworfenen C-Code, der die volle Funktionalitaet der ARTS-Plattform ausnutzen kann. Ein objektorientiertes Design erlaubt eine einfache Kontrolle, Wiederverwendung und Adaption der vom System vorgegebenen Basisdienste. Daraus resultiert ein deutlich reduzierter Entwicklungs- und Laufzeitaufwand fuer die Anwendung. ARTS schafft eine integrierende Plattform, die moderne Technologien aus dem Bereich objektorientierter Laufzeitsysteme mit praxisrelevanten Anforderungen aus dem Bereich des wissenschaftlichen Hoechstleistungsrechnens kombiniert. (orig.)
Development of parallel Fokker-Planck code ALLAp
International Nuclear Information System (INIS)
Batishcheva, A.A.; Sigmar, D.J.; Koniges, A.E.
1996-01-01
We report on our ongoing development of the 3D Fokker-Planck code ALLA for a highly collisional scrape-off-layer (SOL) plasma. A SOL with strong gradients of density and temperature in the spatial dimension is modeled. Our method is based on a 3-D adaptive grid (in space, magnitude of the velocity, and cosine of the pitch angle) and a second order conservative scheme. Note that the grid size is typically 100 x 257 x 65 nodes. It was shown in our previous work that only these capabilities make it possible to benchmark a 3D code against a spatially-dependent self-similar solution of a kinetic equation with the Landau collision term. In the present work we show results of a more precise benchmarking against the exact solutions of the kinetic equation using a new parallel code ALLAp with an improved method of parallelization and a modified boundary condition at the plasma edge. We also report first results from the code parallelization using Message Passing Interface for a Massively Parallel CRI T3D platform. We evaluate the ALLAp code performance versus the number of T3D processors used and compare its efficiency against a Work/Data Sharing parallelization scheme and a workstation version
Directory of Open Access Journals (Sweden)
Matthew O'keefe
1995-01-01
Full Text Available Massively parallel processors (MPPs hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how applications codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.
Massively parallel computation of PARASOL code on the Origin 3800 system
International Nuclear Information System (INIS)
Hosokawa, Masanari; Takizuka, Tomonori
2001-10-01
The divertor particle simulation code named PARASOL simulates open-field plasmas between divertor walls self-consistently by using an electrostatic PIC method and a binary collision Monte Carlo model. The PARASOL parallelized with MPI-1.1 for scalar parallel computer worked on Intel Paragon XP/S system. A system SGI Origin 3800 was newly installed (May, 2001). The parallel programming was improved at this switchover. As a result of the high-performance new hardware and this improvement, the PARASOL is speeded up by about 60 times with the same number of processors. (author)
Massively Parallel and Scalable Implicit Time Integration Algorithms for Structural Dynamics
Farhat, Charbel
1997-01-01
Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because of the following additional facts: (a) explicit schemes are easier to parallelize than implicit ones, and (b) explicit schemes induce short range interprocessor communications that are relatively inexpensive, while the factorization methods used in most implicit schemes induce long range interprocessor communications that often ruin the sought-after speed-up. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet be offset by the speed of the currently available parallel hardware. Therefore, it is essential to develop efficient alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating the low-frequency dynamics of aerospace structures.
Massively Parallel Computing: A Sandia Perspective
Energy Technology Data Exchange (ETDEWEB)
Dosanjh, Sudip S.; Greenberg, David S.; Hendrickson, Bruce; Heroux, Michael A.; Plimpton, Steve J.; Tomkins, James L.; Womble, David E.
1999-05-06
The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is beginning to realize its potential for enabling significant break-throughs in science and engineering. This paper provides a perspective on the state of the field, colored by the authors' experiences using large scale parallel machines at Sandia National Laboratories. We address trends in hardware, system software and algorithms, and we also offer our view of the forces shaping the parallel computing industry.
Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo
2017-08-01
We present an application of massively parallel processing of quantitative flow measurements data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150 fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements. Catalogue identifier: AFBT_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3 No. of lines in distributed program, including test data, etc.: 913552 No. of bytes in distributed program, including test data, etc.: 270876249 Distribution format: tar.gz Programming language: CUDA/C, MATLAB. Computer: Intel x64 CPU, GPU supporting CUDA technology. Operating system: 64-bit Windows 7 Professional. Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized. RAM: Dependent on users parameters, typically between several gigabytes and several tens of gigabytes Classification: 6.5, 18. Nature of problem: Speed up of data processing in optical coherence microscopy Solution method: Utilization of GPU for massively parallel data processing Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data) Running time: 1,8 s for one B-scan (150 × faster in comparison to the CPU
Massively Parallel Algorithms for Solution of Schrodinger Equation
Fijany, Amir; Barhen, Jacob; Toomerian, Nikzad
1994-01-01
In this paper massively parallel algorithms for solution of Schrodinger equation are developed. Our results clearly indicate that the Crank-Nicolson method, in addition to its excellent numerical properties, is also highly suitable for massively parallel computation.
Massively parallel mathematical sieves
Energy Technology Data Exchange (ETDEWEB)
Montry, G.R.
1989-01-01
The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
Adapting algorithms to massively parallel hardware
Sioulas, Panagiotis
2016-01-01
In the recent years, the trend in computing has shifted from delivering processors with faster clock speeds to increasing the number of cores per processor. This marks a paradigm shift towards parallel programming in which applications are programmed to exploit the power provided by multi-cores. Usually there is gain in terms of the time-to-solution and the memory footprint. Specifically, this trend has sparked an interest towards massively parallel systems that can provide a large number of processors, and possibly computing nodes, as in the GPUs and MPPAs (Massively Parallel Processor Arrays). In this project, the focus was on two distinct computing problems: k-d tree searches and track seeding cellular automata. The goal was to adapt the algorithms to parallel systems and evaluate their performance in different cases.
A Massively Parallel Face Recognition System
Directory of Open Access Journals (Sweden)
Lahdenoja Olli
2007-01-01
Full Text Available We present methods for processing the LBPs (local binary patterns with a massively parallel hardware, especially with CNN-UM (cellular nonlinear network-universal machine. In particular, we present a framework for implementing a massively parallel face recognition system, including a dedicated highly accurate algorithm suitable for various types of platforms (e.g., CNN-UM and digital FPGA. We study in detail a dedicated mixed-mode implementation of the algorithm and estimate its implementation cost in the view of its performance and accuracy restrictions.
A Massively Parallel Face Recognition System
Directory of Open Access Journals (Sweden)
Ari Paasio
2006-12-01
Full Text Available We present methods for processing the LBPs (local binary patterns with a massively parallel hardware, especially with CNN-UM (cellular nonlinear network-universal machine. In particular, we present a framework for implementing a massively parallel face recognition system, including a dedicated highly accurate algorithm suitable for various types of platforms (e.g., CNN-UM and digital FPGA. We study in detail a dedicated mixed-mode implementation of the algorithm and estimate its implementation cost in the view of its performance and accuracy restrictions.
International Nuclear Information System (INIS)
Soltz, R; Vranas, P; Blumrich, M; Chen, D; Gara, A; Giampap, M; Heidelberger, P; Salapura, V; Sexton, J; Bhanot, G
2007-01-01
The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results
New adaptive differencing strategy in the PENTRAN 3-d parallel Sn code
International Nuclear Information System (INIS)
Sjoden, G.E.; Haghighat, A.
1996-01-01
It is known that three-dimensional (3-D) discrete ordinates (S n ) transport problems require an immense amount of storage and computational effort to solve. For this reason, parallel codes that offer a capability to completely decompose the angular, energy, and spatial domains among a distributed network of processors are required. One such code recently developed is PENTRAN, which iteratively solves 3-D multi-group, anisotropic S n problems on distributed-memory platforms, such as the IBM-SP2. Because large problems typically contain several different material zones with various properties, available differencing schemes should automatically adapt to the transport physics in each material zone. To minimize the memory and message-passing overhead required for massively parallel S n applications, available differencing schemes in an adaptive strategy should also offer reasonable accuracy and positivity, yet require only the zeroth spatial moment of the transport equation; differencing schemes based on higher spatial moments, in spite of their greater accuracy, require at least twice the amount of storage and communication cost for implementation in a massively parallel transport code. This paper discusses a new adaptive differencing strategy that uses increasingly accurate schemes with low parallel memory and communication overhead. This strategy, implemented in PENTRAN, includes a new scheme, exponential directional averaged (EDA) differencing
Resolution of the neutron transport equation by massively parallel computer in the Cronos code
International Nuclear Information System (INIS)
Zardini, D.M.
1996-01-01
The feasibility of neutron transport problems parallel resolution by CRONOS code's SN module is here studied. In this report we give the first data about the parallel resolution by angular variable decomposition of the transport equation. Problems about parallel resolution by spatial variable decomposition and memory stage limits are also explained here. (author)
Programming massively parallel processors a hands-on approach
Kirk, David B
2010-01-01
Programming Massively Parallel Processors discusses basic concepts about parallel programming and GPU architecture. ""Massively parallel"" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVI- DIA GPUs. Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA. The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who ...
Neural Parallel Engine: A toolbox for massively parallel neural signal processing.
Tam, Wing-Kin; Yang, Zhi
2018-05-01
Large-scale neural recordings provide detailed information on neuronal activities and can help elicit the underlying neural mechanisms of the brain. However, the computational burden is also formidable when we try to process the huge data stream generated by such recordings. In this study, we report the development of Neural Parallel Engine (NPE), a toolbox for massively parallel neural signal processing on graphical processing units (GPUs). It offers a selection of the most commonly used routines in neural signal processing such as spike detection and spike sorting, including advanced algorithms such as exponential-component-power-component (EC-PC) spike detection and binary pursuit spike sorting. We also propose a new method for detecting peaks in parallel through a parallel compact operation. Our toolbox is able to offer a 5× to 110× speedup compared with its CPU counterparts depending on the algorithms. A user-friendly MATLAB interface is provided to allow easy integration of the toolbox into existing workflows. Previous efforts on GPU neural signal processing only focus on a few rudimentary algorithms, are not well-optimized and often do not provide a user-friendly programming interface to fit into existing workflows. There is a strong need for a comprehensive toolbox for massively parallel neural signal processing. A new toolbox for massively parallel neural signal processing has been created. It can offer significant speedup in processing signals from large-scale recordings up to thousands of channels. Copyright © 2018 Elsevier B.V. All rights reserved.
OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)
1998-01-01
This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
Load-balancing techniques for a parallel electromagnetic particle-in-cell code
Energy Technology Data Exchange (ETDEWEB)
PLIMPTON,STEVEN J.; SEIDEL,DAVID B.; PASIK,MICHAEL F.; COATS,REBECCA S.
2000-01-01
QUICKSILVER is a 3-d electromagnetic particle-in-cell simulation code developed and used at Sandia to model relativistic charged particle transport. It models the time-response of electromagnetic fields and low-density-plasmas in a self-consistent manner: the fields push the plasma particles and the plasma current modifies the fields. Through an LDRD project a new parallel version of QUICKSILVER was created to enable large-scale plasma simulations to be run on massively-parallel distributed-memory supercomputers with thousands of processors, such as the Intel Tflops and DEC CPlant machines at Sandia. The new parallel code implements nearly all the features of the original serial QUICKSILVER and can be run on any platform which supports the message-passing interface (MPI) standard as well as on single-processor workstations. This report describes basic strategies useful for parallelizing and load-balancing particle-in-cell codes, outlines the parallel algorithms used in this implementation, and provides a summary of the modifications made to QUICKSILVER. It also highlights a series of benchmark simulations which have been run with the new code that illustrate its performance and parallel efficiency. These calculations have up to a billion grid cells and particles and were run on thousands of processors. This report also serves as a user manual for people wishing to run parallel QUICKSILVER.
Load-balancing techniques for a parallel electromagnetic particle-in-cell code
International Nuclear Information System (INIS)
Plimpton, Steven J.; Seidel, David B.; Pasik, Michael F.; Coats, Rebecca S.
2000-01-01
QUICKSILVER is a 3-d electromagnetic particle-in-cell simulation code developed and used at Sandia to model relativistic charged particle transport. It models the time-response of electromagnetic fields and low-density-plasmas in a self-consistent manner: the fields push the plasma particles and the plasma current modifies the fields. Through an LDRD project a new parallel version of QUICKSILVER was created to enable large-scale plasma simulations to be run on massively-parallel distributed-memory supercomputers with thousands of processors, such as the Intel Tflops and DEC CPlant machines at Sandia. The new parallel code implements nearly all the features of the original serial QUICKSILVER and can be run on any platform which supports the message-passing interface (MPI) standard as well as on single-processor workstations. This report describes basic strategies useful for parallelizing and load-balancing particle-in-cell codes, outlines the parallel algorithms used in this implementation, and provides a summary of the modifications made to QUICKSILVER. It also highlights a series of benchmark simulations which have been run with the new code that illustrate its performance and parallel efficiency. These calculations have up to a billion grid cells and particles and were run on thousands of processors. This report also serves as a user manual for people wishing to run parallel QUICKSILVER
RAMA: A file system for massively parallel computers
Miller, Ethan L.; Katz, Randy H.
1993-01-01
This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated in lo the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
Sandalski, Stou
Smooth particle hydrodynamics is an efficient method for modeling the dynamics of fluids. It is commonly used to simulate astrophysical processes such as binary mergers. We present a newly developed GPU accelerated smooth particle hydrodynamics code for astrophysical simulations. The code is named neptune after the Roman god of water. It is written in OpenMP parallelized C++ and OpenCL and includes octree based hydrodynamic and gravitational acceleration. The design relies on object-oriented methodologies in order to provide a flexible and modular framework that can be easily extended and modified by the user. Several pre-built scenarios for simulating collisions of polytropes and black-hole accretion are provided. The code is released under the MIT Open Source license and publicly available at http://code.google.com/p/neptune-sph/.
GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES
Energy Technology Data Exchange (ETDEWEB)
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal [Department of Astronomy, University of Arizona, 933 N. Cherry Ave., Tucson, AZ 85721 (United States)
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
Visualizing Network Traffic to Understand the Performance of Massively Parallel Simulations
Landge, A. G.
2012-12-01
The performance of massively parallel applications is often heavily impacted by the cost of communication among compute nodes. However, determining how to best use the network is a formidable task, made challenging by the ever increasing size and complexity of modern supercomputers. This paper applies visualization techniques to aid parallel application developers in understanding the network activity by enabling a detailed exploration of the flow of packets through the hardware interconnect. In order to visualize this large and complex data, we employ two linked views of the hardware network. The first is a 2D view, that represents the network structure as one of several simplified planar projections. This view is designed to allow a user to easily identify trends and patterns in the network traffic. The second is a 3D view that augments the 2D view by preserving the physical network topology and providing a context that is familiar to the application developers. Using the massively parallel multi-physics code pF3D as a case study, we demonstrate that our tool provides valuable insight that we use to explain and optimize pF3D-s performance on an IBM Blue Gene/P system. © 1995-2012 IEEE.
Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A L
2012-06-13
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.
2012-06-01
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
International Nuclear Information System (INIS)
Andrade, Xavier; Aspuru-Guzik, Alán; Alberdi-Rodriguez, Joseba; Rubio, Angel; Strubbe, David A; Louie, Steven G; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Marques, Miguel A L
2012-01-01
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures. (topical review)
Frontiers of massively parallel scientific computation
International Nuclear Information System (INIS)
Fischer, J.R.
1987-07-01
Practical applications using massively parallel computer hardware first appeared during the 1980s. Their development was motivated by the need for computing power orders of magnitude beyond that available today for tasks such as numerical simulation of complex physical and biological processes, generation of interactive visual displays, satellite image analysis, and knowledge based systems. Representative of the first generation of this new class of computers is the Massively Parallel Processor (MPP). A team of scientists was provided the opportunity to test and implement their algorithms on the MPP. The first results are presented. The research spans a broad variety of applications including Earth sciences, physics, signal and image processing, computer science, and graphics. The performance of the MPP was very good. Results obtained using the Connection Machine and the Distributed Array Processor (DAP) are presented
Massively parallel quantum computer simulator
De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.
2007-01-01
We describe portable software to simulate universal quantum computers on massive parallel Computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as a IBM BlueGene/L, a IBM Regatta p690+, a Hitachi SR11000/J1, a Cray
The 2nd Symposium on the Frontiers of Massively Parallel Computations
Mills, Ronnie (Editor)
1988-01-01
Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.
Impact analysis on a massively parallel computer
International Nuclear Information System (INIS)
Zacharia, T.; Aramayo, G.A.
1994-01-01
Advanced mathematical techniques and computer simulation play a major role in evaluating and enhancing the design of beverage cans, industrial, and transportation containers for improved performance. Numerical models are used to evaluate the impact requirements of containers used by the Department of Energy (DOE) for transporting radioactive materials. Many of these models are highly compute-intensive. An analysis may require several hours of computational time on current supercomputers despite the simplicity of the models being studied. As computer simulations and materials databases grow in complexity, massively parallel computers have become important tools. Massively parallel computational research at the Oak Ridge National Laboratory (ORNL) and its application to the impact analysis of shipping containers is briefly described in this paper
Massively parallel evolutionary computation on GPGPUs
Tsutsui, Shigeyoshi
2013-01-01
Evolutionary algorithms (EAs) are metaheuristics that learn from natural collective behavior and are applied to solve optimization problems in domains such as scheduling, engineering, bioinformatics, and finance. Such applications demand acceptable solutions with high-speed execution using finite computational resources. Therefore, there have been many attempts to develop platforms for running parallel EAs using multicore machines, massively parallel cluster machines, or grid computing environments. Recent advances in general-purpose computing on graphics processing units (GPGPU) have opened u
Vanderbauwhede, Wim; Davidson, Gavin
2017-01-01
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent a powerful and affordable tool for scientists who look to speed up simulations of complex systems. However, porting code to such devices requires a detailed understanding of heterogeneous programming tools and effective strategies for parallelization. In this paper we present a source to source compilation approach with whole-program analysis to automatically transform single-threaded FORTRAN 77 legacy code into Ope...
Massively parallel sequencing of forensic STRs
DEFF Research Database (Denmark)
Parson, Walther; Ballard, David; Budowle, Bruce
2016-01-01
The DNA Commission of the International Society for Forensic Genetics (ISFG) is reviewing factors that need to be considered ahead of the adoption by the forensic community of short tandem repeat (STR) genotyping by massively parallel sequencing (MPS) technologies. MPS produces sequence data that...
A discrete ordinate response matrix method for massively parallel computers
International Nuclear Information System (INIS)
Hanebutte, U.R.; Lewis, E.E.
1991-01-01
A discrete ordinate response matrix method is formulated for the solution of neutron transport problems on massively parallel computers. The response matrix formulation eliminates iteration on the scattering source. The nodal matrices which result from the diamond-differenced equations are utilized in a factored form which minimizes memory requirements and significantly reduces the required number of algorithm utilizes massive parallelism by assigning each spatial node to a processor. The algorithm is accelerated effectively by a synthetic method in which the low-order diffusion equations are also solved by massively parallel red/black iterations. The method has been implemented on a 16k Connection Machine-2, and S 8 and S 16 solutions have been obtained for fixed-source benchmark problems in X--Y geometry
International Nuclear Information System (INIS)
Thomas, B.; Domain, Ch.; Souffez, Y.; Eon-Duval, P.
1998-01-01
Harnessing the power of many computers, to solve concurrently difficult scientific problems, is one of the most innovative trend in High Performance Computing. At EDF, we have invested in parallel computing and have achieved significant results. First we improved the processing speed of strategic codes, in order to extend their scope. Then we turned to numerical simulations at the atomic scale. These computations, we never dreamt of before, provided us with a better understanding of metallurgic phenomena. More precisely we were able to trace defects in alloys that are used in nuclear power plants. (author)
Template based parallel checkpointing in a massively parallel computer system
Archer, Charles Jens [Rochester, MN; Inglett, Todd Alan [Rochester, MN
2009-01-13
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
Parallelization of 2-D lattice Boltzmann codes
International Nuclear Information System (INIS)
Suzuki, Soichiro; Kaburaki, Hideo; Yokokawa, Mitsuo.
1996-03-01
Lattice Boltzmann (LB) codes to simulate two dimensional fluid flow are developed on vector parallel computer Fujitsu VPP500 and scalar parallel computer Intel Paragon XP/S. While a 2-D domain decomposition method is used for the scalar parallel LB code, a 1-D domain decomposition method is used for the vector parallel LB code to be vectorized along with the axis perpendicular to the direction of the decomposition. High parallel efficiency of 95.1% by the vector parallel calculation on 16 processors with 1152x1152 grid and 88.6% by the scalar parallel calculation on 100 processors with 800x800 grid are obtained. The performance models are developed to analyze the performance of the LB codes. It is shown by our performance models that the execution speed of the vector parallel code is about one hundred times faster than that of the scalar parallel code with the same number of processors up to 100 processors. We also analyze the scalability in keeping the available memory size of one processor element at maximum. Our performance model predicts that the execution time of the vector parallel code increases about 3% on 500 processors. Although the 1-D domain decomposition method has in general a drawback in the interprocessor communication, the vector parallel LB code is still suitable for the large scale and/or high resolution simulations. (author)
Parallelization of 2-D lattice Boltzmann codes
Energy Technology Data Exchange (ETDEWEB)
Suzuki, Soichiro; Kaburaki, Hideo; Yokokawa, Mitsuo
1996-03-01
Lattice Boltzmann (LB) codes to simulate two dimensional fluid flow are developed on vector parallel computer Fujitsu VPP500 and scalar parallel computer Intel Paragon XP/S. While a 2-D domain decomposition method is used for the scalar parallel LB code, a 1-D domain decomposition method is used for the vector parallel LB code to be vectorized along with the axis perpendicular to the direction of the decomposition. High parallel efficiency of 95.1% by the vector parallel calculation on 16 processors with 1152x1152 grid and 88.6% by the scalar parallel calculation on 100 processors with 800x800 grid are obtained. The performance models are developed to analyze the performance of the LB codes. It is shown by our performance models that the execution speed of the vector parallel code is about one hundred times faster than that of the scalar parallel code with the same number of processors up to 100 processors. We also analyze the scalability in keeping the available memory size of one processor element at maximum. Our performance model predicts that the execution time of the vector parallel code increases about 3% on 500 processors. Although the 1-D domain decomposition method has in general a drawback in the interprocessor communication, the vector parallel LB code is still suitable for the large scale and/or high resolution simulations. (author).
A Programming Model for Massive Data Parallelism with Data Dependencies
International Nuclear Information System (INIS)
Cui, Xiaohui; Mueller, Frank; Potok, Thomas E.; Zhang, Yongpeng
2009-01-01
Accelerating processors can often be more cost and energy effective for a wide range of data-parallel computing problems than general-purpose processors. For graphics processor units (GPUs), this is particularly the case when program development is aided by environments such as NVIDIA s Compute Unified Device Architecture (CUDA), which dramatically reduces the gap between domain-specific architectures and general purpose programming. Nonetheless, general-purpose GPU (GPGPU) programming remains subject to several restrictions. Most significantly, the separation of host (CPU) and accelerator (GPU) address spaces requires explicit management of GPU memory resources, especially for massive data parallelism that well exceeds the memory capacity of GPUs. One solution to this problem is to transfer data between the GPU and host memories frequently. In this work, we investigate another approach. We run massively data-parallel applications on GPU clusters. We further propose a programming model for massive data parallelism with data dependencies for this scenario. Experience from micro benchmarks and real-world applications shows that our model provides not only ease of programming but also significant performance gains
International Nuclear Information System (INIS)
Liu, H.
1996-01-01
Computer simulations using the multi-particle code PARMELA with a three-dimensional point-by-point space charge algorithm have turned out to be very helpful in supporting injector commissioning and operations at Thomas Jefferson National Accelerator Facility (Jefferson Lab, formerly called CEBAF). However, this algorithm, which defines a typical N 2 problem in CPU time scaling, is very time-consuming when N, the number of macro-particles, is large. Therefore, it is attractive to use massively parallel processors (MPPs) to speed up the simulations. Motivated by this, the authors modified the space charge subroutine for using the MPPs of the Cray T3D. The techniques used to parallelize and optimize the code on the T3D are discussed in this paper. The performance of the code on the T3D is examined in comparison with a Parallel Vector Processing supercomputer of the Cray C90 and an HP 735/15 high-end workstation
Phase space simulation of collisionless stellar systems on the massively parallel processor
International Nuclear Information System (INIS)
White, R.L.
1987-01-01
A numerical technique for solving the collisionless Boltzmann equation describing the time evolution of a self gravitating fluid in phase space was implemented on the Massively Parallel Processor (MPP). The code performs calculations for a two dimensional phase space grid (with one space and one velocity dimension). Some results from calculations are presented. The execution speed of the code is comparable to the speed of a single processor of a Cray-XMP. Advantages and disadvantages of the MPP architecture for this type of problem are discussed. The nearest neighbor connectivity of the MPP array does not pose a significant obstacle. Future MPP-like machines should have much more local memory and easier access to staging memory and disks in order to be effective for this type of problem
Parallelization of Subchannel Analysis Code MATRA
International Nuclear Information System (INIS)
Kim, Seongjin; Hwang, Daehyun; Kwon, Hyouk
2014-01-01
A stand-alone calculation of MATRA code used up pertinent computing time for the thermal margin calculations while a relatively considerable time is needed to solve the whole core pin-by-pin problems. In addition, it is strongly required to improve the computation speed of the MATRA code to satisfy the overall performance of the multi-physics coupling calculations. Therefore, a parallel approach to improve and optimize the computability of the MATRA code is proposed and verified in this study. The parallel algorithm is embodied in the MATRA code using the MPI communication method and the modification of the previous code structure was minimized. An improvement is confirmed by comparing the results between the single and multiple processor algorithms. The speedup and efficiency are also evaluated when increasing the number of processors. The parallel algorithm was implemented to the subchannel code MATRA using the MPI. The performance of the parallel algorithm was verified by comparing the results with those from the MATRA with the single processor. It is also noticed that the performance of the MATRA code was greatly improved by implementing the parallel algorithm for the 1/8 core and whole core problems
Revealing the Physics of Galactic Winds Through Massively-Parallel Hydrodynamics Simulations
Schneider, Evan Elizabeth
This thesis documents the hydrodynamics code Cholla and a numerical study of multiphase galactic winds. Cholla is a massively-parallel, GPU-based code designed for astrophysical simulations that is freely available to the astrophysics community. A static-mesh Eulerian code, Cholla is ideally suited to carrying out massive simulations (> 20483 cells) that require very high resolution. The code incorporates state-of-the-art hydrodynamics algorithms including third-order spatial reconstruction, exact and linearized Riemann solvers, and unsplit integration algorithms that account for transverse fluxes on multidimensional grids. Operator-split radiative cooling and a dual-energy formalism for high mach number flows are also included. An extensive test suite demonstrates Cholla's superior ability to model shocks and discontinuities, while the GPU-native design makes the code extremely computationally efficient - speeds of 5-10 million cell updates per GPU-second are typical on current hardware for 3D simulations with all of the aforementioned physics. The latter half of this work comprises a comprehensive study of the mixing between a hot, supernova-driven wind and cooler clouds representative of those observed in multiphase galactic winds. Both adiabatic and radiatively-cooling clouds are investigated. The analytic theory of cloud-crushing is applied to the problem, and adiabatic turbulent clouds are found to be mixed with the hot wind on similar timescales as the classic spherical case (4-5 t cc) with an appropriate rescaling of the cloud-crushing time. Radiatively cooling clouds survive considerably longer, and the differences in evolution between turbulent and spherical clouds cannot be reconciled with a simple rescaling. The rapid incorporation of low-density material into the hot wind implies efficient mass-loading of hot phases of galactic winds. At the same time, the extreme compression of high-density cloud material leads to long-lived but slow-moving clumps
Massively parallel red-black algorithms for x-y-z response matrix equations
International Nuclear Information System (INIS)
Hanebutte, U.R.; Laurin-Kovitz, K.; Lewis, E.E.
1992-01-01
Recently, both discrete ordinates and spherical harmonic (S n and P n ) methods have been cast in the form of response matrices. In x-y geometry, massively parallel algorithms have been developed to solve the resulting response matrix equations on the Connection Machine family of parallel computers, the CM-2, CM-200, and CM-5. These algorithms utilize two-cycle iteration on a red-black checkerboard. In this work we examine the use of massively parallel red-black algorithms to solve response matric equations in three dimensions. This longer term objective is to utilize massively parallel algorithms to solve S n and/or P n response matrix problems. In this exploratory examination, however, we consider the simple 6 x 6 response matrices that are derivable from fine-mesh diffusion approximations in three dimensions
A massively parallel method of characteristic neutral particle transport code for GPUs
International Nuclear Information System (INIS)
Boyd, W. R.; Smith, K.; Forget, B.
2013-01-01
Over the past 20 years, parallel computing has enabled computers to grow ever larger and more powerful while scientific applications have advanced in sophistication and resolution. This trend is being challenged, however, as the power consumption for conventional parallel computing architectures has risen to unsustainable levels and memory limitations have come to dominate compute performance. Heterogeneous computing platforms, such as Graphics Processing Units (GPUs), are an increasingly popular paradigm for solving these issues. This paper explores the applicability of GPUs for deterministic neutron transport. A 2D method of characteristics (MOC) code - OpenMOC - has been developed with solvers for both shared memory multi-core platforms as well as GPUs. The multi-threading and memory locality methodologies for the GPU solver are presented. Performance results for the 2D C5G7 benchmark demonstrate 25-35 x speedup for MOC on the GPU. The lessons learned from this case study will provide the basis for further exploration of MOC on GPUs as well as design decisions for hardware vendors exploring technologies for the next generation of machines for scientific computing. (authors)
A massively parallel discrete ordinates response matrix method for neutron transport
International Nuclear Information System (INIS)
Hanebutte, U.R.; Lewis, E.E.
1992-01-01
In this paper a discrete ordinates response matrix method is formulated with anisotropic scattering for the solution of neutron transport problems on massively parallel computers. The response matrix formulation eliminates iteration on the scattering source. The nodal matrices that result from the diamond-differenced equations are utilized in a factored form that minimizes memory requirements and significantly reduces the number of arithmetic operations required per node. The red-black solution algorithm utilizes massive parallelism by assigning each spatial node to one or more processors. The algorithm is accelerated by a synthetic method in which the low-order diffusion equations are also solved by massively parallel red-black iterations. The method is implemented on a 16K Connection Machine-2, and S 8 and S 16 solutions are obtained for fixed-source benchmark problems in x-y geometry
Increasing the reach of forensic genetics with massively parallel sequencing.
Budowle, Bruce; Schmedes, Sarah E; Wendt, Frank R
2017-09-01
The field of forensic genetics has made great strides in the analysis of biological evidence related to criminal and civil matters. More so, the discipline has set a standard of performance and quality in the forensic sciences. The advent of massively parallel sequencing will allow the field to expand its capabilities substantially. This review describes the salient features of massively parallel sequencing and how it can impact forensic genetics. The features of this technology offer increased number and types of genetic markers that can be analyzed, higher throughput of samples, and the capability of targeting different organisms, all by one unifying methodology. While there are many applications, three are described where massively parallel sequencing will have immediate impact: molecular autopsy, microbial forensics and differentiation of monozygotic twins. The intent of this review is to expose the forensic science community to the potential enhancements that have or are soon to arrive and demonstrate the continued expansion the field of forensic genetics and its service in the investigation of legal matters.
A highly scalable massively parallel fast marching method for the Eikonal equation
Yang, Jianming; Stern, Frederick
2017-03-01
The fast marching method is a widely used numerical method for solving the Eikonal equation arising from a variety of scientific and engineering fields. It is long deemed inherently sequential and an efficient parallel algorithm applicable to large-scale practical applications is not available in the literature. In this study, we present a highly scalable massively parallel implementation of the fast marching method using a domain decomposition approach. Central to this algorithm is a novel restarted narrow band approach that coordinates the frequency of communications and the amount of computations extra to a sequential run for achieving an unprecedented parallel performance. Within each restart, the narrow band fast marching method is executed; simple synchronous local exchanges and global reductions are adopted for communicating updated data in the overlapping regions between neighboring subdomains and getting the latest front status, respectively. The independence of front characteristics is exploited through special data structures and augmented status tags to extract the masked parallelism within the fast marching method. The efficiency, flexibility, and applicability of the parallel algorithm are demonstrated through several examples. These problems are extensively tested on six grids with up to 1 billion points using different numbers of processes ranging from 1 to 65536. Remarkable parallel speedups are achieved using tens of thousands of processes. Detailed pseudo-codes for both the sequential and parallel algorithms are provided to illustrate the simplicity of the parallel implementation and its similarity to the sequential narrow band fast marching algorithm.
Neural nets for massively parallel optimization
Dixon, Laurence C. W.; Mills, David
1992-07-01
To apply massively parallel processing systems to the solution of large scale optimization problems it is desirable to be able to evaluate any function f(z), z (epsilon) Rn in a parallel manner. The theorem of Cybenko, Hecht Nielsen, Hornik, Stinchcombe and White, and Funahasi shows that this can be achieved by a neural network with one hidden layer. In this paper we address the problem of the number of nodes required in the layer to achieve a given accuracy in the function and gradient values at all points within a given n dimensional interval. The type of activation function needed to obtain nonsingular Hessian matrices is described and a strategy for obtaining accurate minimal networks presented.
Engineering-Based Thermal CFD Simulations on Massive Parallel Systems
Frisch, Jé rô me; Mundani, Ralf-Peter; Rank, Ernst; van Treeck, Christoph
2015-01-01
The development of parallel Computational Fluid Dynamics (CFD) codes is a challenging task that entails efficient parallelization concepts and strategies in order to achieve good scalability values when running those codes on modern supercomputers
International Nuclear Information System (INIS)
Lichters, R.; Pfund, R.E.W.; Meyer-ter-Vehn, J.
1997-08-01
The code LPIC++ presented here, is based on a one-dimensional, electromagnetic, relativistic PIC code that has originally been developed by one of the authors during a PhD thesis at the Max-Planck-Institut fuer Quantenoptik for kinetic simulations of high harmonic generation from overdense plasma surfaces. The code uses essentially the algorithm of Birdsall and Langdon and Villasenor and Bunemann. It is written in C++ in order to be easily extendable and has been parallelized to be able to grow in power linearly with the size of accessable hardware, e.g. massively parallel machines like Cray T3E. The parallel LPIC++ version uses PVM for communication between processors. PVM is public domain software, can be downloaded from the world wide web. A particular strength of LPIC++ lies in its clear program and data structure, which uses chained lists for the organization of grid cells and enables dynamic adjustment of spatial domain sizes in a very convenient way, and therefore easy balancing of processor loads. Also particles belonging to one cell are linked in a chained list and are immediately accessable from this cell. In addition to this convenient type of data organization in a PIC code, the code shows excellent performance in both its single processor and parallel version. (orig.)
International Nuclear Information System (INIS)
Ishizuki, Shigeru; Kawai, Wataru; Nemoto, Toshiyuki; Ogasawara, Shinobu; Kume, Etsuo; Adachi, Masaaki; Kawasaki, Nobuo; Yatake, Yo-ichi
2000-03-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system, the AP3000 system and the Paragon system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. We dealt with 12 codes in fiscal 1998. These results are reported in 3 parts, i.e., the vectorization and parallelization on vector processors part, the parallelization on scalar processors part and the porting part. In this report, we describe the vectorization and parallelization on vector processors. In this vectorization and parallelization on vector processors part, the vectorization of General Tokamak Circuit Simulation Program code GTCSP, the vectorization and parallelization of Molecular Dynamics NTV (n-particle, Temperature and Velocity) Simulation code MSP2, Eddy Current Analysis code EDDYCAL, Thermal Analysis Code for Test of Passive Cooling System by HENDEL T2 code THANPACST2 and MHD Equilibrium code SELENEJ on the VPP500 are described. In the parallelization on scalar processors part, the parallelization of Monte Carlo N-Particle Transport code MCNP4B2, Plasma Hydrodynamics code using Cubic Interpolated Propagation Method PHCIP and Vectorized Monte Carlo code (continuous energy model / multi-group model) MVP/GMVP on the Paragon are described. In the porting part, the porting of Monte Carlo N-Particle Transport code MCNP4B2 and Reactor Safety Analysis code RELAP5 on the AP3000 are described. (author)
Parallel processing of structural integrity analysis codes
International Nuclear Information System (INIS)
Swami Prasad, P.; Dutta, B.K.; Kushwaha, H.S.
1996-01-01
Structural integrity analysis forms an important role in assessing and demonstrating the safety of nuclear reactor components. This analysis is performed using analytical tools such as Finite Element Method (FEM) with the help of digital computers. The complexity of the problems involved in nuclear engineering demands high speed computation facilities to obtain solutions in reasonable amount of time. Parallel processing systems such as ANUPAM provide an efficient platform for realising the high speed computation. The development and implementation of software on parallel processing systems is an interesting and challenging task. The data and algorithm structure of the codes plays an important role in exploiting the parallel processing system capabilities. Structural analysis codes based on FEM can be divided into two categories with respect to their implementation on parallel processing systems. The first category codes such as those used for harmonic analysis, mechanistic fuel performance codes need not require the parallelisation of individual modules of the codes. The second category of codes such as conventional FEM codes require parallelisation of individual modules. In this category, parallelisation of equation solution module poses major difficulties. Different solution schemes such as domain decomposition method (DDM), parallel active column solver and substructuring method are currently used on parallel processing systems. Two codes, FAIR and TABS belonging to each of these categories have been implemented on ANUPAM. The implementation details of these codes and the performance of different equation solvers are highlighted. (author). 5 refs., 12 figs., 1 tab
Massively Parallel Finite Element Programming
Heister, Timo; Kronbichler, Martin; Bangerth, Wolfgang
2010-01-01
Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
Massively Parallel Finite Element Programming
Heister, Timo
2010-01-01
Today\\'s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
Parallelization methods study of thermal-hydraulics codes
International Nuclear Information System (INIS)
Gaudart, Catherine
2000-01-01
The variety of parallelization methods and machines leads to a wide selection for programmers. In this study we suggest, in an industrial context, some solutions from the experience acquired through different parallelization methods. The study is about several scientific codes which simulate a large variety of thermal-hydraulics phenomena. A bibliography on parallelization methods and a first analysis of the codes showed the difficulty of our process on the whole applications to study. Therefore, it would be necessary to identify and extract a representative part of these applications and parallelization methods. The linear solver part of the codes forced itself. On this particular part several parallelization methods had been used. From these developments one could estimate the necessary work for a non initiate programmer to parallelize his application, and the impact of the development constraints. The different methods of parallelization tested are the numerical library PETSc, the parallelizer PAF, the language HPF, the formalism PEI and the communications library MPI and PYM. In order to test several methods on different applications and to follow the constraint of minimization of the modifications in codes, a tool called SPS (Server of Parallel Solvers) had be developed. We propose to describe the different constraints about the optimization of codes in an industrial context, to present the solutions given by the tool SPS, to show the development of the linear solver part with the tested parallelization methods and lastly to compare the results against the imposed criteria. (author) [fr
Vectorization, parallelization and porting of nuclear codes. 2001
International Nuclear Information System (INIS)
Akiyama, Mitsunaga; Katakura, Fumishige; Kume, Etsuo; Nemoto, Toshiyuki; Tsuruoka, Takuya; Adachi, Masaaki
2003-07-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the super computer system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. We dealt with 10 codes in fiscal 2001. In this report, the parallelization of Neutron Radiography for 3 Dimensional CT code NR3DCT, the vectorization of unsteady-state heat conduction code THERMO3D, the porting of initial program of MHD simulation, the tuning of Heat And Mass Balance Analysis Code HAMBAC, the porting and parallelization of Monte Carlo N-Particle transport code MCNP4C3, the porting and parallelization of Monte Carlo N-Particle transport code system MCNPX2.1.5, the porting of induced activity calculation code CINAC-V4, the use of VisLink library in multidimensional two-fluid model code ACD3D and the porting of experiment data processing code from GS8500 to SR8000 are described. (author)
Energy Technology Data Exchange (ETDEWEB)
Adachi, Masaaki; Ogasawara, Shinobu; Kume, Etsuo [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment; Ishizuki, Shigeru; Nemoto, Toshiyuki; Kawasaki, Nobuo; Kawai, Wataru [Fujitsu Ltd., Tokyo (Japan); Yatake, Yo-ichi [Hitachi Ltd., Tokyo (Japan)
2001-02-01
Several computer codes in the nuclear field have been vectorized, parallelized and trans-ported on the FUJITSU VPP500 system, the AP3000 system, the SX-4 system and the Paragon system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. We dealt with 18 codes in fiscal 1999. These results are reported in 3 parts, i.e., the vectorization and the parallelization part on vector processors, the parallelization part on scalar processors and the porting part. In this report, we describe the vectorization and parallelization on vector processors. In this vectorization and parallelization on vector processors part, the vectorization of Relativistic Molecular Orbital Calculation code RSCAT, a microscopic transport code for high energy nuclear collisions code JAM, three-dimensional non-steady thermal-fluid analysis code STREAM, Relativistic Density Functional Theory code RDFT and High Speed Three-Dimensional Nodal Diffusion code MOSRA-Light on the VPP500 system and the SX-4 system are described. (author)
PARALLEL SPATIOTEMPORAL SPECTRAL CLUSTERING WITH MASSIVE TRAJECTORY DATA
Directory of Open Access Journals (Sweden)
Y. Z. Gu
2017-09-01
Full Text Available Massive trajectory data contains wealth useful information and knowledge. Spectral clustering, which has been shown to be effective in finding clusters, becomes an important clustering approaches in the trajectory data mining. However, the traditional spectral clustering lacks the temporal expansion on the algorithm and limited in its applicability to large-scale problems due to its high computational complexity. This paper presents a parallel spatiotemporal spectral clustering based on multiple acceleration solutions to make the algorithm more effective and efficient, the performance is proved due to the experiment carried out on the massive taxi trajectory dataset in Wuhan city, China.
cellGPU: Massively parallel simulations of dynamic vertex models
Sussman, Daniel M.
2017-10-01
Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
Development of massively parallel quantum chemistry program SMASH
International Nuclear Information System (INIS)
Ishimura, Kazuya
2015-01-01
A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C 150 H 30 ) 2 with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer
Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems
Albutiu, Martina-Cezara; Kemper, Alfons; Neumann, Thomas
2012-01-01
Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel mult...
Parallelization characteristics of the DeCART code
International Nuclear Information System (INIS)
Cho, J. Y.; Joo, H. G.; Kim, H. Y.; Lee, C. C.; Chang, M. H.; Zee, S. Q.
2003-12-01
This report is to describe the parallelization characteristics of the DeCART code and also examine its parallel performance. Parallel computing algorithms are implemented to DeCART to reduce the tremendous computational burden and memory requirement involved in the three-dimensional whole core transport calculation. In the parallelization of the DeCART code, the axial domain decomposition is first realized by using MPI (Message Passing Interface), and then the azimuthal angle domain decomposition by using either MPI or OpenMP. When using the MPI for both the axial and the angle domain decomposition, the concept of MPI grouping is employed for convenient communication in each communication world. For the parallel computation, most of all the computing modules except for the thermal hydraulic module are parallelized. These parallelized computing modules include the MOC ray tracing, CMFD, NEM, region-wise cross section preparation and cell homogenization modules. For the distributed allocation, most of all the MOC and CMFD/NEM variables are allocated only for the assigned planes, which reduces the required memory by a ratio of the number of the assigned planes to the number of all planes. The parallel performance of the DeCART code is evaluated by solving two problems, a rodded variation of the C5G7 MOX three-dimensional benchmark problem and a simplified three-dimensional SMART PWR core problem. In the aspect of parallel performance, the DeCART code shows a good speedup of about 40.1 and 22.4 in the ray tracing module and about 37.3 and 20.2 in the total computing time when using 48 CPUs on the IBM Regatta and 24 CPUs on the LINUX cluster, respectively. In the comparison between the MPI and OpenMP, OpenMP shows a somewhat better performance than MPI. Therefore, it is concluded that the first priority in the parallel computation of the DeCART code is in the axial domain decomposition by using MPI, and then in the angular domain using OpenMP, and finally the angular
Proxy-equation paradigm: A strategy for massively parallel asynchronous computations
Mittal, Ankita; Girimaji, Sharath
2017-09-01
Massively parallel simulations of transport equation systems call for a paradigm change in algorithm development to achieve efficient scalability. Traditional approaches require time synchronization of processing elements (PEs), which severely restricts scalability. Relaxing synchronization requirement introduces error and slows down convergence. In this paper, we propose and develop a novel "proxy equation" concept for a general transport equation that (i) tolerates asynchrony with minimal added error, (ii) preserves convergence order and thus, (iii) expected to scale efficiently on massively parallel machines. The central idea is to modify a priori the transport equation at the PE boundaries to offset asynchrony errors. Proof-of-concept computations are performed using a one-dimensional advection (convection) diffusion equation. The results demonstrate the promise and advantages of the present strategy.
Improvement of Parallel Algorithm for MATRA Code
International Nuclear Information System (INIS)
Kim, Seong-Jin; Seo, Kyong-Won; Kwon, Hyouk; Hwang, Dae-Hyun
2014-01-01
The feasibility study to parallelize the MATRA code was conducted in KAERI early this year. As a result, a parallel algorithm for the MATRA code has been developed to decrease a considerably required computing time to solve a bigsize problem such as a whole core pin-by-pin problem of a general PWR reactor and to improve an overall performance of the multi-physics coupling calculations. It was shown that the performance of the MATRA code was greatly improved by implementing the parallel algorithm using MPI communication. For problems of a 1/8 core and whole core for SMART reactor, a speedup was evaluated as about 10 when the numbers of used processor were 25. However, it was also shown that the performance deteriorated as the axial node number increased. In this paper, the procedure of a communication between processors is optimized to improve the previous parallel algorithm.. To improve the performance deterioration of the parallelized MATRA code, the communication algorithm between processors was newly presented. It was shown that the speedup was improved and stable regardless of the axial node number
Massive parallel 3D PIC simulation of negative ion extraction
Revel, Adrien; Mochalskyy, Serhiy; Montellano, Ivar Mauricio; Wünderlich, Dirk; Fantz, Ursel; Minea, Tiberiu
2017-09-01
The 3D PIC-MCC code ONIX is dedicated to modeling Negative hydrogen/deuterium Ion (NI) extraction and co-extraction of electrons from radio-frequency driven, low pressure plasma sources. It provides valuable insight on the complex phenomena involved in the extraction process. In previous calculations, a mesh size larger than the Debye length was used, implying numerical electron heating. Important steps have been achieved in terms of computation performance and parallelization efficiency allowing successful massive parallel calculations (4096 cores), imperative to resolve the Debye length. In addition, the numerical algorithms have been improved in terms of grid treatment, i.e., the electric field near the complex geometry boundaries (plasma grid) is calculated more accurately. The revised model preserves the full 3D treatment, but can take advantage of a highly refined mesh. ONIX was used to investigate the role of the mesh size, the re-injection scheme for lost particles (extracted or wall absorbed), and the electron thermalization process on the calculated extracted current and plasma characteristics. It is demonstrated that all numerical schemes give the same NI current distribution for extracted ions. Concerning the electrons, the pair-injection technique is found well-adapted to simulate the sheath in front of the plasma grid.
Development of massively parallel quantum chemistry program SMASH
Energy Technology Data Exchange (ETDEWEB)
Ishimura, Kazuya [Department of Theoretical and Computational Molecular Science, Institute for Molecular Science 38 Nishigo-Naka, Myodaiji, Okazaki, Aichi 444-8585 (Japan)
2015-12-31
A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C{sub 150}H{sub 30}){sub 2} with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.
Massively parallel sparse matrix function calculations with NTPoly
Dawson, William; Nakajima, Takahito
2018-04-01
We present NTPoly, a massively parallel library for computing the functions of sparse, symmetric matrices. The theory of matrix functions is a well developed framework with a wide range of applications including differential equations, graph theory, and electronic structure calculations. One particularly important application area is diagonalization free methods in quantum chemistry. When the input and output of the matrix function are sparse, methods based on polynomial expansions can be used to compute matrix functions in linear time. We present a library based on these methods that can compute a variety of matrix functions. Distributed memory parallelization is based on a communication avoiding sparse matrix multiplication algorithm. OpenMP task parallellization is utilized to implement hybrid parallelization. We describe NTPoly's interface and show how it can be integrated with programs written in many different programming languages. We demonstrate the merits of NTPoly by performing large scale calculations on the K computer.
High fidelity thermal-hydraulic analysis using CFD and massively parallel computers
International Nuclear Information System (INIS)
Weber, D.P.; Wei, T.Y.C.; Brewster, R.A.; Rock, Daniel T.; Rizwan-uddin
2000-01-01
Thermal-hydraulic analyses play an important role in design and reload analysis of nuclear power plants. These analyses have historically relied on early generation computational fluid dynamics capabilities, originally developed in the 1960s and 1970s. Over the last twenty years, however, dramatic improvements in both computational fluid dynamics codes in the commercial sector and in computing power have taken place. These developments offer the possibility of performing large scale, high fidelity, core thermal hydraulics analysis. Such analyses will allow a determination of the conservatism employed in traditional design approaches and possibly justify the operation of nuclear power systems at higher powers without compromising safety margins. The objective of this work is to demonstrate such a large scale analysis approach using a state of the art CFD code, STAR-CD, and the computing power of massively parallel computers, provided by IBM. A high fidelity representation of a current generation PWR was analyzed with the STAR-CD CFD code and the results were compared to traditional analyses based on the VIPRE code. Current design methodology typically involves a simplified representation of the assemblies, where a single average pin is used in each assembly to determine the hot assembly from a whole core analysis. After determining this assembly, increased refinement is used in the hot assembly, and possibly some of its neighbors, to refine the analysis for purposes of calculating DNBR. This latter calculation is performed with sub-channel codes such as VIPRE. The modeling simplifications that are used involve the approximate treatment of surrounding assemblies and coarse representation of the hot assembly, where the subchannel is the lowest level of discretization. In the high fidelity analysis performed in this study, both restrictions have been removed. Within the hot assembly, several hundred thousand to several million computational zones have been used, to
Massively parallel Fokker-Planck calculations
International Nuclear Information System (INIS)
Mirin, A.A.
1990-01-01
This paper reports that the Fokker-Planck package FPPAC, which solves the complete nonlinear multispecies Fokker-Planck collision operator for a plasma in two-dimensional velocity space, has been rewritten for the Connection Machine 2. This has involved allocation of variables either to the front end or the CM2, minimization of data flow, and replacement of Cray-optimized algorithms with ones suitable for a massively parallel architecture. Calculations have been carried out on various Connection Machines throughout the country. Results and timings on these machines have been compared to each other and to those on the static memory Cray-2. For large problem size, the Connection Machine 2 is found to be cost-efficient
Solving the Stokes problem on a massively parallel computer
DEFF Research Database (Denmark)
Axelsson, Owe; Barker, Vincent A.; Neytcheva, Maya
2001-01-01
boundary value problem for each velocity component, are solved by the conjugate gradient method with a preconditioning based on the algebraic multi‐level iteration (AMLI) technique. The velocity is found from the computed pressure. The method is optimal in the sense that the computational work...... is proportional to the number of unknowns. Further, it is designed to exploit a massively parallel computer with distributed memory architecture. Numerical experiments on a Cray T3E computer illustrate the parallel performance of the method....
Performance of Air Pollution Models on Massively Parallel Computers
DEFF Research Database (Denmark)
Brown, John; Hansen, Per Christian; Wasniewski, Jerzy
1996-01-01
To compare the performance and use of three massively parallel SIMD computers, we implemented a large air pollution model on the computers. Using a realistic large-scale model, we gain detailed insight about the performance of the three computers when used to solve large-scale scientific problems...
Nyx: Adaptive mesh, massively-parallel, cosmological simulation code
Almgren, Ann; Beckner, Vince; Friesen, Brian; Lukic, Zarija; Zhang, Weiqun
2017-12-01
Nyx code solves equations of compressible hydrodynamics on an adaptive grid hierarchy coupled with an N-body treatment of dark matter. The gas dynamics in Nyx use a finite volume methodology on an adaptive set of 3-D Eulerian grids; dark matter is represented as discrete particles moving under the influence of gravity. Particles are evolved via a particle-mesh method, using Cloud-in-Cell deposition/interpolation scheme. Both baryonic and dark matter contribute to the gravitational field. In addition, Nyx includes physics for accurately modeling the intergalactic medium; in optically thin limits and assuming ionization equilibrium, the code calculates heating and cooling processes of the primordial-composition gas in an ionizing ultraviolet background radiation field.
Massively parallel computation of conservation laws
Energy Technology Data Exchange (ETDEWEB)
Garbey, M [Univ. Claude Bernard, Villeurbanne (France); Levine, D [Argonne National Lab., IL (United States)
1990-01-01
The authors present a new method for computing solutions of conservation laws based on the use of cellular automata with the method of characteristics. The method exploits the high degree of parallelism available with cellular automata and retains important features of the method of characteristics. It yields high numerical accuracy and extends naturally to adaptive meshes and domain decomposition methods for perturbed conservation laws. They describe the method and its implementation for a Dirichlet problem with a single conservation law for the one-dimensional case. Numerical results for the one-dimensional law with the classical Burgers nonlinearity or the Buckley-Leverett equation show good numerical accuracy outside the neighborhood of the shocks. The error in the area of the shocks is of the order of the mesh size. The algorithm is well suited for execution on both massively parallel computers and vector machines. They present timing results for an Alliant FX/8, Connection Machine Model 2, and CRAY X-MP.
A solution for automatic parallelization of sequential assembly code
Directory of Open Access Journals (Sweden)
Kovačević Đorđe
2013-01-01
Full Text Available Since modern multicore processors can execute existing sequential programs only on a single core, there is a strong need for automatic parallelization of program code. Relying on existing algorithms, this paper describes one new software solution tool for parallelization of sequential assembly code. The main goal of this paper is to develop the parallelizator which reads sequential assembler code and at the output provides parallelized code for MIPS processor with multiple cores. The idea is the following: the parser translates assembler input file to program objects suitable for further processing. After that the static single assignment is done. Based on the data flow graph, the parallelization algorithm separates instructions on different cores. Once sequential code is parallelized by the parallelization algorithm, registers are allocated with the algorithm for linear allocation, and the result at the end of the program is distributed assembler code on each of the cores. In the paper we evaluate the speedup of the matrix multiplication example, which was processed by the parallelizator of assembly code. The result is almost linear speedup of code execution, which increases with the number of cores. The speed up on the two cores is 1.99, while on 16 cores the speed up is 13.88.
Massively Parallel Computing at Sandia and Its Application to National Defense
National Research Council Canada - National Science Library
Dosanjh, Sudip
1991-01-01
Two years ago, researchers at Sandia National Laboratories showed that a massively parallel computer with 1024 processors could solve scientific problems more than 1000 times faster than a single processor...
Energy Technology Data Exchange (ETDEWEB)
Lichtner, Peter C. [OFM Research, Redmond, WA (United States); Hammond, Glenn E. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lu, Chuan [Idaho National Lab. (INL), Idaho Falls, ID (United States); Karra, Satish [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Bisht, Gautam [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Andre, Benjamin [National Center for Atmospheric Research, Boulder, CO (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Mills, Richard [Intel Corporation, Portland, OR (United States); Univ. of Tennessee, Knoxville, TN (United States); Kumar, Jitendra [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
2015-01-20
PFLOTRAN solves a system of generally nonlinear partial differential equations describing multi-phase, multicomponent and multiscale reactive flow and transport in porous materials. The code is designed to run on massively parallel computing architectures as well as workstations and laptops (e.g. Hammond et al., 2011). Parallelization is achieved through domain decomposition using the PETSc (Portable Extensible Toolkit for Scientific Computation) libraries for the parallelization framework (Balay et al., 1997). PFLOTRAN has been developed from the ground up for parallel scalability and has been run on up to 218 processor cores with problem sizes up to 2 billion degrees of freedom. Written in object oriented Fortran 90, the code requires the latest compilers compatible with Fortran 2003. At the time of this writing this requires gcc 4.7.x, Intel 12.1.x and PGC compilers. As a requirement of running problems with a large number of degrees of freedom, PFLOTRAN allows reading input data that is too large to fit into memory allotted to a single processor core. The current limitation to the problem size PFLOTRAN can handle is the limitation of the HDF5 file format used for parallel IO to 32 bit integers. Noting that 2^{32} = 4; 294; 967; 296, this gives an estimate of the maximum problem size that can be currently run with PFLOTRAN. Hopefully this limitation will be remedied in the near future.
An Automatic Instruction-Level Parallelization of Machine Code
Directory of Open Access Journals (Sweden)
MARINKOVIC, V.
2018-02-01
Full Text Available Prevailing multicores and novel manycores have made a great challenge of modern day - parallelization of embedded software that is still written as sequential. In this paper, automatic code parallelization is considered, focusing on developing a parallelization tool at the binary level as well as on the validation of this approach. The novel instruction-level parallelization algorithm for assembly code which uses the register names after SSA to find independent blocks of code and then to schedule independent blocks using METIS to achieve good load balance is developed. The sequential consistency is verified and the validation is done by measuring the program execution time on the target architecture. Great speedup, taken as the performance measure in the validation process, and optimal load balancing are achieved for multicore RISC processors with 2 to 16 cores (e.g. MIPS, MicroBlaze, etc.. In particular, for 16 cores, the average speedup is 7.92x, while in some cases it reaches 14x. An approach to automatic parallelization provided by this paper is useful to researchers and developers in the area of parallelization as the basis for further optimizations, as the back-end of a compiler, or as the code parallelization tool for an embedded system.
Massively-parallel best subset selection for ordinary least-squares regression
DEFF Research Database (Denmark)
Gieseke, Fabian; Polsterer, Kai Lars; Mahabal, Ashish
2017-01-01
Selecting an optimal subset of k out of d features for linear regression models given n training instances is often considered intractable for feature spaces with hundreds or thousands of dimensions. We propose an efficient massively-parallel implementation for selecting such optimal feature...
Massive Asynchronous Parallelization of Sparse Matrix Factorizations
Energy Technology Data Exchange (ETDEWEB)
Chow, Edmond [Georgia Inst. of Technology, Atlanta, GA (United States)
2018-01-08
Solving sparse problems is at the core of many DOE computational science applications. We focus on the challenge of developing sparse algorithms that can fully exploit the parallelism in extreme-scale computing systems, in particular systems with massive numbers of cores per node. Our approach is to express a sparse matrix factorization as a large number of bilinear constraint equations, and then solving these equations via an asynchronous iterative method. The unknowns in these equations are the matrix entries of the factorization that is desired.
Implementation of PHENIX trigger algorithms on massively parallel computers
International Nuclear Information System (INIS)
Petridis, A.N.; Wohn, F.K.
1995-01-01
The event selection requirements of contemporary high energy and nuclear physics experiments are met by the introduction of on-line trigger algorithms which identify potentially interesting events and reduce the data acquisition rate to levels that are manageable by the electronics. Such algorithms being parallel in nature can be simulated off-line using massively parallel computers. The PHENIX experiment intends to investigate the possible existence of a new phase of matter called the quark gluon plasma which has been theorized to have existed in very early stages of the evolution of the universe by studying collisions of heavy nuclei at ultra-relativistic energies. Such interactions can also reveal important information regarding the structure of the nucleus and mandate a thorough investigation of the simpler proton-nucleus collisions at the same energies. The complexity of PHENIX events and the need to analyze and also simulate them at rates similar to the data collection ones imposes enormous computation demands. This work is a first effort to implement PHENIX trigger algorithms on parallel computers and to study the feasibility of using such machines to run the complex programs necessary for the simulation of the PHENIX detector response. Fine and coarse grain approaches have been studied and evaluated. Depending on the application the performance of a massively parallel computer can be much better or much worse than that of a serial workstation. A comparison between single instruction and multiple instruction computers is also made and possible applications of the single instruction machines to high energy and nuclear physics experiments are outlined. copyright 1995 American Institute of Physics
Parallel and vector implementation of APROS simulator code
International Nuclear Information System (INIS)
Niemi, J.; Tommiska, J.
1990-01-01
In this paper the vector and parallel processing implementation of a general purpose simulator code is discussed. In this code the utilization of vector processing is straightforward. In addition to the loop level parallel processing, the functional decomposition and the domain decomposition have been considered. Results represented for a PWR-plant simulation illustrate the potential speed-up factors of the alternatives. It turns out that the loop level parallelism and the domain decomposition are the most promising alternative to employ the parallel processing. (author)
Analysis of multigrid methods on massively parallel computers: Architectural implications
Matheson, Lesley R.; Tarjan, Robert E.
1993-01-01
We study the potential performance of multigrid algorithms running on massively parallel computers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10(exp 6) and 10(exp 9), respectively. Our models of parallel computation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages, (up to 1000 words) or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.
Research in Parallel Algorithms and Software for Computational Aerosciences
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Calafiura, Paolo; The ATLAS collaboration; Seuster, Rolf; Tsulaia, Vakhtang; van Gemmeren, Peter
2015-01-01
AthenaMP is a multi-process version of the ATLAS reconstruction and data analysis framework Athena. By leveraging Linux fork and copy-on-write, it allows the sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain confugurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows to run AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the...
Calafiura, Paolo; Seuster, Rolf; Tsulaia, Vakhtang; van Gemmeren, Peter
2015-01-01
AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows to run AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of Ath...
SWAMP+: multiple subsequence alignment using associative massive parallelism
Energy Technology Data Exchange (ETDEWEB)
Steinfadt, Shannon Irene [Los Alamos National Laboratory; Baker, Johnnie W [KENT STATE UNIV.
2010-10-18
A new parallel algorithm SWAMP+ incorporates the Smith-Waterman sequence alignment on an associative parallel model known as ASC. It is a highly sensitive parallel approach that expands traditional pairwise sequence alignment. This is the first parallel algorithm to provide multiple non-overlapping, non-intersecting subsequence alignments with the accuracy of Smith-Waterman. The efficient algorithm provides multiple alignments similar to BLAST while creating a better workflow for the end users. The parallel portions of the code run in O(m+n) time using m processors. When m = n, the algorithmic analysis becomes O(n) with a coefficient of two, yielding a linear speedup. Implementation of the algorithm on the SIMD ClearSpeed CSX620 confirms this theoretical linear speedup with real timings.
A massively-parallel electronic-structure calculations based on real-space density functional theory
International Nuclear Information System (INIS)
Iwata, Jun-Ichi; Takahashi, Daisuke; Oshiyama, Atsushi; Boku, Taisuke; Shiraishi, Kenji; Okada, Susumu; Yabana, Kazuhiro
2010-01-01
Based on the real-space finite-difference method, we have developed a first-principles density functional program that efficiently performs large-scale calculations on massively-parallel computers. In addition to efficient parallel implementation, we also implemented several computational improvements, substantially reducing the computational costs of O(N 3 ) operations such as the Gram-Schmidt procedure and subspace diagonalization. Using the program on a massively-parallel computer cluster with a theoretical peak performance of several TFLOPS, we perform electronic-structure calculations for a system consisting of over 10,000 Si atoms, and obtain a self-consistent electronic-structure in a few hundred hours. We analyze in detail the costs of the program in terms of computation and of inter-node communications to clarify the efficiency, the applicability, and the possibility for further improvements.
Parallel processing of Monte Carlo code MCNP for particle transport problem
Energy Technology Data Exchange (ETDEWEB)
Higuchi, Kenji; Kawasaki, Takuji
1996-06-01
It is possible to vectorize or parallelize Monte Carlo codes (MC code) for photon and neutron transport problem, making use of independency of the calculation for each particle. Applicability of existing MC code to parallel processing is mentioned. As for parallel computer, we have used both vector-parallel processor and scalar-parallel processor in performance evaluation. We have made (i) vector-parallel processing of MCNP code on Monte Carlo machine Monte-4 with four vector processors, (ii) parallel processing on Paragon XP/S with 256 processors. In this report we describe the methodology and results for parallel processing on two types of parallel or distributed memory computers. In addition, we mention the evaluation of parallel programming environments for parallel computers used in the present work as a part of the work developing STA (Seamless Thinking Aid) Basic Software. (author)
Massive hybrid parallelism for fully implicit multiphysics
International Nuclear Information System (INIS)
Gaston, D. R.; Permann, C. J.; Andrs, D.; Peterson, J. W.
2013-01-01
As hardware advances continue to modify the supercomputing landscape, traditional scientific software development practices will become more outdated, ineffective, and inefficient. The process of rewriting/retooling existing software for new architectures is a Sisyphean task, and results in substantial hours of development time, effort, and money. Software libraries which provide an abstraction of the resources provided by such architectures are therefore essential if the computational engineering and science communities are to continue to flourish in this modern computing environment. The Multiphysics Object Oriented Simulation Environment (MOOSE) framework enables complex multiphysics analysis tools to be built rapidly by scientists, engineers, and domain specialists, while also allowing them to both take advantage of current HPC architectures, and efficiently prepare for future supercomputer designs. MOOSE employs a hybrid shared-memory and distributed-memory parallel model and provides a complete and consistent interface for creating multiphysics analysis tools. In this paper, a brief discussion of the mathematical algorithms underlying the framework and the internal object-oriented hybrid parallel design are given. Representative massively parallel results from several applications areas are presented, and a brief discussion of future areas of research for the framework are provided. (authors)
Massive hybrid parallelism for fully implicit multiphysics
Energy Technology Data Exchange (ETDEWEB)
Gaston, D. R.; Permann, C. J.; Andrs, D.; Peterson, J. W. [Idaho National Laboratory, 2525 N. Fremont Ave., Idaho Falls, ID 83415 (United States)
2013-07-01
As hardware advances continue to modify the supercomputing landscape, traditional scientific software development practices will become more outdated, ineffective, and inefficient. The process of rewriting/retooling existing software for new architectures is a Sisyphean task, and results in substantial hours of development time, effort, and money. Software libraries which provide an abstraction of the resources provided by such architectures are therefore essential if the computational engineering and science communities are to continue to flourish in this modern computing environment. The Multiphysics Object Oriented Simulation Environment (MOOSE) framework enables complex multiphysics analysis tools to be built rapidly by scientists, engineers, and domain specialists, while also allowing them to both take advantage of current HPC architectures, and efficiently prepare for future supercomputer designs. MOOSE employs a hybrid shared-memory and distributed-memory parallel model and provides a complete and consistent interface for creating multiphysics analysis tools. In this paper, a brief discussion of the mathematical algorithms underlying the framework and the internal object-oriented hybrid parallel design are given. Representative massively parallel results from several applications areas are presented, and a brief discussion of future areas of research for the framework are provided. (authors)
MASSIVE HYBRID PARALLELISM FOR FULLY IMPLICIT MULTIPHYSICS
Energy Technology Data Exchange (ETDEWEB)
Cody J. Permann; David Andrs; John W. Peterson; Derek R. Gaston
2013-05-01
As hardware advances continue to modify the supercomputing landscape, traditional scientific software development practices will become more outdated, ineffective, and inefficient. The process of rewriting/retooling existing software for new architectures is a Sisyphean task, and results in substantial hours of development time, effort, and money. Software libraries which provide an abstraction of the resources provided by such architectures are therefore essential if the computational engineering and science communities are to continue to flourish in this modern computing environment. The Multiphysics Object Oriented Simulation Environment (MOOSE) framework enables complex multiphysics analysis tools to be built rapidly by scientists, engineers, and domain specialists, while also allowing them to both take advantage of current HPC architectures, and efficiently prepare for future supercomputer designs. MOOSE employs a hybrid shared-memory and distributed-memory parallel model and provides a complete and consistent interface for creating multiphysics analysis tools. In this paper, a brief discussion of the mathematical algorithms underlying the framework and the internal object-oriented hybrid parallel design are given. Representative massively parallel results from several applications areas are presented, and a brief discussion of future areas of research for the framework are provided.
Massively parallel Monte Carlo for many-particle simulations on GPUs
Energy Technology Data Exchange (ETDEWEB)
Anderson, Joshua A.; Jankowski, Eric [Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109 (United States); Grubb, Thomas L. [Department of Materials Science and Engineering, University of Michigan, Ann Arbor, MI 48109 (United States); Engel, Michael [Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109 (United States); Glotzer, Sharon C., E-mail: sglotzer@umich.edu [Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109 (United States); Department of Materials Science and Engineering, University of Michigan, Ann Arbor, MI 48109 (United States)
2013-12-01
Current trends in parallel processors call for the design of efficient massively parallel algorithms for scientific computing. Parallel algorithms for Monte Carlo simulations of thermodynamic ensembles of particles have received little attention because of the inherent serial nature of the statistical sampling. In this paper, we present a massively parallel method that obeys detailed balance and implement it for a system of hard disks on the GPU. We reproduce results of serial high-precision Monte Carlo runs to verify the method. This is a good test case because the hard disk equation of state over the range where the liquid transforms into the solid is particularly sensitive to small deviations away from the balance conditions. On a Tesla K20, our GPU implementation executes over one billion trial moves per second, which is 148 times faster than on a single Intel Xeon E5540 CPU core, enables 27 times better performance per dollar, and cuts energy usage by a factor of 13. With this improved performance we are able to calculate the equation of state for systems of up to one million hard disks. These large system sizes are required in order to probe the nature of the melting transition, which has been debated for the last forty years. In this paper we present the details of our computational method, and discuss the thermodynamics of hard disks separately in a companion paper.
Energy Technology Data Exchange (ETDEWEB)
Watanabe, Hideo; Kawai, Wataru; Nemoto, Toshiyuki [Fujitsu Ltd., Tokyo (Japan); and others
1997-12-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. These results are reported in 3 parts, i.e., the vectorization part, the parallelization part and the porting part. In this report, we describe the parallelization. In this parallelization part, the parallelization of 2-Dimensional relativistic electromagnetic particle code EM2D, Cylindrical Direct Numerical Simulation code CYLDNS and molecular dynamics code for simulating radiation damages in diamond crystals DGR are described. In the vectorization part, the vectorization of two and three dimensional discrete ordinates simulation code DORT-TORT, gas dynamics analysis code FLOWGR and relativistic Boltzmann-Uehling-Uhlenbeck simulation code RBUU are described. And then, in the porting part, the porting of reactor safety analysis code RELAP5/MOD3.2 and RELAP5/MOD3.2.1.2, nuclear data processing system NJOY and 2-D multigroup discrete ordinate transport code TWOTRAN-II are described. And also, a survey for the porting of command-driven interactive data analysis plotting program IPLOT are described. (author)
A massively parallel corpus: the Bible in 100 languages.
Christodouloupoulos, Christos; Steedman, Mark
We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.
Scientific programming on massively parallel processor CP-PACS
International Nuclear Information System (INIS)
Boku, Taisuke
1998-01-01
The massively parallel processor CP-PACS takes various problems of calculation physics as the object, and it has been designed so that its architecture has been devised to do various numerical processings. In this report, the outline of the CP-PACS and the example of programming in the Kernel CG benchmark in NAS Parallel Benchmarks, version 1, are shown, and the pseudo vector processing mechanism and the parallel processing tuning of scientific and technical computation utilizing the three-dimensional hyper crossbar net, which are two great features of the architecture of the CP-PACS are described. As for the CP-PACS, the PUs based on RISC processor and added with pseudo vector processor are used. Pseudo vector processing is realized as the loop processing by scalar command. The features of the connection net of PUs are explained. The algorithm of the NPB version 1 Kernel CG is shown. The part that takes the time for processing most in the main loop is the product of matrix and vector (matvec), and the parallel processing of the matvec is explained. The time for the computation by the CPU is determined. As the evaluation of the performance, the evaluation of the time for execution, the short vector processing of pseudo vector processor based on slide window, and the comparison with other parallel computers are reported. (K.I.)
Parallelization of simulation code for liquid-gas model of lattice-gas fluid
International Nuclear Information System (INIS)
Kawai, Wataru; Ebihara, Kenichi; Kume, Etsuo; Watanabe, Tadashi
2000-03-01
A simulation code for hydrodynamical phenomena which is based on the liquid-gas model of lattice-gas fluid is parallelized by using MPI (Message Passing Interface) library. The parallelized code can be applied to the larger size of the simulations than the non-parallelized code. The calculation times of the parallelized code on VPP500 (Vector-Parallel super computer with dispersed memory units), AP3000 (Scalar-parallel server with dispersed memory units), and a workstation cluster decreased in inverse proportion to the number of processors. (author)
CALTRANS: A parallel, deterministic, 3D neutronics code
Energy Technology Data Exchange (ETDEWEB)
Carson, L.; Ferguson, J.; Rogers, J.
1994-04-01
Our efforts to parallelize the deterministic solution of the neutron transport equation has culminated in a new neutronics code CALTRANS, which has full 3D capability. In this article, we describe the layout and algorithms of CALTRANS and present performance measurements of the code on a variety of platforms. Explicit implementation of the parallel algorithms of CALTRANS using both the function calls of the Parallel Virtual Machine software package (PVM 3.2) and the Meiko CS-2 tagged message passing library (based on the Intel NX/2 interface) are provided in appendices.
From parallel to distributed computing for reactive scattering calculations
International Nuclear Information System (INIS)
Lagana, A.; Gervasi, O.; Baraglia, R.
1994-01-01
Some reactive scattering codes have been ported on different innovative computer architectures ranging from massively parallel machines to clustered workstations. The porting has required a drastic restructuring of the codes to single out computationally decoupled cpu intensive subsections. The suitability of different theoretical approaches for parallel and distributed computing restructuring is discussed and the efficiency of related algorithms evaluated
Vectorization, parallelization and porting of nuclear codes (porting). Progress report fiscal 1998
International Nuclear Information System (INIS)
Nemoto, Toshiyuki; Kawai, Wataru; Ishizuki, Shigeru; Kawasaki, Nobuo; Kume, Etsuo; Adachi, Masaaki; Ogasawara, Shinobu
2000-03-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system, the AP3000 system and the Paragon system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. We dealt with 12 codes in fiscal 1998. These results are reported in 3 parts, i.e., the vectorization and parallelization on vector processors part, the parallelization on scalar processors part and the porting part. In this report, we describe the porting. In this porting part, the porting of Monte Carlo N-Particle Transport code MCNP4B2 and Reactor Safety Analysis code RELAP5 on the AP3000 are described. In the vectorization and parallelization on vector processors part, the vectorization of General Tokamak Circuit Simulation Program code GTCSP, the vectorization and parallelization of Molecular Dynamics Ntv Simulation code MSP2, Eddy Current Analysis code EDDYCAL, Thermal Analysis Code for Test of Passive Cooling System by HENDEL T2 code THANPACST2 and MHD Equilibrium code SELENEJ on the VPP500 are described. In the parallelization on scalar processors part, the parallelization of Monte Carlo N-Particle Transport code MCNP4B2, Plasma Hydrodynamics code using Cubic Interpolated propagation Method PHCIP and Vectorized Monte Carlo code (continuous energy model/multi-group model) MVP/GMVP on the Paragon are described. (author)
Directory of Open Access Journals (Sweden)
Fang Huang
2016-06-01
Full Text Available In some digital Earth engineering applications, spatial interpolation algorithms are required to process and analyze large amounts of data. Due to its powerful computing capacity, heterogeneous computing has been used in many applications for data processing in various fields. In this study, we explore the design and implementation of a parallel universal kriging spatial interpolation algorithm using the OpenCL programming model on heterogeneous computing platforms for massive Geo-spatial data processing. This study focuses primarily on transforming the hotspots in serial algorithms, i.e., the universal kriging interpolation function, into the corresponding kernel function in OpenCL. We also employ parallelization and optimization techniques in our implementation to improve the code performance. Finally, based on the results of experiments performed on two different high performance heterogeneous platforms, i.e., an NVIDIA graphics processing unit system and an Intel Xeon Phi system (MIC, we show that the parallel universal kriging algorithm can achieve the highest speedup of up to 40× with a single computing device and the highest speedup of up to 80× with multiple devices.
International Nuclear Information System (INIS)
1994-08-01
This is the first annual report of the MPP pilot project 93MPR05. In this pilot project four research groups with different, complementary backgrounds collaborate with the aim to develop new algorithms and codes to simulate the magnetohydrodynamics of thermonuclear and astrophysical plasmas on massively parallel machines. The expected speed-up is required to simulate the dynamics of the hot plasmas of interest which are characterized by very large magnetic Reynolds numbers and, hence, require high spatial and temporal resolutions (for details see section 1). The four research groups that collaborated to produce the results reported here are: The MHD group of Prof. Dr. J.P. Goedbloed at the FOM-Institute for Plasma Physics 'Rijnhuizen' in Nieuwegein, the group of Prof. Dr. H. van der Vorst at the Mathematics Institute of Utrecht University, the group of Prof. Dr. A.G. Hearn at the Astronomical Institute of Utrecht University, and the group of Dr. Ir. H.J.J. te Riele at the CWI in Amsterdam. The full project team met frequently during this first project year to discuss progress reports, current problems, etc. (see section 2). The main results of the first project year are: - Proof of the scalability of typical linear and nonlinear MHD codes - development and testing of a parallel version of the Arnoldi algorithm - development and testing of alternative methods for solving large non-Hermitian eigenvalue problems - porting of the 3D nonlinear semi-implicit time evolution code HERA to an MPP system. The steps that were scheduled to reach these intended results are given in section 3. (orig./WL)
Energy Technology Data Exchange (ETDEWEB)
NONE
1994-08-01
This is the first annual report of the MPP pilot project 93MPR05. In this pilot project four research groups with different, complementary backgrounds collaborate with the aim to develop new algorithms and codes to simulate the magnetohydrodynamics of thermonuclear and astrophysical plasmas on massively parallel machines. The expected speed-up is required to simulate the dynamics of the hot plasmas of interest which are characterized by very large magnetic Reynolds numbers and, hence, require high spatial and temporal resolutions (for details see section 1). The four research groups that collaborated to produce the results reported here are: The MHD group of Prof. Dr. J.P. Goedbloed at the FOM-Institute for Plasma Physics `Rijnhuizen` in Nieuwegein, the group of Prof. Dr. H. van der Vorst at the Mathematics Institute of Utrecht University, the group of Prof. Dr. A.G. Hearn at the Astronomical Institute of Utrecht University, and the group of Dr. Ir. H.J.J. te Riele at the CWI in Amsterdam. The full project team met frequently during this first project year to discuss progress reports, current problems, etc. (see section 2). The main results of the first project year are: - Proof of the scalability of typical linear and nonlinear MHD codes - development and testing of a parallel version of the Arnoldi algorithm - development and testing of alternative methods for solving large non-Hermitian eigenvalue problems - porting of the 3D nonlinear semi-implicit time evolution code HERA to an MPP system. The steps that were scheduled to reach these intended results are given in section 3. (orig./WL).
Parallelization of the MAAP-A code neutronics/thermal hydraulics coupling
International Nuclear Information System (INIS)
Froehle, P.H.; Wei, T.Y.C.; Weber, D.P.; Henry, R.E.
1998-01-01
A major new feature, one-dimensional space-time kinetics, has been added to a developmental version of the MAAP code through the introduction of the DIF3D-K module. This code is referred to as MAAP-A. To reduce the overall job time required, a capability has been provided to run the MAAP-A code in parallel. The parallel version of MAAP-A utilizes two machines running in parallel, with the DIF3D-K module executing on one machine and the rest of the MAAP-A code executing on the other machine. Timing results obtained during the development of the capability indicate that reductions in time of 30--40% are possible. The parallel version can be run on two SPARC 20 (SUN OS 5.5) workstations connected through the ethernet. MPI (Message Passing Interface standard) needs to be implemented on the machines. If necessary the parallel version can also be run on only one machine. The results obtained running in this one-machine mode identically match the results obtained from the serial version of the code
Climate models on massively parallel computers
International Nuclear Information System (INIS)
Vitart, F.; Rouvillois, P.
1993-01-01
First results got on massively parallel computers (Multiple Instruction Multiple Data and Simple Instruction Multiple Data) allow to consider building of coupled models with high resolutions. This would make possible simulation of thermoaline circulation and other interaction phenomena between atmosphere and ocean. The increasing of computers powers, and then the improvement of resolution will go us to revise our approximations. Then hydrostatic approximation (in ocean circulation) will not be valid when the grid mesh will be of a dimension lower than a few kilometers: We shall have to find other models. The expert appraisement got in numerical analysis at the Center of Limeil-Valenton (CEL-V) will be used again to imagine global models taking in account atmosphere, ocean, ice floe and biosphere, allowing climate simulation until a regional scale
Vectorization, parallelization and implementation of Quantum molecular dynamics codes (QQQF, MONTEV)
Energy Technology Data Exchange (ETDEWEB)
Kato, Kaori [High Energy Accelerator Research Organization, Tsukuba, Ibaraki (Japan); Kunugi, Tomoaki; Kotake, Susumu; Shibahara, Masahiko
1998-03-01
This report describes parallelization, vectorization and implementation for two simulation codes, Quantum molecular dynamics simulation code QQQF and Photon montecalro molecular dynamics simulation code MONTEV, that have been developed for the analysis of the thermalization of photon energies in the molecule or materials. QQQF has been vectorized and parallelized on Fujitsu VPP and has been implemented from VPP to Intel Paragon XP/S and parallelized. MONTEV has been implemented from VPP to Paragon and parallelized. (author)
Micro-mechanical Simulations of Soils using Massively Parallel Supercomputers
Directory of Open Access Journals (Sweden)
David W. Washington
2004-06-01
Full Text Available In this research a computer program, Trubal version 1.51, based on the Discrete Element Method was converted to run on a Connection Machine (CM-5,a massively parallel supercomputer with 512 nodes, to expedite the computational times of simulating Geotechnical boundary value problems. The dynamic memory algorithm in Trubal program did not perform efficiently in CM-2 machine with the Single Instruction Multiple Data (SIMD architecture. This was due to the communication overhead involving global array reductions, global array broadcast and random data movement. Therefore, a dynamic memory algorithm in Trubal program was converted to a static memory arrangement and Trubal program was successfully converted to run on CM-5 machines. The converted program was called "TRUBAL for Parallel Machines (TPM." Simulating two physical triaxial experiments and comparing simulation results with Trubal simulations validated the TPM program. With a 512 nodes CM-5 machine TPM produced a nine-fold speedup demonstrating the inherent parallelism within algorithms based on the Discrete Element Method.
Directory of Open Access Journals (Sweden)
Cronn Richard
2009-12-01
Full Text Available Abstract Background Molecular evolutionary studies share the common goal of elucidating historical relationships, and the common challenge of adequately sampling taxa and characters. Particularly at low taxonomic levels, recent divergence, rapid radiations, and conservative genome evolution yield limited sequence variation, and dense taxon sampling is often desirable. Recent advances in massively parallel sequencing make it possible to rapidly obtain large amounts of sequence data, and multiplexing makes extensive sampling of megabase sequences feasible. Is it possible to efficiently apply massively parallel sequencing to increase phylogenetic resolution at low taxonomic levels? Results We reconstruct the infrageneric phylogeny of Pinus from 37 nearly-complete chloroplast genomes (average 109 kilobases each of an approximately 120 kilobase genome generated using multiplexed massively parallel sequencing. 30/33 ingroup nodes resolved with ≥ 95% bootstrap support; this is a substantial improvement relative to prior studies, and shows massively parallel sequencing-based strategies can produce sufficient high quality sequence to reach support levels originally proposed for the phylogenetic bootstrap. Resampling simulations show that at least the entire plastome is necessary to fully resolve Pinus, particularly in rapidly radiating clades. Meta-analysis of 99 published infrageneric phylogenies shows that whole plastome analysis should provide similar gains across a range of plant genera. A disproportionate amount of phylogenetic information resides in two loci (ycf1, ycf2, highlighting their unusual evolutionary properties. Conclusion Plastome sequencing is now an efficient option for increasing phylogenetic resolution at lower taxonomic levels in plant phylogenetic and population genetic analyses. With continuing improvements in sequencing capacity, the strategies herein should revolutionize efforts requiring dense taxon and character sampling
Vectorization and parallelization of a production reactor assembly code
International Nuclear Information System (INIS)
Vujic, J.L.; Martin, W.R.; Michigan Univ., Ann Arbor, MI
1991-01-01
In order to use efficiently the new features of supercomputers, production codes, usually written 10 -20 years ago, must be tailored for modern computer architectures. We have chosen to optimize the CPM-2 code, a production reactor assembly code based on the collision probability transport method. Substantial speedup in the execution times was obtained with the parallel/vector version of the CPM-2 code. In addition, we have developed a new transfer probability method, which removes some of the modelling limitations of the collision probability method encoded in the CPM-2 code, and can fully utilize the parallel/vector architecture of a multiprocessor IBM 3090. (author)
Vectorization and parallelization of a production reactor assembly code
International Nuclear Information System (INIS)
Vujic, J.L.; Martin, W.R.
1991-01-01
In order to efficiently use new features of supercomputers, production codes, usually written 10 - 20 years ago, must be tailored for modern computer architectures. We have chosen to optimize the CPM-2 code, a production reactor assembly code based on the collision probability transport method. Substantial speedups in the execution times were obtained with the parallel/vector version of the CPM-2 code. In addition, we have developed a new transfer probability method, which removes some of the modelling limitations of the collision probability method encoded in the CPM-2 code, and can fully utilize parallel/vector architecture of a multiprocessor IBM 3090. (author)
Reduced complexity and latency for a massive MIMO system using a parallel detection algorithm
Directory of Open Access Journals (Sweden)
Shoichi Higuchi
2017-09-01
Full Text Available In recent years, massive MIMO systems have been widely researched to realize high-speed data transmission. Since massive MIMO systems use a large number of antennas, these systems require huge complexity to detect the signal. In this paper, we propose a novel detection method for massive MIMO using parallel detection with maximum likelihood detection with QR decomposition and M-algorithm (QRM-MLD to reduce the complexity and latency. The proposed scheme obtains an R matrix after permutation of an H matrix and QR decomposition. The R matrix is also eliminated using a Gauss–Jordan elimination method. By using a modified R matrix, the proposed method can detect the transmitted signal using parallel detection. From the simulation results, the proposed scheme can achieve a reduced complexity and latency with a little degradation of the bit error rate (BER performance compared with the conventional method.
International Nuclear Information System (INIS)
Anderson, D.V.; Shumaker, D.E.
1993-01-01
From a computational standpoint, particle simulation calculations for plasmas have not adapted well to the transitions from scalar to vector processing nor from serial to parallel environments. They have suffered from inordinate and excessive accessing of computer memory and have been hobbled by relatively inefficient gather-scatter constructs resulting from the use of indirect indexing. Lastly, the many-to-one mapping characteristic of the deposition phase has made it difficult to perform this in parallel. The authors' code sorts and reorders the particles in a spatial order. This allows them to greatly reduce the memory references, to run in directly indexed vector mode, and to employ domain decomposition to achieve parallelization. In this hybrid simulation the electrons are modeled as a fluid and the field equations solved are obtained from the electron momentum equation together with the pre-Maxwell equations (displacement current neglected). Either zero or finite electron mass can be used in the electron model. The resulting field equations are solved with an iteratively explicit procedure which is thus trivial to parallelize. Likewise, the field interpolations and the particle pushing is simple to parallelize. The deposition, sorting, and reordering phases are less simple and it is for these that the authors present detailed algorithms. They have now successfully tested the parallel version of HOPS in serial mode and it is now being readied for parallel execution on the Cray C-90. They will then port HOPS to a massively parallel computer, in the next year
Parallel Scaling Characteristics of Selected NERSC User ProjectCodes
Energy Technology Data Exchange (ETDEWEB)
Skinner, David; Verdier, Francesca; Anand, Harsh; Carter,Jonathan; Durst, Mark; Gerber, Richard
2005-03-05
This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems. An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.
DEFF Research Database (Denmark)
Romero, N. A.; Glinsvad, Christian; Larsen, Ask Hjorth
2013-01-01
Density function theory (DFT) is the most widely employed electronic structure method because of its favorable scaling with system size and accuracy for a broad range of molecular and condensed-phase systems. The advent of massively parallel supercomputers has enhanced the scientific community...
Application of Raptor-M3G to reactor dosimetry problems on massively parallel architectures - 026
International Nuclear Information System (INIS)
Longoni, G.
2010-01-01
The solution of complex 3-D radiation transport problems requires significant resources both in terms of computation time and memory availability. Therefore, parallel algorithms and multi-processor architectures are required to solve efficiently large 3-D radiation transport problems. This paper presents the application of RAPTOR-M3G (Rapid Parallel Transport Of Radiation - Multiple 3D Geometries) to reactor dosimetry problems. RAPTOR-M3G is a newly developed parallel computer code designed to solve the discrete ordinates (SN) equations on multi-processor computer architectures. This paper presents the results for a reactor dosimetry problem using a 3-D model of a commercial 2-loop pressurized water reactor (PWR). The accuracy and performance of RAPTOR-M3G will be analyzed and the numerical results obtained from the calculation will be compared directly to measurements of the neutron field in the reactor cavity air gap. The parallel performance of RAPTOR-M3G on massively parallel architectures, where the number of computing nodes is in the order of hundreds, will be analyzed up to four hundred processors. The performance results will be presented based on two supercomputing architectures: the POPLE supercomputer operated by the Pittsburgh Supercomputing Center and the Westinghouse computer cluster. The Westinghouse computer cluster is equipped with a standard Ethernet network connection and an InfiniBand R interconnects capable of a bandwidth in excess of 20 GBit/sec. Therefore, the impact of the network architecture on RAPTOR-M3G performance will be analyzed as well. (authors)
Introduction to massively-parallel computing in high-energy physics
AUTHOR|(CDS)2083520
1993-01-01
Ever since computers were first used for scientific and numerical work, there has existed an "arms race" between the technical development of faster computing hardware, and the desires of scientists to solve larger problems in shorter time-scales. However, the vast leaps in processor performance achieved through advances in semi-conductor science have reached a hiatus as the technology comes up against the physical limits of the speed of light and quantum effects. This has lead all high performance computer manufacturers to turn towards a parallel architecture for their new machines. In these lectures we will introduce the history and concepts behind parallel computing, and review the various parallel architectures and software environments currently available. We will then introduce programming methodologies that allow efficient exploitation of parallel machines, and present case studies of the parallelization of typical High Energy Physics codes for the two main classes of parallel computing architecture (S...
Parallelization of a Monte Carlo particle transport simulation code
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
Image processing with massively parallel computer Quadrics Q1
International Nuclear Information System (INIS)
Della Rocca, A.B.; La Porta, L.; Ferriani, S.
1995-05-01
Aimed to evaluate the image processing capabilities of the massively parallel computer Quadrics Q1, a convolution algorithm that has been implemented is described in this report. At first the discrete convolution mathematical definition is recalled together with the main Q1 h/w and s/w features. Then the different codification forms of the algorythm are described and the Q1 performances are compared with those obtained by different computers. Finally, the conclusions report on main results and suggestions
Mesh-based parallel code coupling interface
Energy Technology Data Exchange (ETDEWEB)
Wolf, K.; Steckel, B. (eds.) [GMD - Forschungszentrum Informationstechnik GmbH, St. Augustin (DE). Inst. fuer Algorithmen und Wissenschaftliches Rechnen (SCAI)
2001-04-01
MpCCI (mesh-based parallel code coupling interface) is an interface for multidisciplinary simulations. It provides industrial end-users as well as commercial code-owners with the facility to combine different simulation tools in one environment. Thereby new solutions for multidisciplinary problems will be created. This opens new application dimensions for existent simulation tools. This Book of Abstracts gives a short overview about ongoing activities in industry and research - all presented at the 2{sup nd} MpCCI User Forum in February 2001 at GMD Sankt Augustin. (orig.) [German] MpCCI (mesh-based parallel code coupling interface) definiert eine Schnittstelle fuer multidisziplinaere Simulationsanwendungen. Sowohl industriellen Anwender als auch kommerziellen Softwarehersteller wird mit MpCCI die Moeglichkeit gegeben, Simulationswerkzeuge unterschiedlicher Disziplinen miteinander zu koppeln. Dadurch entstehen neue Loesungen fuer multidisziplinaere Problemstellungen und fuer etablierte Simulationswerkzeuge ergeben sich neue Anwendungsfelder. Dieses Book of Abstracts bietet einen Ueberblick ueber zur Zeit laufende Arbeiten in der Industrie und in der Forschung, praesentiert auf dem 2{sup nd} MpCCI User Forum im Februar 2001 an der GMD Sankt Augustin. (orig.)
Implementation of a Monte Carlo algorithm for neutron transport on a massively parallel SIMD machine
International Nuclear Information System (INIS)
Baker, R.S.
1992-01-01
We present some results from the recent adaptation of a vectorized Monte Carlo algorithm to a massively parallel architecture. The performance of the algorithm on a single processor Cray Y-MP and a Thinking Machine Corporations CM-2 and CM-200 is compared for several test problems. The results show that significant speedups are obtainable for vectorized Monte Carlo algorithms on massively parallel machines, even when the algorithms are applied to realistic problems which require extensive variance reduction. However, the architecture of the Connection Machine does place some limitations on the regime in which the Monte Carlo algorithm may be expected to perform well
Implementation of a Monte Carlo algorithm for neutron transport on a massively parallel SIMD machine
International Nuclear Information System (INIS)
Baker, R.S.
1993-01-01
We present some results from the recent adaptation of a vectorized Monte Carlo algorithm to a massively parallel architecture. The performance of the algorithm on a single processor Cray Y-MP and a Thinking Machine Corporations CM-2 and CM-200 is compared for several test problems. The results show that significant speedups are obtainable for vectorized Monte Carlo algorithms on massively parallel machines, even when the algorithms are applied to realistic problems which require extensive variance reduction. However, the architecture of the Connection Machine does place some limitations on the regime in which the Monte Carlo algorithm may be expected to perform well. (orig.)
Parallel Computing Characteristics of Two-Phase Thermal-Hydraulics code, CUPID
International Nuclear Information System (INIS)
Lee, Jae Ryong; Yoon, Han Young
2013-01-01
Parallelized CUPID code has proved to be able to reproduce multi-dimensional thermal hydraulic analysis by validating with various conceptual problems and experimental data. In this paper, the characteristics of the parallelized CUPID code were investigated. Both single- and two phase simulation are taken into account. Since the scalability of a parallel simulation is known to be better for fine mesh system, two types of mesh system are considered. In addition, the dependency of the preconditioner for matrix solver was also compared. The scalability for the single-phase flow is better than that for two-phase flow due to the less numbers of iterations for solving pressure matrix. The CUPID code was investigated the parallel performance in terms of scalability. The CUPID code was parallelized with domain decomposition method. The MPI library was adopted to communicate the information at the interface cells. As increasing the number of mesh, the scalability is improved. For a given mesh, single-phase flow simulation with diagonal preconditioner shows the best speedup. However, for the two-phase flow simulation, the ILU preconditioner is recommended since it reduces the overall simulation time
International Nuclear Information System (INIS)
Yamazaki, Takao; Fujisaki, Masahide; Okuda, Motoi; Takano, Makoto; Masukawa, Fumihiro; Naito, Yoshitaka
1993-01-01
The general purpose Monte Carlo code MCNP4 has been implemented on the Fujitsu AP1000 distributed memory highly parallel computer. Parallelization techniques developed and studied are reported. A shielding analysis function of the MCNP4 code is parallelized in this study. A technique to map a history to each processor dynamically and to map control process to a certain processor was applied. The efficiency of parallelized code is up to 80% for a typical practical problem with 512 processors. These results demonstrate the advantages of a highly parallel computer to the conventional computers in the field of shielding analysis by Monte Carlo method. (orig.)
Schultz, A.
2010-12-01
3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We
Implementation of QR up- and downdating on a massively parallel |computer
DEFF Research Database (Denmark)
Bendtsen, Claus; Hansen, Per Christian; Madsen, Kaj
1995-01-01
We describe an implementation of QR up- and downdating on a massively parallel computer (the Connection Machine CM-200) and show that the algorithm maps well onto the computer. In particular, we show how the use of corrected semi-normal equations for downdating can be efficiently implemented. We...... also illustrate the use of our algorithms in a new LP algorithm....
Towards Interactive Visual Exploration of Parallel Programs using a Domain-Specific Language
Klein, Tobias; Bruckner, Stefan; Grö ller, M. Eduard; Hadwiger, Markus; Rautek, Peter
2016-01-01
The use of GPUs and the massively parallel computing paradigm have become wide-spread. We describe a framework for the interactive visualization and visual analysis of the run-time behavior of massively parallel programs, especially OpenCL kernels. This facilitates understanding a program's function and structure, finding the causes of possible slowdowns, locating program bugs, and interactively exploring and visually comparing different code variants in order to improve performance and correctness. Our approach enables very specific, user-centered analysis, both in terms of the recording of the run-time behavior and the visualization itself. Instead of having to manually write instrumented code to record data, simple code annotations tell the source-to-source compiler which code instrumentation to generate automatically. The visualization part of our framework then enables the interactive analysis of kernel run-time behavior in a way that can be very specific to a particular problem or optimization goal, such as analyzing the causes of memory bank conflicts or understanding an entire parallel algorithm.
Towards Interactive Visual Exploration of Parallel Programs using a Domain-Specific Language
Klein, Tobias
2016-04-19
The use of GPUs and the massively parallel computing paradigm have become wide-spread. We describe a framework for the interactive visualization and visual analysis of the run-time behavior of massively parallel programs, especially OpenCL kernels. This facilitates understanding a program\\'s function and structure, finding the causes of possible slowdowns, locating program bugs, and interactively exploring and visually comparing different code variants in order to improve performance and correctness. Our approach enables very specific, user-centered analysis, both in terms of the recording of the run-time behavior and the visualization itself. Instead of having to manually write instrumented code to record data, simple code annotations tell the source-to-source compiler which code instrumentation to generate automatically. The visualization part of our framework then enables the interactive analysis of kernel run-time behavior in a way that can be very specific to a particular problem or optimization goal, such as analyzing the causes of memory bank conflicts or understanding an entire parallel algorithm.
Study on MPI/OpenMP hybrid parallelism for Monte Carlo neutron transport code
International Nuclear Information System (INIS)
Liang Jingang; Xu Qi; Wang Kan; Liu Shiwen
2013-01-01
Parallel programming with mixed mode of messages-passing and shared-memory has several advantages when used in Monte Carlo neutron transport code, such as fitting hardware of distributed-shared clusters, economizing memory demand of Monte Carlo transport, improving parallel performance, and so on. MPI/OpenMP hybrid parallelism was implemented based on a one dimension Monte Carlo neutron transport code. Some critical factors affecting the parallel performance were analyzed and solutions were proposed for several problems such as contention access, lock contention and false sharing. After optimization the code was tested finally. It is shown that the hybrid parallel code can reach good performance just as pure MPI parallel program, while it saves a lot of memory usage at the same time. Therefore hybrid parallel is efficient for achieving large-scale parallel of Monte Carlo neutron transport. (authors)
(Nearly) portable PIC code for parallel computers
International Nuclear Information System (INIS)
Decyk, V.K.
1993-01-01
As part of the Numerical Tokamak Project, the author has developed a (nearly) portable, one dimensional version of the GCPIC algorithm for particle-in-cell codes on parallel computers. This algorithm uses a spatial domain decomposition for the fields, and passes particles from one domain to another as the particles move spatially. With only minor changes, the code has been run in parallel on the Intel Delta, the Cray C-90, the IBM ES/9000 and a cluster of workstations. After a line by line translation into cmfortran, the code was also run on the CM-200. Impressive speeds have been achieved, both on the Intel Delta and the Cray C-90, around 30 nanoseconds per particle per time step. In addition, the author was able to isolate the data management modules, so that the physics modules were not changed much from their sequential version, and the data management modules can be used as open-quotes black boxes.close quotes
International Nuclear Information System (INIS)
Byers, J.A.; Williams, T.J.; Cohen, B.I.; Dimits, A.M.
1994-01-01
One of the programs of the Magnetic fusion Energy (MFE) Theory and computations Program is studying the anomalous transport of thermal energy across the field lines in the core of a tokamak. We use the method of gyrokinetic particle-in-cell simulation in this study. For this LDRD project we employed massively parallel processing, new algorithms, and new algorithms, and new formal techniques to improve this research. Specifically, we sought to take steps toward: researching experimentally-relevant parameters in our simulations, learning parallel computing to have as a resource for our group, and achieving a 100 x speedup over our starting-point Cray2 simulation code's performance
Massively Parallel Single-Molecule Manipulation Using Centrifugal Force
Wong, Wesley; Halvorsen, Ken
2011-03-01
Precise manipulation of single molecules has led to remarkable insights in physics, chemistry, biology, and medicine. However, two issues that have impeded the widespread adoption of these techniques are equipment cost and the laborious nature of making measurements one molecule at a time. To meet these challenges, we have developed an approach that enables massively parallel single- molecule force measurements using centrifugal force. This approach is realized in the centrifuge force microscope, an instrument in which objects in an orbiting sample are subjected to a calibration-free, macroscopically uniform force- field while their micro-to-nanoscopic motions are observed. We demonstrate high- throughput single-molecule force spectroscopy with this technique by performing thousands of rupture experiments in parallel, characterizing force-dependent unbinding kinetics of an antibody-antigen pair in minutes rather than days. Currently, we are taking steps to integrate high-resolution detection, fluorescence, temperature control and a greater dynamic range in force. With significant benefits in efficiency, cost, simplicity, and versatility, single-molecule centrifugation has the potential to expand single-molecule experimentation to a wider range of researchers and experimental systems.
Parallelization Issues and Particle-In Codes.
Elster, Anne Cathrine
1994-01-01
"Everything should be made as simple as possible, but not simpler." Albert Einstein. The field of parallel scientific computing has concentrated on parallelization of individual modules such as matrix solvers and factorizers. However, many applications involve several interacting modules. Our analyses of a particle-in-cell code modeling charged particles in an electric field, show that these accompanying dependencies affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization. Our test-bed, a KSR1, is a distributed memory machine with a globally shared addressing space. However, most of the new methods presented hold generally for hierarchical and/or distributed memory systems. We introduce a novel approach that uses dual pointers on the local particle arrays to keep the particle locations automatically partially sorted. Complexity and performance analyses with accompanying KSR benchmarks, have been included for both this scheme and for the traditional replicated grids approach. The latter approach maintains load-balance with respect to particles. However, our results demonstrate it fails to scale properly for problems with large grids (say, greater than 128-by-128) running on as few as 15 KSR nodes, since the extra storage and computation time associated with adding the grid copies, becomes significant. Our grid partitioning scheme, although harder to implement, does not need to replicate the whole grid. Consequently, it scales well for large problems on highly parallel systems. It may, however, require load balancing schemes for non-uniform particle distributions. Our dual pointer approach may facilitate this through dynamically partitioned grids. We also introduce hierarchical data structures that store neighboring grid-points within the same cache -line by reordering the grid indexing. This alignment produces a 25% savings in cache-hits for a 4-by-4 cache. A consideration of the input data's effect on
International Nuclear Information System (INIS)
Deng Li; Xie Zhongsheng
1999-01-01
The coupled neutron and photon transport Monte Carlo code MCNP (version 3B) has been parallelized in parallel virtual machine (PVM) and message passing interface (MPI) by modifying a previous serial code. The new code has been verified by solving sample problems. The speedup increases linearly with the number of processors and the average efficiency is up to 99% for 12-processor. (author)
MADmap: A Massively Parallel Maximum-Likelihood Cosmic Microwave Background Map-Maker
Energy Technology Data Exchange (ETDEWEB)
Cantalupo, Christopher; Borrill, Julian; Jaffe, Andrew; Kisner, Theodore; Stompor, Radoslaw
2009-06-09
MADmap is a software application used to produce maximum-likelihood images of the sky from time-ordered data which include correlated noise, such as those gathered by Cosmic Microwave Background (CMB) experiments. It works efficiently on platforms ranging from small workstations to the most massively parallel supercomputers. Map-making is a critical step in the analysis of all CMB data sets, and the maximum-likelihood approach is the most accurate and widely applicable algorithm; however, it is a computationally challenging task. This challenge will only increase with the next generation of ground-based, balloon-borne and satellite CMB polarization experiments. The faintness of the B-mode signal that these experiments seek to measure requires them to gather enormous data sets. MADmap is already being run on up to O(1011) time samples, O(108) pixels and O(104) cores, with ongoing work to scale to the next generation of data sets and supercomputers. We describe MADmap's algorithm based around a preconditioned conjugate gradient solver, fast Fourier transforms and sparse matrix operations. We highlight MADmap's ability to address problems typically encountered in the analysis of realistic CMB data sets and describe its application to simulations of the Planck and EBEX experiments. The massively parallel and distributed implementation is detailed and scaling complexities are given for the resources required. MADmap is capable of analysing the largest data sets now being collected on computing resources currently available, and we argue that, given Moore's Law, MADmap will be capable of reducing the most massive projected data sets.
Heydt, Carina; Fassunke, Jana; Künstlinger, Helen; Ihle, Michaela Angelika; König, Katharina; Heukamp, Lukas Carl; Schildhaus, Hans-Ulrich; Odenthal, Margarete; Büttner, Reinhard; Merkelbach-Bruse, Sabine
2014-01-01
Over the last years, massively parallel sequencing has rapidly evolved and has now transitioned into molecular pathology routine laboratories. It is an attractive platform for analysing multiple genes at the same time with very little input material. Therefore, the need for high quality DNA obtained from automated DNA extraction systems has increased, especially to those laboratories which are dealing with formalin-fixed paraffin-embedded (FFPE) material and high sample throughput. This study evaluated five automated FFPE DNA extraction systems as well as five DNA quantification systems using the three most common techniques, UV spectrophotometry, fluorescent dye-based quantification and quantitative PCR, on 26 FFPE tissue samples. Additionally, the effects on downstream applications were analysed to find the most suitable pre-analytical methods for massively parallel sequencing in routine diagnostics. The results revealed that the Maxwell 16 from Promega (Mannheim, Germany) seems to be the superior system for DNA extraction from FFPE material. The extracts had a 1.3–24.6-fold higher DNA concentration in comparison to the other extraction systems, a higher quality and were most suitable for downstream applications. The comparison of the five quantification methods showed intermethod variations but all methods could be used to estimate the right amount for PCR amplification and for massively parallel sequencing. Interestingly, the best results in massively parallel sequencing were obtained with a DNA input of 15 ng determined by the NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). No difference could be detected in mutation analysis based on the results of the quantification methods. These findings emphasise, that it is particularly important to choose the most reliable and constant DNA extraction system, especially when using small biopsies and low elution volumes, and that all common DNA quantification techniques can be used for
Directory of Open Access Journals (Sweden)
Carina Heydt
Full Text Available Over the last years, massively parallel sequencing has rapidly evolved and has now transitioned into molecular pathology routine laboratories. It is an attractive platform for analysing multiple genes at the same time with very little input material. Therefore, the need for high quality DNA obtained from automated DNA extraction systems has increased, especially to those laboratories which are dealing with formalin-fixed paraffin-embedded (FFPE material and high sample throughput. This study evaluated five automated FFPE DNA extraction systems as well as five DNA quantification systems using the three most common techniques, UV spectrophotometry, fluorescent dye-based quantification and quantitative PCR, on 26 FFPE tissue samples. Additionally, the effects on downstream applications were analysed to find the most suitable pre-analytical methods for massively parallel sequencing in routine diagnostics. The results revealed that the Maxwell 16 from Promega (Mannheim, Germany seems to be the superior system for DNA extraction from FFPE material. The extracts had a 1.3-24.6-fold higher DNA concentration in comparison to the other extraction systems, a higher quality and were most suitable for downstream applications. The comparison of the five quantification methods showed intermethod variations but all methods could be used to estimate the right amount for PCR amplification and for massively parallel sequencing. Interestingly, the best results in massively parallel sequencing were obtained with a DNA input of 15 ng determined by the NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA. No difference could be detected in mutation analysis based on the results of the quantification methods. These findings emphasise, that it is particularly important to choose the most reliable and constant DNA extraction system, especially when using small biopsies and low elution volumes, and that all common DNA quantification techniques can
Analysis of parallel computing performance of the code MCNP
International Nuclear Information System (INIS)
Wang Lei; Wang Kan; Yu Ganglin
2006-01-01
Parallel computing can reduce the running time of the code MCNP effectively. With the MPI message transmitting software, MCNP5 can achieve its parallel computing on PC cluster with Windows operating system. Parallel computing performance of MCNP is influenced by factors such as the type, the complexity level and the parameter configuration of the computing problem. This paper analyzes the parallel computing performance of MCNP regarding with these factors and gives measures to improve the MCNP parallel computing performance. (authors)
K.I.S.S. Parallel Coding (lecture 2)
CERN. Geneva
2018-01-01
K.I.S.S.ing parallel computing means, finally, loving it. Parallel computing will be approached in a theoretical and experimental way, using the most advanced and used C API: OpenMP. OpenMP is an open source project constantly developed and updated to hide the awful complexity of parallel coding in an awesome interface. The result is a tool which leaves plenty of space for clever solutions and terrific results in terms of efficiency and performance maximisation.
High Efficiency EBCOT with Parallel Coding Architecture for JPEG2000
Directory of Open Access Journals (Sweden)
Chiang Jen-Shiun
2006-01-01
Full Text Available This work presents a parallel context-modeling coding architecture and a matching arithmetic coder (MQ-coder for the embedded block coding (EBCOT unit of the JPEG2000 encoder. Tier-1 of the EBCOT consumes most of the computation time in a JPEG2000 encoding system. The proposed parallel architecture can increase the throughput rate of the context modeling. To match the high throughput rate of the parallel context-modeling architecture, an efficient pipelined architecture for context-based adaptive arithmetic encoder is proposed. This encoder of JPEG2000 can work at 180 MHz to encode one symbol each cycle. Compared with the previous context-modeling architectures, our parallel architectures can improve the throughput rate up to 25%.
Computational chaos in massively parallel neural networks
Barhen, Jacob; Gulati, Sandeep
1989-01-01
A fundamental issue which directly impacts the scalability of current theoretical neural network models to massively parallel embodiments, in both software as well as hardware, is the inherent and unavoidable concurrent asynchronicity of emerging fine-grained computational ensembles and the possible emergence of chaotic manifestations. Previous analyses attributed dynamical instability to the topology of the interconnection matrix, to parasitic components or to propagation delays. However, researchers have observed the existence of emergent computational chaos in a concurrently asynchronous framework, independent of the network topology. Researcher present a methodology enabling the effective asynchronous operation of large-scale neural networks. Necessary and sufficient conditions guaranteeing concurrent asynchronous convergence are established in terms of contracting operators. Lyapunov exponents are computed formally to characterize the underlying nonlinear dynamics. Simulation results are presented to illustrate network convergence to the correct results, even in the presence of large delays.
Simplifying the parallelization of scientific codes by a function-centric approach in Python
International Nuclear Information System (INIS)
Nilsen, Jon K; Cai Xing; Langtangen, Hans Petter; Hoeyland, Bjoern
2010-01-01
The purpose of this paper is to show how existing scientific software can be parallelized using a separate thin layer of Python code where all parallelization-specific tasks are implemented. We provide specific examples of such a Python code layer, which can act as templates for parallelizing a wide set of serial scientific codes. The use of Python for parallelization is motivated by the fact that the language is well suited for reusing existing serial codes programmed in other languages. The extreme flexibility of Python with regard to handling functions makes it very easy to wrap up decomposed computational tasks of a serial scientific application as Python functions. Many parallelization-specific components can be implemented as generic Python functions, which may take as input those wrapped functions that perform concrete computational tasks. The overall programming effort needed by this parallelization approach is limited, and the resulting parallel Python scripts have a compact and clean structure. The usefulness of the parallelization approach is exemplified by three different classes of application in natural and social sciences.
Routing performance analysis and optimization within a massively parallel computer
Archer, Charles Jens; Peters, Amanda; Pinnow, Kurt Walter; Swartz, Brent Allen
2013-04-16
An apparatus, program product and method optimize the operation of a massively parallel computer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.
Just-in-Time Compilation-Inspired Methodology for Parallelization of Compute Intensive Java Code
Directory of Open Access Journals (Sweden)
GHULAM MUSTAFA
2017-01-01
Full Text Available Compute intensive programs generally consume significant fraction of execution time in a small amount of repetitive code. Such repetitive code is commonly known as hotspot code. We observed that compute intensive hotspots often possess exploitable loop level parallelism. A JIT (Just-in-Time compiler profiles a running program to identify its hotspots. Hotspots are then translated into native code, for efficient execution. Using similar approach, we propose a methodology to identify hotspots and exploit their parallelization potential on multicore systems. Proposed methodology selects and parallelizes each DOALL loop that is either contained in a hotspot method or calls a hotspot method. The methodology could be integrated in front-end of a JIT compiler to parallelize sequential code, just before native translation. However, compilation to native code is out of scope of this work. As a case study, we analyze eighteen JGF (Java Grande Forum benchmarks to determine parallelization potential of hotspots. Eight benchmarks demonstrate a speedup of up to 7.6x on an 8-core system
Just-in-time compilation-inspired methodology for parallelization of compute intensive java code
International Nuclear Information System (INIS)
Mustafa, G.; Ghani, M.U.
2017-01-01
Compute intensive programs generally consume significant fraction of execution time in a small amount of repetitive code. Such repetitive code is commonly known as hotspot code. We observed that compute intensive hotspots often possess exploitable loop level parallelism. A JIT (Just-in-Time) compiler profiles a running program to identify its hotspots. Hotspots are then translated into native code, for efficient execution. Using similar approach, we propose a methodology to identify hotspots and exploit their parallelization potential on multicore systems. Proposed methodology selects and parallelizes each DOALL loop that is either contained in a hotspot method or calls a hotspot method. The methodology could be integrated in front-end of a JIT compiler to parallelize sequential code, just before native translation. However, compilation to native code is out of scope of this work. As a case study, we analyze eighteen JGF (Java Grande Forum) benchmarks to determine parallelization potential of hotspots. Eight benchmarks demonstrate a speedup of up to 7.6x on an 8-core system. (author)
Representing and computing regular languages on massively parallel networks
Energy Technology Data Exchange (ETDEWEB)
Miller, M.I.; O' Sullivan, J.A. (Electronic Systems and Research Lab., of Electrical Engineering, Washington Univ., St. Louis, MO (US)); Boysam, B. (Dept. of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Inst., Troy, NY (US)); Smith, K.R. (Dept. of Electrical Engineering, Southern Illinois Univ., Edwardsville, IL (US))
1991-01-01
This paper proposes a general method for incorporating rule-based constraints corresponding to regular languages into stochastic inference problems, thereby allowing for a unified representation of stochastic and syntactic pattern constraints. The authors' approach first established the formal connection of rules to Chomsky grammars, and generalizes the original work of Shannon on the encoding of rule-based channel sequences to Markov chains of maximum entropy. This maximum entropy probabilistic view leads to Gibb's representations with potentials which have their number of minima growing at precisely the exponential rate that the language of deterministically constrained sequences grow. These representations are coupled to stochastic diffusion algorithms, which sample the language-constrained sequences by visiting the energy minima according to the underlying Gibbs' probability law. The coupling to stochastic search methods yields the all-important practical result that fully parallel stochastic cellular automata may be derived to generate samples from the rule-based constraint sets. The production rules and neighborhood state structure of the language of sequences directly determines the necessary connection structures of the required parallel computing surface. Representations of this type have been mapped to the DAP-510 massively-parallel processor consisting of 1024 mesh-connected bit-serial processing elements for performing automated segmentation of electron-micrograph images.
Intelligent trigger by massively parallel processors for high energy physics experiments
International Nuclear Information System (INIS)
Rohrbach, F.; Vesztergombi, G.
1992-01-01
The CERN-MPPC collaboration concentrates its effort on the development of machines based on massive parallelism with thousands of integrated processing elements, arranged in a string. Seven applications are under detailed studies within the collaboration: three for LHC, one for SSC, two for fixed target high energy physics at CERN and one for HDTV. Preliminary results are presented. They show that the objectives should be reached with the use of the ASP architecture. (author)
Parallel thermal radiation transport in two dimensions
International Nuclear Information System (INIS)
Smedley-Stevenson, R.P.; Ball, S.R.
2003-01-01
This paper describes the distributed memory parallel implementation of a deterministic thermal radiation transport algorithm in a 2-dimensional ALE hydrodynamics code. The parallel algorithm consists of a variety of components which are combined in order to produce a state of the art computational capability, capable of solving large thermal radiation transport problems using Blue-Oak, the 3 Tera-Flop MPP (massive parallel processors) computing facility at AWE (United Kingdom). Particular aspects of the parallel algorithm are described together with examples of the performance on some challenging applications. (author)
Parallel thermal radiation transport in two dimensions
Energy Technology Data Exchange (ETDEWEB)
Smedley-Stevenson, R.P.; Ball, S.R. [AWE Aldermaston (United Kingdom)
2003-07-01
This paper describes the distributed memory parallel implementation of a deterministic thermal radiation transport algorithm in a 2-dimensional ALE hydrodynamics code. The parallel algorithm consists of a variety of components which are combined in order to produce a state of the art computational capability, capable of solving large thermal radiation transport problems using Blue-Oak, the 3 Tera-Flop MPP (massive parallel processors) computing facility at AWE (United Kingdom). Particular aspects of the parallel algorithm are described together with examples of the performance on some challenging applications. (author)
Energy Technology Data Exchange (ETDEWEB)
Bauerle, Matthew [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2014-08-01
This project utilizes Graphics Processing Units (GPUs) to compute radiograph simulations for arbitrary objects. The generation of radiographs, also known as the forward projection imaging model, is computationally intensive and not widely utilized. The goal of this research is to develop a massively parallel algorithm that can compute forward projections for objects with a trillion voxels (3D pixels). To achieve this end, the data are divided into blocks that can each t into GPU memory. The forward projected image is also divided into segments to allow for future parallelization and to avoid needless computations.
Massively parallel fabrication of repetitive nanostructures: nanolithography for nanoarrays
International Nuclear Information System (INIS)
Luttge, Regina
2009-01-01
This topical review provides an overview of nanolithographic techniques for nanoarrays. Using patterning techniques such as lithography, normally we aim for a higher order architecture similarly to functional systems in nature. Inspired by the wealth of complexity in nature, these architectures are translated into technical devices, for example, found in integrated circuitry or other systems in which structural elements work as discrete building blocks in microdevices. Ordered artificial nanostructures (arrays of pillars, holes and wires) have shown particular properties and bring about the opportunity to modify and tune the device operation. Moreover, these nanostructures deliver new applications, for example, the nanoscale control of spin direction within a nanomagnet. Subsequently, we can look for applications where this unique property of the smallest manufactured element is repetitively used such as, for example with respect to spin, in nanopatterned magnetic media for data storage. These nanostructures are generally called nanoarrays. Most of these applications require massively parallel produced nanopatterns which can be directly realized by laser interference (areas up to 4 cm 2 are easily achieved with a Lloyd's mirror set-up). In this topical review we will further highlight the application of laser interference as a tool for nanofabrication, its limitations and ultimate advantages towards a variety of devices including nanostructuring for photonic crystal devices, high resolution patterned media and surface modifications of medical implants. The unique properties of nanostructured surfaces have also found applications in biomedical nanoarrays used either for diagnostic or functional assays including catalytic reactions on chip. Bio-inspired templated nanoarrays will be presented in perspective to other massively parallel nanolithography techniques currently discussed in the scientific literature. (topical review)
GRADSPMHD: A parallel MHD code based on the SPH formalism
Vanaverbeke, S.; Keppens, R.; Poedts, S.
2014-03-01
We present GRADSPMHD, a completely Lagrangian parallel magnetohydrodynamics code based on the SPH formalism. The implementation of the equations of SPMHD in the “GRAD-h” formalism assembles known results, including the derivation of the discretized MHD equations from a variational principle, the inclusion of time-dependent artificial viscosity, resistivity and conductivity terms, as well as the inclusion of a mixed hyperbolic/parabolic correction scheme for satisfying the ∇ṡB→ constraint on the magnetic field. The code uses a tree-based formalism for neighbor finding and can optionally use the tree code for computing the self-gravity of the plasma. The structure of the code closely follows the framework of our parallel GRADSPH FORTRAN 90 code which we added previously to the CPC program library. We demonstrate the capabilities of GRADSPMHD by running 1, 2, and 3 dimensional standard benchmark tests and we find good agreement with previous work done by other researchers. The code is also applied to the problem of simulating the magnetorotational instability in 2.5D shearing box tests as well as in global simulations of magnetized accretion disks. We find good agreement with available results on this subject in the literature. Finally, we discuss the performance of the code on a parallel supercomputer with distributed memory architecture. Catalogue identifier: AERP_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AERP_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 620503 No. of bytes in distributed program, including test data, etc.: 19837671 Distribution format: tar.gz Programming language: FORTRAN 90/MPI. Computer: HPC cluster. Operating system: Unix. Has the code been vectorized or parallelized?: Yes, parallelized using MPI. RAM: ˜30 MB for a
Optimization of Particle-in-Cell Codes on RISC Processors
Decyk, Viktor K.; Karmesin, Steve Roy; Boer, Aeint de; Liewer, Paulette C.
1996-01-01
General strategies are developed to optimize particle-cell-codes written in Fortran for RISC processors which are commonly used on massively parallel computers. These strategies include data reorganization to improve cache utilization and code reorganization to improve efficiency of arithmetic pipelines.
Vectorization, parallelization and porting of nuclear codes (porting). Progress report fiscal 1999
Energy Technology Data Exchange (ETDEWEB)
Kawasaki, Nobuo; Nemoto, Toshiyuki; Kawai, Wataru; Ishizuki, Shigeru [Fujitsu Ltd., Tokyo (Japan); Ogasawara, Shinobu; Kume, Etsuo; Adachi, Masaaki [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment; Yatake, Yo-ichi [Hitachi Ltd., Tokyo (Japan)
2001-01-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system, the AP3000 system, the SX-4 system and the Paragon system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. We dealt with 18 codes in fiscal 1999. These results are reported in 3 parts, i.e., the vectorization and the parallelization part on vector processors, the parallelization port on scalar processors and the porting part. In this report, we describe the porting. In this porting part, the porting of Assisted Model Building with Energy Refinement code version 5 (AMBER5), general purpose Monte Carlo codes far neutron and photon transport calculations based on continuous energy and multigroup methods (MVP/GMVP), automatic editing system for MCNP library code (autonj), neutron damage calculations for materials irradiations and neutron damage calculations for compounds code (SPECTER/SPECOMP), severe accident analysis code (MELCOR) and COolant Boiling in Rod Arrays, Two-Fluid code (COBRA-TF) on the VPP500 system and/or the AP3000 system are described. (author)
Massively parallel DNA sequencing facilitates diagnosis of patients with Usher syndrome type 1.
Directory of Open Access Journals (Sweden)
Hidekane Yoshimura
Full Text Available Usher syndrome is an autosomal recessive disorder manifesting hearing loss, retinitis pigmentosa and vestibular dysfunction, and having three clinical subtypes. Usher syndrome type 1 is the most severe subtype due to its profound hearing loss, lack of vestibular responses, and retinitis pigmentosa that appears in prepuberty. Six of the corresponding genes have been identified, making early diagnosis through DNA testing possible, with many immediate and several long-term advantages for patients and their families. However, the conventional genetic techniques, such as direct sequence analysis, are both time-consuming and expensive. Targeted exon sequencing of selected genes using the massively parallel DNA sequencing technology will potentially enable us to systematically tackle previously intractable monogenic disorders and improve molecular diagnosis. Using this technique combined with direct sequence analysis, we screened 17 unrelated Usher syndrome type 1 patients and detected probable pathogenic variants in the 16 of them (94.1% who carried at least one mutation. Seven patients had the MYO7A mutation (41.2%, which is the most common type in Japanese. Most of the mutations were detected by only the massively parallel DNA sequencing. We report here four patients, who had probable pathogenic mutations in two different Usher syndrome type 1 genes, and one case of MYO7A/PCDH15 digenic inheritance. This is the first report of Usher syndrome mutation analysis using massively parallel DNA sequencing and the frequency of Usher syndrome type 1 genes in Japanese. Mutation screening using this technique has the power to quickly identify mutations of many causative genes while maintaining cost-benefit performance. In addition, the simultaneous mutation analysis of large numbers of genes is useful for detecting mutations in different genes that are possibly disease modifiers or of digenic inheritance.
Massively parallel DNA sequencing facilitates diagnosis of patients with Usher syndrome type 1.
Yoshimura, Hidekane; Iwasaki, Satoshi; Nishio, Shin-Ya; Kumakawa, Kozo; Tono, Tetsuya; Kobayashi, Yumiko; Sato, Hiroaki; Nagai, Kyoko; Ishikawa, Kotaro; Ikezono, Tetsuo; Naito, Yasushi; Fukushima, Kunihiro; Oshikawa, Chie; Kimitsuki, Takashi; Nakanishi, Hiroshi; Usami, Shin-Ichi
2014-01-01
Usher syndrome is an autosomal recessive disorder manifesting hearing loss, retinitis pigmentosa and vestibular dysfunction, and having three clinical subtypes. Usher syndrome type 1 is the most severe subtype due to its profound hearing loss, lack of vestibular responses, and retinitis pigmentosa that appears in prepuberty. Six of the corresponding genes have been identified, making early diagnosis through DNA testing possible, with many immediate and several long-term advantages for patients and their families. However, the conventional genetic techniques, such as direct sequence analysis, are both time-consuming and expensive. Targeted exon sequencing of selected genes using the massively parallel DNA sequencing technology will potentially enable us to systematically tackle previously intractable monogenic disorders and improve molecular diagnosis. Using this technique combined with direct sequence analysis, we screened 17 unrelated Usher syndrome type 1 patients and detected probable pathogenic variants in the 16 of them (94.1%) who carried at least one mutation. Seven patients had the MYO7A mutation (41.2%), which is the most common type in Japanese. Most of the mutations were detected by only the massively parallel DNA sequencing. We report here four patients, who had probable pathogenic mutations in two different Usher syndrome type 1 genes, and one case of MYO7A/PCDH15 digenic inheritance. This is the first report of Usher syndrome mutation analysis using massively parallel DNA sequencing and the frequency of Usher syndrome type 1 genes in Japanese. Mutation screening using this technique has the power to quickly identify mutations of many causative genes while maintaining cost-benefit performance. In addition, the simultaneous mutation analysis of large numbers of genes is useful for detecting mutations in different genes that are possibly disease modifiers or of digenic inheritance.
Recent activities in accelerator code development
International Nuclear Information System (INIS)
Copper, R.K.; Ryne, R.D.
1992-01-01
In this paper we will review recent activities in the area of code development as it affects the accelerator community. We will first discuss the changing computing environment. We will review how the computing environment has changed in the last 10 years, with emphasis on computing power, operating systems, computer languages, graphics standards, and massively parallel processing. Then we will discuss recent code development activities in the areas of electromagnetics codes and beam dynamics codes
Parallelization of a three-dimensional whole core transport code DeCART
Energy Technology Data Exchange (ETDEWEB)
Jin Young, Cho; Han Gyu, Joo; Ha Yong, Kim; Moon-Hee, Chang [Korea Atomic Energy Research Institute, Yuseong-gu, Daejon (Korea, Republic of)
2003-07-01
Parallelization of the DeCART (deterministic core analysis based on ray tracing) code is presented that reduces the computational burden of the tremendous computing time and memory required in three-dimensional whole core transport calculations. The parallelization employs the concept of MPI grouping and the MPI/OpenMP mixed scheme as well. Since most of the computing time and memory are used in MOC (method of characteristics) and the multi-group CMFD (coarse mesh finite difference) calculation in DeCART, variables and subroutines related to these two modules are the primary targets for parallelization. Specifically, the ray tracing module was parallelized using a planar domain decomposition scheme and an angular domain decomposition scheme. The parallel performance of the DeCART code is evaluated by solving a rodded variation of the C5G7MOX three dimensional benchmark problem and a simplified three-dimensional SMART PWR core problem. In C5G7MOX problem with 24 CPUs, a speedup of maximum 21 is obtained on an IBM Regatta machine and 22 on a LINUX Cluster in the MOC kernel, which indicates good parallel performance of the DeCART code. In the simplified SMART problem, the memory requirement of about 11 GBytes in the single processor cases reduces to 940 Mbytes with 24 processors, which means that the DeCART code can now solve large core problems with affordable LINUX clusters. (authors)
Parallelization characteristics of a three-dimensional whole-core code DeCART
International Nuclear Information System (INIS)
Cho, J. Y.; Joo, H.K.; Kim, H. Y.; Lee, J. C.; Jang, M. H.
2003-01-01
Neutron transport calculation for three-dimensional amount of computing time but also huge memory. Therefore, whole-core codes such as DeCART need both also parallel computation and distributed memory capabilities. This paper is to implement such parallel capabilities based on MPI grouping and memory distribution on the DeCART code, and then to evaluate the performance by solving the C5G7 three-dimensional benchmark and a simplified three-dimensional SMART core problem. In C5G7 problem with 24 CPUs, a speedup of maximum 22 is obtained on IBM regatta machine and 21 on a LINUX cluster for the MOC kernel, which indicates good parallel performance of the DeCART code. The simplified SMART problem which need about 11 GBytes memory with one processors requires about 940 MBytes, which means that the DeCART code can now solve large core problems on affordable LINUX clusters
Parallelization of MCNP4 code by using simple FORTRAN algorithms
International Nuclear Information System (INIS)
Yazid, P.I.; Takano, Makoto; Masukawa, Fumihiro; Naito, Yoshitaka.
1993-12-01
Simple FORTRAN algorithms, that rely only on open, close, read and write statements, together with disk files and some UNIX commands have been applied to parallelization of MCNP4. The code, named MCNPNFS, maintains almost all capabilities of MCNP4 in solving shielding problems. It is able to perform parallel computing on a set of any UNIX workstations connected by a network, regardless of the heterogeneity in hardware system, provided that all processors produce a binary file in the same format. Further, it is confirmed that MCNPNFS can be executed also on Monte-4 vector-parallel computer. MCNPNFS has been tested intensively by executing 5 photon-neutron benchmark problems, a spent fuel cask problem and 17 sample problems included in the original code package of MCNP4. Three different workstations, connected by a network, have been used to execute MCNPNFS in parallel. By measuring CPU time, the parallel efficiency is determined to be 58% to 99% and 86% in average. On Monte-4, MCNPNFS has been executed using 4 processors concurrently and has achieved the parallel efficiency of 79% in average. (author)
Enhanced memory architecture for massively parallel vision chip
Chen, Zhe; Yang, Jie; Liu, Liyuan; Wu, Nanjian
2015-04-01
Local memory architecture plays an important role in high performance massively parallel vision chip. In this paper, we propose an enhanced memory architecture with compact circuit area designed in a full-custom flow. The memory consists of separate master-stage static latches and shared slave-stage dynamic latches. We use split transmission transistors on the input data path to enhance tolerance for charge sharing and to achieve random read/write capabilities. The memory is designed in a 0.18 μm CMOS process. The area overhead of the memory achieves 16.6 μm2/bit. Simulation results show that the maximum operating frequency reaches 410 MHz and the corresponding peak dynamic power consumption for a 64-bit memory unit is 190 μW under 1.8 V supply voltage.
PUMA: An Operating System for Massively Parallel Systems
Directory of Open Access Journals (Sweden)
Stephen R. Wheat
1994-01-01
Full Text Available This article presents an overview of PUMA (Performance-oriented, User-managed Messaging Architecture, a message-passing kernel for massively parallel systems. Message passing in PUMA is based on portals – an opening in the address space of an application process. Once an application process has established a portal, other processes can write values into the portal using a simple send operation. Because messages are written directly into the address space of the receiving process, there is no need to buffer messages in the PUMA kernel and later copy them into the applications address space. PUMA consists of two components: the quintessential kernel (Q-Kernel and the process control thread (PCT. Although the PCT provides management decisions, the Q-Kernel controls access and implements the policies specified by the PCT.
Control rod drop transient analysis with the coupled parallel code pCTF-PARCSv2.7
International Nuclear Information System (INIS)
Ramos, Enrique; Roman, Jose E.; Abarca, Agustín; Miró, Rafael; Bermejo, Juan A.
2016-01-01
Highlights: • An MPI parallel version of the thermal–hydraulic subchannel code COBRA-TF has been developed. • The parallel code has been coupled to the 3D neutron diffusion code PARCSv2.7. • The new codes are validated with a control rod drop transient. - Abstract: In order to reduce the response time when simulating large reactors in detail, a parallel version of the thermal–hydraulic subchannel code COBRA-TF (CTF) has been developed using the standard Message Passing Interface (MPI). The parallelization is oriented to reactor cells, so it is best suited for models consisting of many cells. The generation of the Jacobian matrix is parallelized, in such a way that each processor is in charge of generating the data associated with a subset of cells. Also, the solution of the linear system of equations is done in parallel, using the PETSc toolkit. With the goal of creating a powerful tool to simulate the reactor core behavior during asymmetrical transients, the 3D neutron diffusion code PARCSv2.7 (PARCS) has been coupled with the parallel version of CTF (pCTF) using the Parallel Virtual Machine (PVM) technology. In order to validate the correctness of the parallel coupled code, a control rod drop transient has been simulated comparing the results with the real experimental measures acquired during an NPP real test.
Development of parallel benchmark code by sheet metal forming simulator 'ITAS'
International Nuclear Information System (INIS)
Watanabe, Hiroshi; Suzuki, Shintaro; Minami, Kazuo
1999-03-01
This report describes the development of parallel benchmark code by sheet metal forming simulator 'ITAS'. ITAS is a nonlinear elasto-plastic analysis program by the finite element method for the purpose of the simulation of sheet metal forming. ITAS adopts the dynamic analysis method that computes displacement of sheet metal at every time unit and utilizes the implicit method with the direct linear equation solver. Therefore the simulator is very robust. However, it requires a lot of computational time and memory capacity. In the development of the parallel benchmark code, we designed the code by MPI programming to reduce the computational time. In numerical experiments on the five kinds of parallel super computers at CCSE JAERI, i.e., SP2, SR2201, SX-4, T94 and VPP300, good performances are observed. The result will be shown to the public through WWW so that the benchmark results may become a guideline of research and development of the parallel program. (author)
Matthew Parks; Richard Cronn; Aaron Liston
2009-01-01
We reconstruct the infrageneric phylogeny of Pinus from 37 nearly-complete chloroplast genomes (average 109 kilobases each of an approximately 120 kilobase genome) generated using multiplexed massively parallel sequencing. We found that 30/33 ingroup nodes resolved wlth > 95-percent bootstrap support; this is a substantial improvement relative...
ANNarchy: a code generation approach to neural simulations on parallel hardware
Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.
2015-01-01
Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which allows to easily define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into an efficient C++code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957
MCBooster: a tool for MC generation for massively parallel platforms
Alves Junior, Antonio Augusto
2016-01-01
MCBooster is a header-only, C++11-compliant library for the generation of large samples of phase-space Monte Carlo events on massively parallel platforms. It was released on GitHub in the spring of 2016. The library core algorithms implement the Raubold-Lynch method; they are able to generate the full kinematics of decays with up to nine particles in the final state. The library supports the generation of sequential decays as well as the parallel evaluation of arbitrary functions over the generated events. The output of MCBooster completely accords with popular and well-tested software packages such as GENBOD (W515 from CERNLIB) and TGenPhaseSpace from the ROOT framework. MCBooster is developed on top of the Thrust library and runs on Linux systems. It deploys transparently on NVidia CUDA-enabled GPUs as well as multicore CPUs. This contribution summarizes the main features of MCBooster. A basic description of the user interface and some examples of applications are provided, along with measurements of perfor...
Hosokawa, Masahito; Nishikawa, Yohei; Kogawa, Masato; Takeyama, Haruko
2017-07-12
Massively parallel single-cell genome sequencing is required to further understand genetic diversities in complex biological systems. Whole genome amplification (WGA) is the first step for single-cell sequencing, but its throughput and accuracy are insufficient in conventional reaction platforms. Here, we introduce single droplet multiple displacement amplification (sd-MDA), a method that enables massively parallel amplification of single cell genomes while maintaining sequence accuracy and specificity. Tens of thousands of single cells are compartmentalized in millions of picoliter droplets and then subjected to lysis and WGA by passive droplet fusion in microfluidic channels. Because single cells are isolated in compartments, their genomes are amplified to saturation without contamination. This enables the high-throughput acquisition of contamination-free and cell specific sequence reads from single cells (21,000 single-cells/h), resulting in enhancement of the sequence data quality compared to conventional methods. This method allowed WGA of both single bacterial cells and human cancer cells. The obtained sequencing coverage rivals those of conventional techniques with superior sequence quality. In addition, we also demonstrate de novo assembly of uncultured soil bacteria and obtain draft genomes from single cell sequencing. This sd-MDA is promising for flexible and scalable use in single-cell sequencing.
Highly parallel line-based image coding for many cores.
Peng, Xiulian; Xu, Jizheng; Zhou, You; Wu, Feng
2012-01-01
Computers are developing along with a new trend from the dual-core and quad-core processors to ones with tens or even hundreds of cores. Multimedia, as one of the most important applications in computers, has an urgent need to design parallel coding algorithms for compression. Taking intraframe/image coding as a start point, this paper proposes a pure line-by-line coding scheme (LBLC) to meet the need. In LBLC, an input image is processed line by line sequentially, and each line is divided into small fixed-length segments. The compression of all segments from prediction to entropy coding is completely independent and concurrent at many cores. Results on a general-purpose computer show that our scheme can get a 13.9 times speedup with 15 cores at the encoder and a 10.3 times speedup at the decoder. Ideally, such near-linear speeding relation with the number of cores can be kept for more than 100 cores. In addition to the high parallelism, the proposed scheme can perform comparatively or even better than the H.264 high profile above middle bit rates. At near-lossless coding, it outperforms H.264 more than 10 dB. At lossless coding, up to 14% bit-rate reduction is observed compared with H.264 lossless coding at the high 4:4:4 profile.
Block iterative restoration of astronomical images with the massively parallel processor
International Nuclear Information System (INIS)
Heap, S.R.; Lindler, D.J.
1987-01-01
A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images
Energy Technology Data Exchange (ETDEWEB)
Takemiya, Hiroshi; Ohta, Hirofumi; Honma, Ichirou
1996-03-01
The parallelization of Electro-Magnetic Cascade Monte Carlo Simulation Code, EGS4 on distributed memory scalar parallel computer: Intel Paragon XP/S15-256 is described. EGS4 has the feature that calculation time for one incident particle is quite different from each other because of the dynamic generation of secondary particles and different behavior of each particle. Granularity for parallel processing, parallel programming model and the algorithm of parallel random number generation are discussed and two kinds of method, each of which allocates particles dynamically or statically, are used for the purpose of realizing high speed parallel processing of this code. Among four problems chosen for performance evaluation, the speedup factors for three problems have been attained to nearly 100 times with 128 processor. It has been found that when both the calculation time for each incident particles and its dispersion are large, it is preferable to use dynamic particle allocation method which can average the load for each processor. And it has also been found that when they are small, it is preferable to use static particle allocation method which reduces the communication overhead. Moreover, it is pointed out that to get the result accurately, it is necessary to use double precision variables in EGS4 code. Finally, the workflow of program parallelization is analyzed and tools for program parallelization through the experience of the EGS4 parallelization are discussed. (author).
International Nuclear Information System (INIS)
Taraglio, S.; Massaioli, F.
1995-08-01
A parallel implementation of a library to build and train Multi Layer Perceptrons via the Back Propagation algorithm is presented. The target machine is the SIMD massively parallel supercomputer Quadrics. Performance measures are provided on three different machines with different number of processors, for two network examples. A sample source code is given
Scalable Parallel Distributed Coprocessor System for Graph Searching Problems with Massive Data
Directory of Open Access Journals (Sweden)
Wanrong Huang
2017-01-01
Full Text Available The Internet applications, such as network searching, electronic commerce, and modern medical applications, produce and process massive data. Considerable data parallelism exists in computation processes of data-intensive applications. A traversal algorithm, breadth-first search (BFS, is fundamental in many graph processing applications and metrics when a graph grows in scale. A variety of scientific programming methods have been proposed for accelerating and parallelizing BFS because of the poor temporal and spatial locality caused by inherent irregular memory access patterns. However, new parallel hardware could provide better improvement for scientific methods. To address small-world graph problems, we propose a scalable and novel field-programmable gate array-based heterogeneous multicore system for scientific programming. The core is multithread for streaming processing. And the communication network InfiniBand is adopted for scalability. We design a binary search algorithm to address mapping to unify all processor addresses. Within the limits permitted by the Graph500 test bench after 1D parallel hybrid BFS algorithm testing, our 8-core and 8-thread-per-core system achieved superior performance and efficiency compared with the prior work under the same degree of parallelism. Our system is efficient not as a special acceleration unit but as a processor platform that deals with graph searching applications.
Roever, Stefan
2012-01-01
A massively parallel, low cost molecular analysis platform will dramatically change the nature of protein, molecular and genomics research, DNA sequencing, and ultimately, molecular diagnostics. An integrated circuit (IC) with 264 sensors was fabricated using standard CMOS semiconductor processing technology. Each of these sensors is individually controlled with precision analog circuitry and is capable of single molecule measurements. Under electronic and software control, the IC was used to demonstrate the feasibility of creating and detecting lipid bilayers and biological nanopores using wild type α-hemolysin. The ability to dynamically create bilayers over each of the sensors will greatly accelerate pore development and pore mutation analysis. In addition, the noise performance of the IC was measured to be 30fA(rms). With this noise performance, single base detection of DNA was demonstrated using α-hemolysin. The data shows that a single molecule, electrical detection platform using biological nanopores can be operationalized and can ultimately scale to millions of sensors. Such a massively parallel platform will revolutionize molecular analysis and will completely change the field of molecular diagnostics in the future.
Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems
Martina-Cezara Albutiu, Alfons Kemper, Thomas Neumann
2012-01-01
Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research ...
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
Load balancing in highly parallel processing of Monte Carlo code for particle transport
International Nuclear Information System (INIS)
Higuchi, Kenji; Takemiya, Hiroshi; Kawasaki, Takuji
1998-01-01
In parallel processing of Monte Carlo (MC) codes for neutron, photon and electron transport problems, particle histories are assigned to processors making use of independency of the calculation for each particle. Although we can easily parallelize main part of a MC code by this method, it is necessary and practically difficult to optimize the code concerning load balancing in order to attain high speedup ratio in highly parallel processing. In fact, the speedup ratio in the case of 128 processors remains in nearly one hundred times when using the test bed for the performance evaluation. Through the parallel processing of the MCNP code, which is widely used in the nuclear field, it is shown that it is difficult to attain high performance by static load balancing in especially neutron transport problems, and a load balancing method, which dynamically changes the number of assigned particles minimizing the sum of the computational and communication costs, overcomes the difficulty, resulting in nearly fifteen percentage of reduction for execution time. (author)
Development of a parallelization strategy for the VARIANT code
International Nuclear Information System (INIS)
Hanebutte, U.R.; Khalil, H.S.; Palmiotti, G.; Tatsumi, M.
1996-01-01
The VARIANT code solves the multigroup steady-state neutron diffusion and transport equation in three-dimensional Cartesian and hexagonal geometries using the variational nodal method. VARIANT consists of four major parts that must be executed sequentially: input handling, calculation of response matrices, solution algorithm (i.e. inner-outer iteration), and output of results. The objective of the parallelization effort was to reduce the overall computing time by distributing the work of the two computationally intensive (sequential) tasks, the coupling coefficient calculation and the iterative solver, equally among a group of processors. This report describes the code's calculations and gives performance results on one of the benchmark problems used to test the code. The performance analysis in the IBM SPx system shows good efficiency for well-load-balanced programs. Even for relatively small problem sizes, respectable efficiencies are seen for the SPx. An extension to achieve a higher degree of parallelism will be addressed in future work. 7 refs., 1 tab
Massive Parallelism of Monte-Carlo Simulation on Low-End Hardware using Graphic Processing Units
International Nuclear Information System (INIS)
Mburu, Joe Mwangi; Hah, Chang Joo Hah
2014-01-01
Within the past decade, research has been done on utilizing GPU massive parallelization in core simulation with impressive results but unfortunately, not much commercial application has been done in the nuclear field especially in reactor core simulation. The purpose of this paper is to give an introductory concept on the topic and illustrate the potential of exploiting the massive parallel nature of GPU computing on a simple monte-carlo simulation with very minimal hardware specifications. To do a comparative analysis, a simple two dimension monte-carlo simulation is implemented for both the CPU and GPU in order to evaluate performance gain based on the computing devices. The heterogeneous platform utilized in this analysis is done on a slow notebook with only 1GHz processor. The end results are quite surprising whereby high speedups obtained are almost a factor of 10. In this work, we have utilized heterogeneous computing in a GPU-based approach in applying potential high arithmetic intensive calculation. By applying a complex monte-carlo simulation on GPU platform, we have speed up the computational process by almost a factor of 10 based on one million neutrons. This shows how easy, cheap and efficient it is in using GPU in accelerating scientific computing and the results should encourage in exploring further this avenue especially in nuclear reactor physics simulation where deterministic and stochastic calculations are quite favourable in parallelization
Massive Parallelism of Monte-Carlo Simulation on Low-End Hardware using Graphic Processing Units
Energy Technology Data Exchange (ETDEWEB)
Mburu, Joe Mwangi; Hah, Chang Joo Hah [KEPCO International Nuclear Graduate School, Ulsan (Korea, Republic of)
2014-05-15
Within the past decade, research has been done on utilizing GPU massive parallelization in core simulation with impressive results but unfortunately, not much commercial application has been done in the nuclear field especially in reactor core simulation. The purpose of this paper is to give an introductory concept on the topic and illustrate the potential of exploiting the massive parallel nature of GPU computing on a simple monte-carlo simulation with very minimal hardware specifications. To do a comparative analysis, a simple two dimension monte-carlo simulation is implemented for both the CPU and GPU in order to evaluate performance gain based on the computing devices. The heterogeneous platform utilized in this analysis is done on a slow notebook with only 1GHz processor. The end results are quite surprising whereby high speedups obtained are almost a factor of 10. In this work, we have utilized heterogeneous computing in a GPU-based approach in applying potential high arithmetic intensive calculation. By applying a complex monte-carlo simulation on GPU platform, we have speed up the computational process by almost a factor of 10 based on one million neutrons. This shows how easy, cheap and efficient it is in using GPU in accelerating scientific computing and the results should encourage in exploring further this avenue especially in nuclear reactor physics simulation where deterministic and stochastic calculations are quite favourable in parallelization.
Automatic Parallelization Tool: Classification of Program Code for Parallel Computing
Directory of Open Access Journals (Sweden)
Mustafa Basthikodi
2016-04-01
Full Text Available Performance growth of single-core processors has come to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks along with Graphical Processing Units empowered to enhance parallelism broadly. Couples of compilers are updated to developing challenges forsynchronization and threading issues. Appropriate program and algorithm classifications will have advantage to a great extent to the group of software engineers to get opportunities for effective parallelization. In present work we investigated current species for classification of algorithms, in that related work on classification is discussed along with the comparison of issues that challenges the classification. The set of algorithms are chosen which matches the structure with different issues and perform given task. We have tested these algorithms utilizing existing automatic species extraction toolsalong with Bones compiler. We have added functionalities to existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by original speciesof algorithms. We executed new theories into the device, empowering automatic characterization of program code.
Parallel plasma fluid turbulence calculations
International Nuclear Information System (INIS)
Leboeuf, J.N.; Carreras, B.A.; Charlton, L.A.; Drake, J.B.; Lynch, V.E.; Newman, D.E.; Sidikman, K.L.; Spong, D.A.
1994-01-01
The study of plasma turbulence and transport is a complex problem of critical importance for fusion-relevant plasmas. To this day, the fluid treatment of plasma dynamics is the best approach to realistic physics at the high resolution required for certain experimentally relevant calculations. Core and edge turbulence in a magnetic fusion device have been modeled using state-of-the-art, nonlinear, three-dimensional, initial-value fluid and gyrofluid codes. Parallel implementation of these models on diverse platforms--vector parallel (National Energy Research Supercomputer Center's CRAY Y-MP C90), massively parallel (Intel Paragon XP/S 35), and serial parallel (clusters of high-performance workstations using the Parallel Virtual Machine protocol)--offers a variety of paths to high resolution and significant improvements in real-time efficiency, each with its own advantages. The largest and most efficient calculations have been performed at the 200 Mword memory limit on the C90 in dedicated mode, where an overlap of 12 to 13 out of a maximum of 16 processors has been achieved with a gyrofluid model of core fluctuations. The richness of the physics captured by these calculations is commensurate with the increased resolution and efficiency and is limited only by the ingenuity brought to the analysis of the massive amounts of data generated
Massively Parallel Dimension Independent Adaptive Metropolis
Chen, Yuxin
2015-05-14
This work considers black-box Bayesian inference over high-dimensional parameter spaces. The well-known and widely respected adaptive Metropolis (AM) algorithm is extended herein to asymptotically scale uniformly with respect to the underlying parameter dimension, by respecting the variance, for Gaussian targets. The result- ing algorithm, referred to as the dimension-independent adaptive Metropolis (DIAM) algorithm, also shows improved performance with respect to adaptive Metropolis on non-Gaussian targets. This algorithm is further improved, and the possibility of probing high-dimensional targets is enabled, via GPU-accelerated numerical libraries and periodically synchronized concurrent chains (justified a posteriori). Asymptoti- cally in dimension, this massively parallel dimension-independent adaptive Metropolis (MPDIAM) GPU implementation exhibits a factor of four improvement versus the CPU-based Intel MKL version alone, which is itself already a factor of three improve- ment versus the serial version. The scaling to multiple CPUs and GPUs exhibits a form of strong scaling in terms of the time necessary to reach a certain convergence criterion, through a combination of longer time per sample batch (weak scaling) and yet fewer necessary samples to convergence. This is illustrated by e ciently sampling from several Gaussian and non-Gaussian targets for dimension d 1000.
Parallelization of a beam dynamics code and first large scale radio frequency quadrupole simulations
Directory of Open Access Journals (Sweden)
J. Xu
2007-01-01
Full Text Available The design and operation support of hadron (proton and heavy-ion linear accelerators require substantial use of beam dynamics simulation tools. The beam dynamics code TRACK has been originally developed at Argonne National Laboratory (ANL to fulfill the special requirements of the rare isotope accelerator (RIA accelerator systems. From the beginning, the code has been developed to make it useful in the three stages of a linear accelerator project, namely, the design, commissioning, and operation of the machine. To realize this concept, the code has unique features such as end-to-end simulations from the ion source to the final beam destination and automatic procedures for tuning of a multiple charge state heavy-ion beam. The TRACK code has become a general beam dynamics code for hadron linacs and has found wide applications worldwide. Until recently, the code has remained serial except for a simple parallelization used for the simulation of multiple seeds to study the machine errors. To speed up computation, the TRACK Poisson solver has been parallelized. This paper discusses different parallel models for solving the Poisson equation with the primary goal to extend the scalability of the code onto 1024 and more processors of the new generation of supercomputers known as BlueGene (BG/L. Domain decomposition techniques have been adapted and incorporated into the parallel version of the TRACK code. To demonstrate the new capabilities of the parallelized TRACK code, the dynamics of a 45 mA proton beam represented by 10^{8} particles has been simulated through the 325 MHz radio frequency quadrupole and initial accelerator section of the proposed FNAL proton driver. The results show the benefits and advantages of large-scale parallel computing in beam dynamics simulations.
Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian
2018-01-01
We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
Parallelization of a numerical simulation code for isotropic turbulence
International Nuclear Information System (INIS)
Sato, Shigeru; Yokokawa, Mitsuo; Watanabe, Tadashi; Kaburaki, Hideo.
1996-03-01
A parallel pseudospectral code which solves the three-dimensional Navier-Stokes equation by direct numerical simulation is developed and execution time, parallelization efficiency, load balance and scalability are evaluated. A vector parallel supercomputer, Fujitsu VPP500 with up to 16 processors is used for this calculation for Fourier modes up to 256x256x256 using 16 processors. Good scalability for number of processors is achieved when number of Fourier mode is fixed. For small Fourier modes, calculation time of the program is proportional to NlogN which is ideal complexity of calculation for 3D-FFT on vector parallel processors. It is found that the calculation performance decreases as the increase of the Fourier modes. (author)
Siretskiy, Alexey; Sundqvist, Tore; Voznesenskiy, Mikhail; Spjuth, Ola
2015-01-01
New high-throughput technologies, such as massively parallel sequencing, have transformed the life sciences into a data-intensive field. The most common e-infrastructure for analyzing this data consists of batch systems that are based on high-performance computing resources; however, the bioinformatics software that is built on this platform does not scale well in the general case. Recently, the Hadoop platform has emerged as an interesting option to address the challenges of increasingly large datasets with distributed storage, distributed processing, built-in data locality, fault tolerance, and an appealing programming methodology. In this work we introduce metrics and report on a quantitative comparison between Hadoop and a single node of conventional high-performance computing resources for the tasks of short read mapping and variant calling. We calculate efficiency as a function of data size and observe that the Hadoop platform is more efficient for biologically relevant data sizes in terms of computing hours for both split and un-split data files. We also quantify the advantages of the data locality provided by Hadoop for NGS problems, and show that a classical architecture with network-attached storage will not scale when computing resources increase in numbers. Measurements were performed using ten datasets of different sizes, up to 100 gigabases, using the pipeline implemented in Crossbow. To make a fair comparison, we implemented an improved preprocessor for Hadoop with better performance for splittable data files. For improved usability, we implemented a graphical user interface for Crossbow in a private cloud environment using the CloudGene platform. All of the code and data in this study are freely available as open source in public repositories. From our experiments we can conclude that the improved Hadoop pipeline scales better than the same pipeline on high-performance computing resources, we also conclude that Hadoop is an economically viable
Directory of Open Access Journals (Sweden)
Jiayi Wu
Full Text Available Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in the assessment of structural heterogeneity. However, traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, may classify images into wrong classes with decreasing signal-to-noise-ratio (SNR in the image data, yet demand increased computational costs. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data processing. Here we introduce an unsupervised single-particle clustering algorithm derived from a statistical manifold learning framework called generative topographic mapping (GTM. We show that unsupervised GTM clustering improves classification accuracy by about 40% in the absence of input references for data with lower SNRs. Applications to several experimental datasets suggest that our algorithm can detect subtle structural differences among classes via a hierarchical clustering strategy. After code optimization over a high-performance computing (HPC environment, our software implementation was able to generate thousands of reference-free class averages within hours in a massively parallel fashion, which allows a significant improvement on ab initio 3D reconstruction and assists in the computational purification of homogeneous datasets for high-resolution visualization.
Wu, Jiayi; Ma, Yong-Bei; Congdon, Charles; Brett, Bevin; Chen, Shuobing; Xu, Yaofang; Ouyang, Qi; Mao, Youdong
2017-01-01
Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM) data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in the assessment of structural heterogeneity. However, traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, may classify images into wrong classes with decreasing signal-to-noise-ratio (SNR) in the image data, yet demand increased computational costs. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data processing. Here we introduce an unsupervised single-particle clustering algorithm derived from a statistical manifold learning framework called generative topographic mapping (GTM). We show that unsupervised GTM clustering improves classification accuracy by about 40% in the absence of input references for data with lower SNRs. Applications to several experimental datasets suggest that our algorithm can detect subtle structural differences among classes via a hierarchical clustering strategy. After code optimization over a high-performance computing (HPC) environment, our software implementation was able to generate thousands of reference-free class averages within hours in a massively parallel fashion, which allows a significant improvement on ab initio 3D reconstruction and assists in the computational purification of homogeneous datasets for high-resolution visualization.
Directory of Open Access Journals (Sweden)
Cord Drögemüller
2010-08-01
Full Text Available Arachnomelia is a monogenic recessive defect of skeletal development in cattle. The causative mutation was previously mapped to a ∼7 Mb interval on chromosome 5. Here we show that array-based sequence capture and massively parallel sequencing technology, combined with the typical family structure in livestock populations, facilitates the identification of the causative mutation. We re-sequenced the entire critical interval in a healthy partially inbred cow carrying one copy of the critical chromosome segment in its ancestral state and one copy of the same segment with the arachnomelia mutation, and we detected a single heterozygous position. The genetic makeup of several partially inbred cattle provides extremely strong support for the causality of this mutation. The mutation represents a single base insertion leading to a premature stop codon in the coding sequence of the SUOX gene and is perfectly associated with the arachnomelia phenotype. Our findings suggest an important role for sulfite oxidase in bone development.
Drögemüller, Cord; Tetens, Jens; Sigurdsson, Snaevar; Gentile, Arcangelo; Testoni, Stefania; Lindblad-Toh, Kerstin; Leeb, Tosso
2010-01-01
Arachnomelia is a monogenic recessive defect of skeletal development in cattle. The causative mutation was previously mapped to a ∼7 Mb interval on chromosome 5. Here we show that array-based sequence capture and massively parallel sequencing technology, combined with the typical family structure in livestock populations, facilitates the identification of the causative mutation. We re-sequenced the entire critical interval in a healthy partially inbred cow carrying one copy of the critical chromosome segment in its ancestral state and one copy of the same segment with the arachnomelia mutation, and we detected a single heterozygous position. The genetic makeup of several partially inbred cattle provides extremely strong support for the causality of this mutation. The mutation represents a single base insertion leading to a premature stop codon in the coding sequence of the SUOX gene and is perfectly associated with the arachnomelia phenotype. Our findings suggest an important role for sulfite oxidase in bone development. PMID:20865119
Tolerating correlated failures in Massively Parallel Stream Processing Engines
DEFF Research Database (Denmark)
Su, L.; Zhou, Y.
2016-01-01
Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the o......Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint....... On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE...
ASSET: Analysis of Sequences of Synchronous Events in Massively Parallel Spike Trains
Canova, Carlos; Denker, Michael; Gerstein, George; Helias, Moritz
2016-01-01
With the ability to observe the activity from large numbers of neurons simultaneously using modern recording technologies, the chance to identify sub-networks involved in coordinated processing increases. Sequences of synchronous spike events (SSEs) constitute one type of such coordinated spiking that propagates activity in a temporally precise manner. The synfire chain was proposed as one potential model for such network processing. Previous work introduced a method for visualization of SSEs in massively parallel spike trains, based on an intersection matrix that contains in each entry the degree of overlap of active neurons in two corresponding time bins. Repeated SSEs are reflected in the matrix as diagonal structures of high overlap values. The method as such, however, leaves the task of identifying these diagonal structures to visual inspection rather than to a quantitative analysis. Here we present ASSET (Analysis of Sequences of Synchronous EvenTs), an improved, fully automated method which determines diagonal structures in the intersection matrix by a robust mathematical procedure. The method consists of a sequence of steps that i) assess which entries in the matrix potentially belong to a diagonal structure, ii) cluster these entries into individual diagonal structures and iii) determine the neurons composing the associated SSEs. We employ parallel point processes generated by stochastic simulations as test data to demonstrate the performance of the method under a wide range of realistic scenarios, including different types of non-stationarity of the spiking activity and different correlation structures. Finally, the ability of the method to discover SSEs is demonstrated on complex data from large network simulations with embedded synfire chains. Thus, ASSET represents an effective and efficient tool to analyze massively parallel spike data for temporal sequences of synchronous activity. PMID:27420734
A Massively Parallel Solver for the Mechanical Harmonic Analysis of Accelerator Cavities
International Nuclear Information System (INIS)
2015-01-01
ACE3P is a 3D massively parallel simulation suite that developed at SLAC National Accelerator Laboratory that can perform coupled electromagnetic, thermal and mechanical study. Effectively utilizing supercomputer resources, ACE3P has become a key simulation tool for particle accelerator R and D. A new frequency domain solver to perform mechanical harmonic response analysis of accelerator components is developed within the existing parallel framework. This solver is designed to determine the frequency response of the mechanical system to external harmonic excitations for time-efficient accurate analysis of the large-scale problems. Coupled with the ACE3P electromagnetic modules, this capability complements a set of multi-physics tools for a comprehensive study of microphonics in superconducting accelerating cavities in order to understand the RF response and feedback requirements for the operational reliability of a particle accelerator. (auth)
Directory of Open Access Journals (Sweden)
Ivone U. S. Leong
2014-06-01
Full Text Available Sudden cardiac death in people between the ages of 1–40 years is a devastating event and is frequently caused by several heritable cardiac disorders. These disorders include cardiac ion channelopathies, such as long QT syndrome, catecholaminergic polymorphic ventricular tachycardia and Brugada syndrome and cardiomyopathies, such as hypertrophic cardiomyopathy and arrhythmogenic right ventricular cardiomyopathy. Through careful molecular genetic evaluation of DNA from sudden death victims, the causative gene mutation can be uncovered, and the rest of the family can be screened and preventative measures implemented in at-risk individuals. The current screening approach in most diagnostic laboratories uses Sanger-based sequencing; however, this method is time consuming and labour intensive. The development of massively parallel sequencing has made it possible to produce millions of sequence reads simultaneously and is potentially an ideal approach to screen for mutations in genes that are associated with sudden cardiac death. This approach offers mutation screening at reduced cost and turnaround time. Here, we will review the current commercially available enrichment kits, massively parallel sequencing (MPS platforms, downstream data analysis and its application to sudden cardiac death in a diagnostic environment.
Optimisation of a parallel ocean general circulation model
Beare, M. I.; Stevens, D. P.
1997-10-01
This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
Development of burnup methods and capabilities in Monte Carlo code RMC
International Nuclear Information System (INIS)
She, Ding; Liu, Yuxuan; Wang, Kan; Yu, Ganglin; Forget, Benoit; Romano, Paul K.; Smith, Kord
2013-01-01
Highlights: ► The RMC code has been developed aiming at large-scale burnup calculations. ► Matrix exponential methods are employed to solve the depletion equations. ► The Energy-Bin method reduces the time expense of treating ACE libraries. ► The Cell-Mapping method is efficient to handle massive amounts of tally cells. ► Parallelized depletion is necessary for massive amounts of burnup regions. -- Abstract: The Monte Carlo burnup calculation has always been a challenging problem because of its large time consumption when applied to full-scale assembly or core calculations, and thus its application in routine analysis is limited. Most existing MC burnup codes are usually external wrappers between a MC code, e.g. MCNP, and a depletion code, e.g. ORIGEN. The code RMC is a newly developed MC code with an embedded depletion module aimed at performing burnup calculations of large-scale problems with high efficiency. Several measures have been taken to strengthen the burnup capabilities of RMC. Firstly, an accurate and efficient depletion module called DEPTH has been developed and built in, which employs the rational approximation and polynomial approximation methods. Secondly, the Energy-Bin method and the Cell-Mapping method are implemented to speed up the transport calculations with large numbers of nuclides and tally cells. Thirdly, the batch tally method and the parallelized depletion module have been utilized to better handle cases with massive amounts of burnup regions in parallel calculations. Burnup cases including a PWR pin and a 5 × 5 assembly group are calculated, thereby demonstrating the burnup capabilities of the RMC code. In addition, the computational time and memory requirements of RMC are compared with other MC burnup codes.
Parallel community climate model: Description and user`s guide
Energy Technology Data Exchange (ETDEWEB)
Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H. [and others
1996-07-15
This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
A portable, parallel, object-oriented Monte Carlo neutron transport code in C++
International Nuclear Information System (INIS)
Lee, S.R.; Cummings, J.C.; Nolen, S.D.
1997-01-01
We have developed a multi-group Monte Carlo neutron transport code using C++ and the Parallel Object-Oriented Methods and Applications (POOMA) class library. This transport code, called MC++, currently computes k and α-eigenvalues and is portable to and runs parallel on a wide variety of platforms, including MPPs, clustered SMPs, and individual workstations. It contains appropriate classes and abstractions for particle transport and, through the use of POOMA, for portable parallelism. Current capabilities of MC++ are discussed, along with physics and performance results on a variety of hardware, including all Accelerated Strategic Computing Initiative (ASCI) hardware. Current parallel performance indicates the ability to compute α-eigenvalues in seconds to minutes rather than hours to days. Future plans and the implementation of a general transport physics framework are also discussed
Hierarchical Image Segmentation of Remotely Sensed Data using Massively Parallel GNU-LINUX Software
Tilton, James C.
2003-01-01
A hierarchical set of image segmentations is a set of several image segmentations of the same image at different levels of detail in which the segmentations at coarser levels of detail can be produced from simple merges of regions at finer levels of detail. In [1], Tilton, et a1 describes an approach for producing hierarchical segmentations (called HSEG) and gave a progress report on exploiting these hierarchical segmentations for image information mining. The HSEG algorithm is a hybrid of region growing and constrained spectral clustering that produces a hierarchical set of image segmentations based on detected convergence points. In the main, HSEG employs the hierarchical stepwise optimization (HSWO) approach to region growing, which was described as early as 1989 by Beaulieu and Goldberg. The HSWO approach seeks to produce segmentations that are more optimized than those produced by more classic approaches to region growing (e.g. Horowitz and T. Pavlidis, [3]). In addition, HSEG optionally interjects between HSWO region growing iterations, merges between spatially non-adjacent regions (i.e., spectrally based merging or clustering) constrained by a threshold derived from the previous HSWO region growing iteration. While the addition of constrained spectral clustering improves the utility of the segmentation results, especially for larger images, it also significantly increases HSEG s computational requirements. To counteract this, a computationally efficient recursive, divide-and-conquer, implementation of HSEG (RHSEG) was devised, which includes special code to avoid processing artifacts caused by RHSEG s recursive subdivision of the image data. The recursive nature of RHSEG makes for a straightforward parallel implementation. This paper describes the HSEG algorithm, its recursive formulation (referred to as RHSEG), and the implementation of RHSEG using massively parallel GNU-LINUX software. Results with Landsat TM data are included comparing RHSEG with classic
From Massively Parallel Algorithms and Fluctuating Time Horizons to Nonequilibrium Surface Growth
International Nuclear Information System (INIS)
Korniss, G.; Toroczkai, Z.; Novotny, M. A.; Rikvold, P. A.
2000-01-01
We study the asymptotic scaling properties of a massively parallel algorithm for discrete-event simulations where the discrete events are Poisson arrivals. The evolution of the simulated time horizon is analogous to a nonequilibrium surface. Monte Carlo simulations and a coarse-grained approximation indicate that the macroscopic landscape in the steady state is governed by the Edwards-Wilkinson Hamiltonian. Since the efficiency of the algorithm corresponds to the density of local minima in the associated surface, our results imply that the algorithm is asymptotically scalable. (c) 2000 The American Physical Society
Passive and partially active fault tolerance for massively parallel stream processing engines
DEFF Research Database (Denmark)
Su, Li; Zhou, Yongluan
2018-01-01
. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE...... also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness...
Parallel Computing Characteristics of CUPID code under MPI and Hybrid environment
Energy Technology Data Exchange (ETDEWEB)
Lee, Jae Ryong; Yoon, Han Young [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of); Jeon, Byoung Jin; Choi, Hyoung Gwon [Seoul National Univ. of Science and Technology, Seoul (Korea, Republic of)
2014-05-15
In this paper, a characteristic of parallel algorithm is presented for solving an elliptic type equation of CUPID via domain decomposition method using the MPI and the parallel performance is estimated in terms of a scalability which shows the speedup ratio. In addition, the time-consuming pattern of major subroutines is studied. Two different grid systems are taken into account: 40,000 meshes for coarse system and 320,000 meshes for fine system. Since the matrix of the CUPID code differs according to whether the flow is single-phase or two-phase, the effect of matrix shape is evaluated. Finally, the effect of the preconditioner for matrix solver is also investigated. Finally, the hybrid (OpenMP+MPI) parallel algorithm is introduced and discussed in detail for solving pressure solver. Component-scale thermal-hydraulics code, CUPID has been developed for two-phase flow analysis, which adopts a three-dimensional, transient, three-field model, and parallelized to fulfill a recent demand for long-transient and highly resolved multi-phase flow behavior. In this study, the parallel performance of the CUPID code was investigated in terms of scalability. The CUPID code was parallelized with domain decomposition method. The MPI library was adopted to communicate the information at the neighboring domain. For managing the sparse matrix effectively, the CSR storage format is used. To take into account the characteristics of the pressure matrix which turns to be asymmetric for two-phase flow, both single-phase and two-phase calculations were run. In addition, the effect of the matrix size and preconditioning was also investigated. The fine mesh calculation shows better scalability than the coarse mesh because the number of coarse mesh does not need to decompose the computational domain excessively. The fine mesh can be present good scalability when dividing geometry with considering the ratio between computation and communication time. For a given mesh, single-phase flow
Massively Parallel Interrogation of Aptamer Sequence, Structure and Function
Energy Technology Data Exchange (ETDEWEB)
Fischer, N O; Tok, J B; Tarasow, T M
2008-02-08
Optimization of high affinity reagents is a significant bottleneck in medicine and the life sciences. The ability to synthetically create thousands of permutations of a lead high-affinity reagent and survey the properties of individual permutations in parallel could potentially relieve this bottleneck. Aptamers are single stranded oligonucleotides affinity reagents isolated by in vitro selection processes and as a class have been shown to bind a wide variety of target molecules. Methodology/Principal Findings. High density DNA microarray technology was used to synthesize, in situ, arrays of approximately 3,900 aptamer sequence permutations in triplicate. These sequences were interrogated on-chip for their ability to bind the fluorescently-labeled cognate target, immunoglobulin E, resulting in the parallel execution of thousands of experiments. Fluorescence intensity at each array feature was well resolved and shown to be a function of the sequence present. The data demonstrated high intra- and interchip correlation between the same features as well as among the sequence triplicates within a single array. Consistent with aptamer mediated IgE binding, fluorescence intensity correlated strongly with specific aptamer sequences and the concentration of IgE applied to the array. The massively parallel sequence-function analyses provided by this approach confirmed the importance of a consensus sequence found in all 21 of the original IgE aptamer sequences and support a common stem:loop structure as being the secondary structure underlying IgE binding. The microarray application, data and results presented illustrate an efficient, high information content approach to optimizing aptamer function. It also provides a foundation from which to better understand and manipulate this important class of high affinity biomolecules.
Massively parallel interrogation of aptamer sequence, structure and function.
Directory of Open Access Journals (Sweden)
Nicholas O Fischer
Full Text Available BACKGROUND: Optimization of high affinity reagents is a significant bottleneck in medicine and the life sciences. The ability to synthetically create thousands of permutations of a lead high-affinity reagent and survey the properties of individual permutations in parallel could potentially relieve this bottleneck. Aptamers are single stranded oligonucleotides affinity reagents isolated by in vitro selection processes and as a class have been shown to bind a wide variety of target molecules. METHODOLOGY/PRINCIPAL FINDINGS: High density DNA microarray technology was used to synthesize, in situ, arrays of approximately 3,900 aptamer sequence permutations in triplicate. These sequences were interrogated on-chip for their ability to bind the fluorescently-labeled cognate target, immunoglobulin E, resulting in the parallel execution of thousands of experiments. Fluorescence intensity at each array feature was well resolved and shown to be a function of the sequence present. The data demonstrated high intra- and inter-chip correlation between the same features as well as among the sequence triplicates within a single array. Consistent with aptamer mediated IgE binding, fluorescence intensity correlated strongly with specific aptamer sequences and the concentration of IgE applied to the array. CONCLUSION AND SIGNIFICANCE: The massively parallel sequence-function analyses provided by this approach confirmed the importance of a consensus sequence found in all 21 of the original IgE aptamer sequences and support a common stem:loop structure as being the secondary structure underlying IgE binding. The microarray application, data and results presented illustrate an efficient, high information content approach to optimizing aptamer function. It also provides a foundation from which to better understand and manipulate this important class of high affinity biomolecules.
Data Parallel Line Relaxation (DPLR) Code User Manual: Acadia - Version 4.01.1
Wright, Michael J.; White, Todd; Mangini, Nancy
2009-01-01
Data-Parallel Line Relaxation (DPLR) code is a computational fluid dynamic (CFD) solver that was developed at NASA Ames Research Center to help mission support teams generate high-value predictive solutions for hypersonic flow field problems. The DPLR Code Package is an MPI-based, parallel, full three-dimensional Navier-Stokes CFD solver with generalized models for finite-rate reaction kinetics, thermal and chemical non-equilibrium, accurate high-temperature transport coefficients, and ionized flow physics incorporated into the code. DPLR also includes a large selection of generalized realistic surface boundary conditions and links to enable loose coupling with external thermal protection system (TPS) material response and shock layer radiation codes.
A Parallel Numerical Micromagnetic Code Using FEniCS
Nagy, L.; Williams, W.; Mitchell, L.
2013-12-01
Many problems in the geosciences depend on understanding the ability of magnetic minerals to provide stable paleomagnetic recordings. Numerical micromagnetic modelling allows us to calculate the domain structures found in naturally occurring magnetic materials. However the computational cost rises exceedingly quickly with respect to the size and complexity of the geometries that we wish to model. This problem is compounded by the fact that the modern processor design no longer focuses on the speed at which calculations are performed, but rather on the number of computational units amongst which we may distribute our calculations. Consequently to better exploit modern computational resources our micromagnetic simulations must "go parallel". We present a parallel and scalable micromagnetics code written using FEniCS. FEniCS is a multinational collaboration involving several institutions (University of Cambridge, University of Chicago, The Simula Research Laboratory, etc.) that aims to provide a set of tools for writing scientific software; in particular software that employs the finite element method. The advantages of this approach are the leveraging of pre-existing projects from the world of scientific computing (PETSc, Trilinos, Metis/Parmetis, etc.) and exposing these so that researchers may pose problems in a manner closer to the mathematical language of their domain. Our code provides a scriptable interface (in Python) that allows users to not only run micromagnetic models in parallel, but also to perform pre/post processing of data.
Step by step parallel programming method for molecular dynamics code
International Nuclear Information System (INIS)
Orii, Shigeo; Ohta, Toshio
1996-07-01
Parallel programming for a numerical simulation program of molecular dynamics is carried out with a step-by-step programming technique using the two phase method. As a result, within the range of a certain computing parameters, it is found to obtain parallel performance by using the level of parallel programming which decomposes the calculation according to indices of do-loops into each processor on the vector parallel computer VPP500 and the scalar parallel computer Paragon. It is also found that VPP500 shows parallel performance in wider range computing parameters. The reason is that the time cost of the program parts, which can not be reduced by the do-loop level of the parallel programming, can be reduced to the negligible level by the vectorization. After that, the time consuming parts of the program are concentrated on less parts that can be accelerated by the do-loop level of the parallel programming. This report shows the step-by-step parallel programming method and the parallel performance of the molecular dynamics code on VPP500 and Paragon. (author)
Parallel computers and three-dimensional computational electromagnetics
International Nuclear Information System (INIS)
Madsen, N.K.
1994-01-01
The authors have continued to enhance their ability to use new massively parallel processing computers to solve time-domain electromagnetic problems. New vectorization techniques have improved the performance of their code DSI3D by factors of 5 to 15, depending on the computer used. New radiation boundary conditions and far-field transformations now allow the computation of radar cross-section values for complex objects. A new parallel-data extraction code has been developed that allows the extraction of data subsets from large problems, which have been run on parallel computers, for subsequent post-processing on workstations with enhanced graphics capabilities. A new charged-particle-pushing version of DSI3D is under development. Finally, DSI3D has become a focal point for several new Cooperative Research and Development Agreement activities with industrial companies such as Lockheed Advanced Development Company, Varian, Hughes Electron Dynamics Division, General Atomic, and Cray
Performance studies of the parallel VIM code
International Nuclear Information System (INIS)
Shi, B.; Blomquist, R.N.
1996-01-01
In this paper, the authors evaluate the performance of the parallel version of the VIM Monte Carlo code on the IBM SPx at the High Performance Computing Research Facility at ANL. Three test problems with contrasting computational characteristics were used to assess effects in performance. A statistical method for estimating the inefficiencies due to load imbalance and communication is also introduced. VIM is a large scale continuous energy Monte Carlo radiation transport program and was parallelized using history partitioning, the master/worker approach, and p4 message passing library. Dynamic load balancing is accomplished when the master processor assigns chunks of histories to workers that have completed a previously assigned task, accommodating variations in the lengths of histories, processor speeds, and worker loads. At the end of each batch (generation), the fission sites and tallies are sent from each worker to the master process, contributing to the parallel inefficiency. All communications are between master and workers, and are serial. The SPx is a scalable 128-node parallel supercomputer with high-performance Omega switches of 63 microsec latency and 35 MBytes/sec bandwidth. For uniform and reproducible performance, they used only the 120 identical regular processors (IBM RS/6000) and excluded the remaining eight planet nodes, which may be loaded by other's jobs
DGDFT: A massively parallel method for large scale density functional theory calculations.
Hu, Wei; Lin, Lin; Yang, Chao
2015-09-28
We describe a massively parallel implementation of the recently developed discontinuous Galerkin density functional theory (DGDFT) method, for efficient large-scale Kohn-Sham DFT based electronic structure calculations. The DGDFT method uses adaptive local basis (ALB) functions generated on-the-fly during the self-consistent field iteration to represent the solution to the Kohn-Sham equations. The use of the ALB set provides a systematic way to improve the accuracy of the approximation. By using the pole expansion and selected inversion technique to compute electron density, energy, and atomic forces, we can make the computational complexity of DGDFT scale at most quadratically with respect to the number of electrons for both insulating and metallic systems. We show that for the two-dimensional (2D) phosphorene systems studied here, using 37 basis functions per atom allows us to reach an accuracy level of 1.3 × 10(-4) Hartree/atom in terms of the error of energy and 6.2 × 10(-4) Hartree/bohr in terms of the error of atomic force, respectively. DGDFT can achieve 80% parallel efficiency on 128,000 high performance computing cores when it is used to study the electronic structure of 2D phosphorene systems with 3500-14 000 atoms. This high parallel efficiency results from a two-level parallelization scheme that we will describe in detail.
DGDFT: A massively parallel method for large scale density functional theory calculations
International Nuclear Information System (INIS)
Hu, Wei; Yang, Chao; Lin, Lin
2015-01-01
We describe a massively parallel implementation of the recently developed discontinuous Galerkin density functional theory (DGDFT) method, for efficient large-scale Kohn-Sham DFT based electronic structure calculations. The DGDFT method uses adaptive local basis (ALB) functions generated on-the-fly during the self-consistent field iteration to represent the solution to the Kohn-Sham equations. The use of the ALB set provides a systematic way to improve the accuracy of the approximation. By using the pole expansion and selected inversion technique to compute electron density, energy, and atomic forces, we can make the computational complexity of DGDFT scale at most quadratically with respect to the number of electrons for both insulating and metallic systems. We show that for the two-dimensional (2D) phosphorene systems studied here, using 37 basis functions per atom allows us to reach an accuracy level of 1.3 × 10 −4 Hartree/atom in terms of the error of energy and 6.2 × 10 −4 Hartree/bohr in terms of the error of atomic force, respectively. DGDFT can achieve 80% parallel efficiency on 128,000 high performance computing cores when it is used to study the electronic structure of 2D phosphorene systems with 3500-14 000 atoms. This high parallel efficiency results from a two-level parallelization scheme that we will describe in detail
DGDFT: A massively parallel method for large scale density functional theory calculations
Energy Technology Data Exchange (ETDEWEB)
Hu, Wei, E-mail: whu@lbl.gov; Yang, Chao, E-mail: cyang@lbl.gov [Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720 (United States); Lin, Lin, E-mail: linlin@math.berkeley.edu [Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720 (United States); Department of Mathematics, University of California, Berkeley, California 94720 (United States)
2015-09-28
We describe a massively parallel implementation of the recently developed discontinuous Galerkin density functional theory (DGDFT) method, for efficient large-scale Kohn-Sham DFT based electronic structure calculations. The DGDFT method uses adaptive local basis (ALB) functions generated on-the-fly during the self-consistent field iteration to represent the solution to the Kohn-Sham equations. The use of the ALB set provides a systematic way to improve the accuracy of the approximation. By using the pole expansion and selected inversion technique to compute electron density, energy, and atomic forces, we can make the computational complexity of DGDFT scale at most quadratically with respect to the number of electrons for both insulating and metallic systems. We show that for the two-dimensional (2D) phosphorene systems studied here, using 37 basis functions per atom allows us to reach an accuracy level of 1.3 × 10{sup −4} Hartree/atom in terms of the error of energy and 6.2 × 10{sup −4} Hartree/bohr in terms of the error of atomic force, respectively. DGDFT can achieve 80% parallel efficiency on 128,000 high performance computing cores when it is used to study the electronic structure of 2D phosphorene systems with 3500-14 000 atoms. This high parallel efficiency results from a two-level parallelization scheme that we will describe in detail.
Optimisation of a parallel ocean general circulation model
Directory of Open Access Journals (Sweden)
M. I. Beare
1997-10-01
Full Text Available This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
Optimisation of a parallel ocean general circulation model
Directory of Open Access Journals (Sweden)
M. I. Beare
Full Text Available This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
A Self Consistent Multiprocessor Space Charge Algorithm that is Almost Embarrassingly Parallel
International Nuclear Information System (INIS)
Nissen, Edward; Erdelyi, B.; Manikonda, S.L.
2012-01-01
We present a space charge code that is self consistent, massively parallelizeable, and requires very little communication between computer nodes; making the calculation almost embarrassingly parallel. This method is implemented in the code COSY Infinity where the differential algebras used in this code are important to the algorithm's proper functioning. The method works by calculating the self consistent space charge distribution using the statistical moments of the test particles, and converting them into polynomial series coefficients. These coefficients are combined with differential algebraic integrals to form the potential, and electric fields. The result is a map which contains the effects of space charge. This method allows for massive parallelization since its statistics based solver doesn't require any binning of particles, and only requires a vector containing the partial sums of the statistical moments for the different nodes to be passed. All other calculations are done independently. The resulting maps can be used to analyze the system using normal form analysis, as well as advance particles in numbers and at speeds that were previously impossible.
Scalable Strategies for Computing with Massive Data
Directory of Open Access Journals (Sweden)
Michael Kane
2013-11-01
Full Text Available This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
A data parallel pseudo-spectral semi-implicit magnetohydrodynamics code
Keppens, R.; Poedts, S.; Meijer, P. M.; Goedbloed, J. P.; Hertzberger, B.; Sloot, P.
1997-01-01
The set of eight nonlinear partial differential equations of magnetohydrodynamics (MHD) is used for time dependent simulations of three-dimensional (3D) fluid flow in a magnetic field. A data parallel code is presented, which integrates the MHD equations in cylindrical geometry, combining a
An Expert System for the Development of Efficient Parallel Code
Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit
2004-01-01
We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.
International Nuclear Information System (INIS)
Andreo, P.
1996-01-01
Research on plane-parallel ionization chambers since the IAEA Code of Practice (TRS-277) was published in 1987 has expanded our knowledge on perturbation and other correction factors in ionization chamber dosimeter, and also constructional details of these chambers have been shown to be important. Different national organizations have published, or are in the process of publishing, recommendations on detailed procedures for the calibration and use of plane-parallel ionization chambers. An international working group was formed under the auspices of the IAEA, first to assess the status and validity of IAEA TRS-277, and second to develop an international Code of Practice for the calibration and use of plane-parallel ionization chambers in high-energy electron and photon beams. The purpose of this work is to describe the forthcoming Code of Practice. (author). 39 refs, 3 figs, 2 tabs
Energy Technology Data Exchange (ETDEWEB)
Andreo, P [Lunds Hospital, Lund (Sweden). Radiophysics Dept.; Almond, P R [J.G. Brown Cancer Center, Univ. of Lousville, Lousville, KY (United States). Dept. of Radiation Oncology; Mattsson, O [Sahlgrenska Hospital, Gothenburg (Sweden). Dept. of Radiation Physics; Nahum, A E [Royal Marsden Hospital, Sutton (United Kingdom). Joint Dept. of Physics; Roos, M [Physikalisch-Technische Bundesanstalt, Braunschweig (Germany)
1996-08-01
Research on plane-parallel ionization chambers since the IAEA Code of Practice (TRS-277) was published in 1987 has expanded our knowledge on perturbation and other correction factors in ionization chamber dosimeter, and also constructional details of these chambers have been shown to be important. Different national organizations have published, or are in the process of publishing, recommendations on detailed procedures for the calibration and use of plane-parallel ionization chambers. An international working group was formed under the auspices of the IAEA, first to assess the status and validity of IAEA TRS-277, and second to develop an international Code of Practice for the calibration and use of plane-parallel ionization chambers in high-energy electron and photon beams. The purpose of this work is to describe the forthcoming Code of Practice. (author). 39 refs, 3 figs, 2 tabs.
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels ranging from the fomulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tend to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present he basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show to to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show to to apply these methods i...
Lagardère, Louis; Jolly, Luc-Henri; Lipparini, Filippo; Aviat, Félix; Stamm, Benjamin; Jing, Zhifeng F; Harger, Matthew; Torabifard, Hedieh; Cisneros, G Andrés; Schnieders, Michael J; Gresh, Nohad; Maday, Yvon; Ren, Pengyu Y; Ponder, Jay W; Piquemal, Jean-Philip
2018-01-28
We present Tinker-HP, a massively MPI parallel package dedicated to classical molecular dynamics (MD) and to multiscale simulations, using advanced polarizable force fields (PFF) encompassing distributed multipoles electrostatics. Tinker-HP is an evolution of the popular Tinker package code that conserves its simplicity of use and its reference double precision implementation for CPUs. Grounded on interdisciplinary efforts with applied mathematics, Tinker-HP allows for long polarizable MD simulations on large systems up to millions of atoms. We detail in the paper the newly developed extension of massively parallel 3D spatial decomposition to point dipole polarizable models as well as their coupling to efficient Krylov iterative and non-iterative polarization solvers. The design of the code allows the use of various computer systems ranging from laboratory workstations to modern petascale supercomputers with thousands of cores. Tinker-HP proposes therefore the first high-performance scalable CPU computing environment for the development of next generation point dipole PFFs and for production simulations. Strategies linking Tinker-HP to Quantum Mechanics (QM) in the framework of multiscale polarizable self-consistent QM/MD simulations are also provided. The possibilities, performances and scalability of the software are demonstrated via benchmarks calculations using the polarizable AMOEBA force field on systems ranging from large water boxes of increasing size and ionic liquids to (very) large biosystems encompassing several proteins as well as the complete satellite tobacco mosaic virus and ribosome structures. For small systems, Tinker-HP appears to be competitive with the Tinker-OpenMM GPU implementation of Tinker. As the system size grows, Tinker-HP remains operational thanks to its access to distributed memory and takes advantage of its new algorithmic enabling for stable long timescale polarizable simulations. Overall, a several thousand-fold acceleration over
Multiplexed microsatellite recovery using massively parallel sequencing
Jennings, T.N.; Knaus, B.J.; Mullins, T.D.; Haig, S.M.; Cronn, R.C.
2011-01-01
Conservation and management of natural populations requires accurate and inexpensive genotyping methods. Traditional microsatellite, or simple sequence repeat (SSR), marker analysis remains a popular genotyping method because of the comparatively low cost of marker development, ease of analysis and high power of genotype discrimination. With the availability of massively parallel sequencing (MPS), it is now possible to sequence microsatellite-enriched genomic libraries in multiplex pools. To test this approach, we prepared seven microsatellite-enriched, barcoded genomic libraries from diverse taxa (two conifer trees, five birds) and sequenced these on one lane of the Illumina Genome Analyzer using paired-end 80-bp reads. In this experiment, we screened 6.1 million sequences and identified 356958 unique microreads that contained di- or trinucleotide microsatellites. Examination of four species shows that our conversion rate from raw sequences to polymorphic markers compares favourably to Sanger- and 454-based methods. The advantage of multiplexed MPS is that the staggering capacity of modern microread sequencing is spread across many libraries; this reduces sample preparation and sequencing costs to less than $400 (USD) per species. This price is sufficiently low that microsatellite libraries could be prepared and sequenced for all 1373 organisms listed as 'threatened' and 'endangered' in the United States for under $0.5M (USD).
GPAW - massively parallel electronic structure calculations with Python-based software
DEFF Research Database (Denmark)
Enkovaara, Jussi; Romero, Nichols A.; Shende, Sameer
2011-01-01
of the productivity enhancing features together with a good numerical performance. We have used this approach in implementing an electronic structure simulation software GPAW using the combination of Python and C programming languages. While the chosen approach works well in standard workstations and Unix...... popular choice. While dynamic, interpreted languages, such as Python, can increase the effciency of programmer, they cannot compete directly with the raw performance of compiled languages. However, by using an interpreted language together with a compiled language, it is possible to have most...... environments, massively parallel supercomputing systems can present some challenges in porting, debugging and profiling the software. In this paper we describe some details of the implementation and discuss the advantages and challenges of the combined Python/C approach. We show that despite the challenges...
Development of Parallel Code for the Alaska Tsunami Forecast Model
Bahng, B.; Knight, W. R.; Whitmore, P.
2014-12-01
The Alaska Tsunami Forecast Model (ATFM) is a numerical model used to forecast propagation and inundation of tsunamis generated by earthquakes and other means in both the Pacific and Atlantic Oceans. At the U.S. National Tsunami Warning Center (NTWC), the model is mainly used in a pre-computed fashion. That is, results for hundreds of hypothetical events are computed before alerts, and are accessed and calibrated with observations during tsunamis to immediately produce forecasts. ATFM uses the non-linear, depth-averaged, shallow-water equations of motion with multiply nested grids in two-way communications between domains of each parent-child pair as waves get closer to coastal waters. Even with the pre-computation the task becomes non-trivial as sub-grid resolution gets finer. Currently, the finest resolution Digital Elevation Models (DEM) used by ATFM are 1/3 arc-seconds. With a serial code, large or multiple areas of very high resolution can produce run-times that are unrealistic even in a pre-computed approach. One way to increase the model performance is code parallelization used in conjunction with a multi-processor computing environment. NTWC developers have undertaken an ATFM code-parallelization effort to streamline the creation of the pre-computed database of results with the long term aim of tsunami forecasts from source to high resolution shoreline grids in real time. Parallelization will also permit timely regeneration of the forecast model database with new DEMs; and, will make possible future inclusion of new physics such as the non-hydrostatic treatment of tsunami propagation. The purpose of our presentation is to elaborate on the parallelization approach and to show the compute speed increase on various multi-processor systems.
Parallel computing techniques for rotorcraft aerodynamics
Ekici, Kivanc
The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).
QuASAR-MPRA: accurate allele-specific analysis for massively parallel reporter assays.
Kalita, Cynthia A; Moyerbrailean, Gregory A; Brown, Christopher; Wen, Xiaoquan; Luca, Francesca; Pique-Regi, Roger
2018-03-01
The majority of the human genome is composed of non-coding regions containing regulatory elements such as enhancers, which are crucial for controlling gene expression. Many variants associated with complex traits are in these regions, and may disrupt gene regulatory sequences. Consequently, it is important to not only identify true enhancers but also to test if a variant within an enhancer affects gene regulation. Recently, allele-specific analysis in high-throughput reporter assays, such as massively parallel reporter assays (MPRAs), have been used to functionally validate non-coding variants. However, we are still missing high-quality and robust data analysis tools for these datasets. We have further developed our method for allele-specific analysis QuASAR (quantitative allele-specific analysis of reads) to analyze allele-specific signals in barcoded read counts data from MPRA. Using this approach, we can take into account the uncertainty on the original plasmid proportions, over-dispersion, and sequencing errors. The provided allelic skew estimate and its standard error also simplifies meta-analysis of replicate experiments. Additionally, we show that a beta-binomial distribution better models the variability present in the allelic imbalance of these synthetic reporters and results in a test that is statistically well calibrated under the null. Applying this approach to the MPRA data, we found 602 SNPs with significant (false discovery rate 10%) allele-specific regulatory function in LCLs. We also show that we can combine MPRA with QuASAR estimates to validate existing experimental and computational annotations of regulatory variants. Our study shows that with appropriate data analysis tools, we can improve the power to detect allelic effects in high-throughput reporter assays. http://github.com/piquelab/QuASAR/tree/master/mpra. fluca@wayne.edu or rpique@wayne.edu. Supplementary data are available online at Bioinformatics. © The Author (2017). Published by
Rubus: A compiler for seamless and extensible parallelism
Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor
2017-01-01
Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been
Rubus: A compiler for seamless and extensible parallelism.
Directory of Open Access Journals (Sweden)
Muhammad Adnan
Full Text Available Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU, originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84
Beam dynamics simulations using a parallel version of PARMILA
International Nuclear Information System (INIS)
Ryne, R.D.
1996-01-01
The computer code PARMILA has been the primary tool for the design of proton and ion linacs in the United States for nearly three decades. Previously it was sufficient to perform simulations with of order 10000 particles, but recently the need to perform high resolution halo studies for next-generation, high intensity linacs has made it necessary to perform simulations with of order 100 million particles. With the advent of massively parallel computers such simulations are now within reach. Parallel computers already make it possible, for example, to perform beam dynamics calculations with tens of millions of particles, requiring over 10 GByte of core memory, in just a few hours. Also, parallel computers are becoming easier to use thanks to the availability of mature, Fortran-like languages such as Connection Machine Fortran and High Performance Fortran. We will describe our experience developing a parallel version of PARMILA and the performance of the new code
Beam dynamics simulations using a parallel version of PARMILA
International Nuclear Information System (INIS)
Ryne, Robert
1996-01-01
The computer code PARMILA has been the primary tool for the design of proton and ion linacs in the United States for nearly three decades. Previously it was sufficient to perform simulations with of order 10000 particles, but recently the need to perform high resolution halo studies for next-generation, high intensity linacs has made it necessary to perform simulations with of order 100 million particles. With the advent of massively parallel computers such simulations are now within reach. Parallel computers already make it possible, for example, to perform beam dynamics calculations with tens of millions of particles, requiring over 10 GByte of core memory, in just a few hours. Also, parallel computers are becoming easier to use thanks to the availability of mature, Fortran-like languages such as Connection Machine Fortran and High Performance Fortran. We will describe our experience developing a parallel version of PARMILA and the performance of the new code. (author)
Massively parallel de novo protein design for targeted therapeutics
Chevalier, Aaron
2017-09-26
De novo protein design holds promise for creating small stable proteins with shapes customized to bind therapeutic targets. We describe a massively parallel approach for designing, manufacturing and screening mini-protein binders, integrating large-scale computational design, oligonucleotide synthesis, yeast display screening and next-generation sequencing. We designed and tested 22,660 mini-proteins of 37-43 residues that target influenza haemagglutinin and botulinum neurotoxin B, along with 6,286 control sequences to probe contributions to folding and binding, and identified 2,618 high-affinity binders. Comparison of the binding and non-binding design sets, which are two orders of magnitude larger than any previously investigated, enabled the evaluation and improvement of the computational model. Biophysical characterization of a subset of the binder designs showed that they are extremely stable and, unlike antibodies, do not lose activity after exposure to high temperatures. The designs elicit little or no immune response and provide potent prophylactic and therapeutic protection against influenza, even after extensive repeated dosing.
Massively parallel de novo protein design for targeted therapeutics
Chevalier, Aaron; Silva, Daniel-Adriano; Rocklin, Gabriel J.; Hicks, Derrick R.; Vergara, Renan; Murapa, Patience; Bernard, Steffen M.; Zhang, Lu; Lam, Kwok-Ho; Yao, Guorui; Bahl, Christopher D.; Miyashita, Shin-Ichiro; Goreshnik, Inna; Fuller, James T.; Koday, Merika T.; Jenkins, Cody M.; Colvin, Tom; Carter, Lauren; Bohn, Alan; Bryan, Cassie M.; Ferná ndez-Velasco, D. Alejandro; Stewart, Lance; Dong, Min; Huang, Xuhui; Jin, Rongsheng; Wilson, Ian A.; Fuller, Deborah H.; Baker, David
2017-01-01
De novo protein design holds promise for creating small stable proteins with shapes customized to bind therapeutic targets. We describe a massively parallel approach for designing, manufacturing and screening mini-protein binders, integrating large-scale computational design, oligonucleotide synthesis, yeast display screening and next-generation sequencing. We designed and tested 22,660 mini-proteins of 37-43 residues that target influenza haemagglutinin and botulinum neurotoxin B, along with 6,286 control sequences to probe contributions to folding and binding, and identified 2,618 high-affinity binders. Comparison of the binding and non-binding design sets, which are two orders of magnitude larger than any previously investigated, enabled the evaluation and improvement of the computational model. Biophysical characterization of a subset of the binder designs showed that they are extremely stable and, unlike antibodies, do not lose activity after exposure to high temperatures. The designs elicit little or no immune response and provide potent prophylactic and therapeutic protection against influenza, even after extensive repeated dosing.
Massively parallel performance of neutron transport response matrix algorithms
International Nuclear Information System (INIS)
Hanebutte, U.R.; Lewis, E.E.
1993-01-01
Massively parallel red/black response matrix algorithms for the solution of within-group neutron transport problems are implemented on the Connection Machines-2, 200 and 5. The response matrices are dericed from the diamond-differences and linear-linear nodal discrete ordinate and variational nodal P 3 approximations. The unaccelerated performance of the iterative procedure is examined relative to the maximum rated performances of the machines. The effects of processor partitions size, of virtual processor ratio and of problems size are examined in detail. For the red/black algorithm, the ratio of inter-node communication to computing times is found to be quite small, normally of the order of ten percent or less. Performance increases with problems size and with virtual processor ratio, within the memeory per physical processor limitation. Algorithm adaptation to courser grain machines is straight-forward, with total computing time being virtually inversely proportional to the number of physical processors. (orig.)
Massively parallel de novo protein design for targeted therapeutics
Chevalier, Aaron; Silva, Daniel-Adriano; Rocklin, Gabriel J.; Hicks, Derrick R.; Vergara, Renan; Murapa, Patience; Bernard, Steffen M.; Zhang, Lu; Lam, Kwok-Ho; Yao, Guorui; Bahl, Christopher D.; Miyashita, Shin-Ichiro; Goreshnik, Inna; Fuller, James T.; Koday, Merika T.; Jenkins, Cody M.; Colvin, Tom; Carter, Lauren; Bohn, Alan; Bryan, Cassie M.; Fernández-Velasco, D. Alejandro; Stewart, Lance; Dong, Min; Huang, Xuhui; Jin, Rongsheng; Wilson, Ian A.; Fuller, Deborah H.; Baker, David
2018-01-01
De novo protein design holds promise for creating small stable proteins with shapes customized to bind therapeutic targets. We describe a massively parallel approach for designing, manufacturing and screening mini-protein binders, integrating large-scale computational design, oligonucleotide synthesis, yeast display screening and next-generation sequencing. We designed and tested 22,660 mini-proteins of 37–43 residues that target influenza haemagglutinin and botulinum neurotoxin B, along with 6,286 control sequences to probe contributions to folding and binding, and identified 2,618 high-affinity binders. Comparison of the binding and non-binding design sets, which are two orders of magnitude larger than any previously investigated, enabled the evaluation and improvement of the computational model. Biophysical characterization of a subset of the binder designs showed that they are extremely stable and, unlike antibodies, do not lose activity after exposure to high temperatures. The designs elicit little or no immune response and provide potent prophylactic and therapeutic protection against influenza, even after extensive repeated dosing. PMID:28953867
A PARALLEL MONTE CARLO CODE FOR SIMULATING COLLISIONAL N-BODY SYSTEMS
Energy Technology Data Exchange (ETDEWEB)
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A., E-mail: bharath@u.northwestern.edu [Center for Interdisciplinary Exploration and Research in Astrophysics, Northwestern University, Evanston, IL (United States)
2013-02-15
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N {approx} 10{sup 7} particles. Our code is based on the Henon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 10{sup 5} to 10{sup 7}. We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within {approx}< 0.04% throughout all simulations. We analyze the performance of the code, and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N = 10{sup 5}, 128 for N = 10{sup 6} and 256 for N = 10{sup 7}. The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60 Multiplication-Sign , 100 Multiplication-Sign , and 220 Multiplication-Sign , respectively.
International Nuclear Information System (INIS)
Kawasaki, Nobuo; Ogasawara, Shinobu; Adachi, Masaaki; Kume, Etsuo; Ishizuki, Shigeru; Tanabe, Hidenobu; Nemoto, Toshiyuki; Kawai, Wataru; Watanabe, Hideo
1999-05-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system and/or the AP3000 system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. We dealt with 14 codes in fiscal 1997. These results are reported in 3 parts, i.e., the vectorization part, the parallelization part and the porting part. In this report, we describe the vectorization. In this vectorization part, the vectorization of multidimensional two-fluid model code ACE-3D for evaluation of constitutive equations, statistical decay code SD and three-dimensional thermal analysis code for in-core test section (T2) of HENDEL SSPHEAT are described. In the parallelization part, the parallelization of cylindrical direct numerical simulation code CYLDNS44N, worldwide version of system for prediction of environmental emergency dose information code WSPEEDI, extension of quantum molecular dynamics code EQMD and three-dimensional non-steady compressible fluid dynamics code STREAM are described. In the porting part, the porting of transient reactor analysis code TRAC-BF1 and Monte Carlo radiation transport code MCNP4A on the AP3000 are described. In addition, a modification of program libraries for command-driven interactive data analysis plotting program IPLOT is described. (author)
International Nuclear Information System (INIS)
Ishizuki, Shigeru; Nemoto, Toshiyuki; Kawai, Wataru; Watanabe, Hideo; Tanabe, Hidenobu; Kawasaki, Nobuo; Adachi, Masaaki; Ogasawara, Shinobu; Kume, Etsuo
1999-05-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system and/or the AP3000 system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. We dealt with 14 codes in fiscal 1997. These results are reported in 3 parts, i.e., the vectorization part, the parallelization part and the porting part. In this report, we describe the porting. In this porting part, the porting of transient reactor analysis code TRAC-BF1 and Monte Carlo radiation transport code MCNP4A on the AP3000 are described. In addition, a modification of program libraries for command-driven interactive data analysis plotting program IPLOT is described. In the vectorization part, the vectorization of multidimensional two-fluid model code ACE-3D for evaluation of constitutive equations, statistical decay code SD and three-dimensional thermal analysis code for in-core test section (T2) of HENDEL SSPHEAT are described. In the parallelization part, the parallelization of cylindrical direct numerical simulation code CYLDNS44N, worldwide version of system for prediction of environmental emergency dose information code WSPEEDI, extension of quantum molecular dynamics code EQMD and three-dimensional non-steady compressible fluid dynamics code STREAM are described. (author)
Energy Technology Data Exchange (ETDEWEB)
Nemoto, Toshiyuki; Kawai, Wataru [Fujitsu Ltd., Tokyo (Japan); Kawasaki, Nobuo [and others
1997-12-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. These results are reported in 3 parts, i.e., the vectorization part, the parallelization part and the porting part. In this report, we describe the vectorization. In this vectorization part, the vectorization of two and three dimensional discrete ordinates simulation code DORT-TORT, gas dynamics analysis code FLOWGR and relativistic Boltzmann-Uehling-Uhlenbeck simulation code RBUU are described. In the parallelization part, the parallelization of 2-Dimensional relativistic electromagnetic particle code EM2D, Cylindrical Direct Numerical Simulation code CYLDNS and molecular dynamics code for simulating radiation damages in diamond crystals DGR are described. And then, in the porting part, the porting of reactor safety analysis code RELAP5/MOD3.2 and RELAP5/MOD3.2.1.2, nuclear data processing system NJOY and 2-D multigroup discrete ordinate transport code TWOTRAN-II are described. And also, a survey for the porting of command-driven interactive data analysis plotting program IPLOT are described. (author)
Energy Technology Data Exchange (ETDEWEB)
Nemoto, Toshiyuki [Fujitsu Ltd., Tokyo (Japan); Kawasaki, Nobuo; Tanabe, Hidenobu [and others
1998-01-01
Several computer codes in the nuclear field have been vectorized, parallelized and transported on the FUJITSU VPP500 system at Center for Promotion of Computational Science and Engineering in Japan Atomic Energy Research Institute. These results are reported in 3 parts, i.e., the vectorization part, the parallelization part and the porting part. In this report, we describe the porting. In this porting part, the porting of reactor safety analysis code RELAP5/MOD3.2 and RELAP5/MOD3.2.1.2, nuclear data processing system NJOY and 2-D multigroup discrete ordinate transport code TWOTRAN-II are described. And also, a survey for the porting of command-driven interactive data analysis plotting program IPLOT are described. In the parallelization part, the parallelization of 2-Dimensional relativistic electromagnetic particle code EM2D, Cylindrical Direct Numerical Simulation code CYLDNS and molecular dynamics code for simulating radiation damages in diamond crystals DGR are described. And then, in the vectorization part, the vectorization of two and three dimensional discrete ordinates simulation code DORT-TORT, gas dynamics analysis code FLOWGR and relativistic Boltzmann-Uehling-Uhlenbeck simulation code RBUU are described. (author)
Chen, Jian; Matuttis, Hans-Georg
2013-02-01
We report our experiences with the optimization and parallelization of a discrete element code for convex polyhedra on multi-core machines and introduce a novel variant of the sort-and-sweep neighborhood algorithm. While in theory the whole code in itself parallelizes ideally, in practice the results on different architectures with different compilers and performance measurement tools depend very much on the particle number and optimization of the code. After difficulties with the interpretation of the data for speedup and efficiency are overcome, respectable parallelization speedups could be obtained.
Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis.
Directory of Open Access Journals (Sweden)
Francesca Cordero
Full Text Available BACKGROUND: Massive Parallel Sequencing methods (MPS can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and short non-coding RNAs, e.g. miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated from short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as part of a complex analytical pipe-line have not yet been well investigated. PRIMARY FINDINGS: A benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked in biological replication experiments was assembled by merging a publicly available MPS spike-in miRNAs data set with MPS data derived from healthy donor peripheral blood mononuclear cells. Using this data set we observed that short reads counts estimation is strongly under estimated in case of duplicates miRNAs, if whole genome is used as reference. Furthermore, the sensitivity of miRNAs detection is strongly dependent by the primary tool used in the analysis. Within the six aligners tested, specifically devoted to miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient. Within the five tools investigated, two of them (DESseq, baySeq show a very good specificity and sensitivity in the detection of differential expression. CONCLUSIONS: The results provided by our analysis allow the definition of a clear and simple analytical optimized workflow for miRNAs digital quantitative analysis.
Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis.
Cordero, Francesca; Beccuti, Marco; Arigoni, Maddalena; Donatelli, Susanna; Calogero, Raffaele A
2012-01-01
Massive Parallel Sequencing methods (MPS) can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and short non-coding RNAs, e.g. miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated from short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as part of a complex analytical pipe-line have not yet been well investigated. A benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked in biological replication experiments was assembled by merging a publicly available MPS spike-in miRNAs data set with MPS data derived from healthy donor peripheral blood mononuclear cells. Using this data set we observed that short reads counts estimation is strongly under estimated in case of duplicates miRNAs, if whole genome is used as reference. Furthermore, the sensitivity of miRNAs detection is strongly dependent by the primary tool used in the analysis. Within the six aligners tested, specifically devoted to miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient. Within the five tools investigated, two of them (DESseq, baySeq) show a very good specificity and sensitivity in the detection of differential expression. The results provided by our analysis allow the definition of a clear and simple analytical optimized workflow for miRNAs digital quantitative analysis.
International Nuclear Information System (INIS)
Masukawa, Fumihiro; Takano, Makoto; Naito, Yoshitaka; Yamazaki, Takao; Fujisaki, Masahide; Suzuki, Koichiro; Okuda, Motoi.
1993-11-01
In order to improve the accuracy and calculating speed of shielding analyses, MCNP 4, a Monte Carlo neutron and photon transport code system, has been parallelized and measured of its efficiency in the highly parallel distributed memory type computer, AP1000. The code has been analyzed statically and dynamically, then the suitable algorithm for parallelization has been determined for the shielding analysis functions of MCNP 4. This includes a strategy where a new history is assigned to the idling processor element dynamically during the execution. Furthermore, to avoid the congestion of communicative processing, the batch concept, processing multi-histories by a unit, has been introduced. By analyzing a sample cask problem with 2,000,000 histories by the AP1000 with 512 processor elements, the 82 % of parallelization efficiency is achieved, and the calculational speed has been estimated to be around 50 times as fast as that of FACOM M-780. (author)
Quantification of massively parallel sequencing libraries - a comparative study of eight methods
DEFF Research Database (Denmark)
Hussing, Christian; Kampmann, Marie-Louise; Mogensen, Helle Smidt
2018-01-01
Quantification of massively parallel sequencing libraries is important for acquisition of monoclonal beads or clusters prior to clonal amplification and to avoid large variations in library coverage when multiple samples are included in one sequencing analysis. No gold standard for quantification...... estimates followed by Qubit and electrophoresis-based instruments (Bioanalyzer, TapeStation, GX Touch, and Fragment Analyzer), while SYBR Green and TaqMan based qPCR assays gave the lowest estimates. qPCR gave more accurate predictions of sequencing coverage than Qubit and TapeStation did. Costs, time......-consumption, workflow simplicity, and ability to quantify multiple samples are discussed. Technical specifications, advantages, and disadvantages of the various methods are pointed out....
Massive parallel sequencing in sarcoma pathobiology: state of the art and perspectives.
Brenca, Monica; Maestro, Roberta
2015-01-01
Sarcomas are an aggressive and highly heterogeneous group of mesenchymal malignancies with different morphologies and clinical behavior. Current therapeutic strategies remain unsatisfactory. Cytogenetic and molecular characterization of these tumors is resulting in the breakdown of the classical histopathological categories into molecular subgroups that better define sarcoma pathobiology and pave the way to more precise diagnostic criteria and novel therapeutic opportunities. The purpose of this short review is to summarize the state-of-the-art on the exploitation of massive parallel sequencing technologies, also known as next generation sequencing, in the elucidation of sarcoma pathobiology and to discuss how these applications may impact on diagnosis, prognosis and therapy of these tumors.
A PARALLEL MONTE CARLO CODE FOR SIMULATING COLLISIONAL N-BODY SYSTEMS
International Nuclear Information System (INIS)
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A.
2013-01-01
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N ∼ 10 7 particles. Our code is based on the Hénon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 10 5 to 10 7 . We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within ∼ 5 , 128 for N = 10 6 and 256 for N = 10 7 . The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60×, 100×, and 220×, respectively.
Application of parallel computing techniques to a large-scale reservoir simulation
International Nuclear Information System (INIS)
Zhang, Keni; Wu, Yu-Shu; Ding, Chris; Pruess, Karsten
2001-01-01
Even with the continual advances made in both computational algorithms and computer hardware used in reservoir modeling studies, large-scale simulation of fluid and heat flow in heterogeneous reservoirs remains a challenge. The problem commonly arises from intensive computational requirement for detailed modeling investigations of real-world reservoirs. This paper presents the application of a massive parallel-computing version of the TOUGH2 code developed for performing large-scale field simulations. As an application example, the parallelized TOUGH2 code is applied to develop a three-dimensional unsaturated-zone numerical model simulating flow of moisture, gas, and heat in the unsaturated zone of Yucca Mountain, Nevada, a potential repository for high-level radioactive waste. The modeling approach employs refined spatial discretization to represent the heterogeneous fractured tuffs of the system, using more than a million 3-D gridblocks. The problem of two-phase flow and heat transfer within the model domain leads to a total of 3,226,566 linear equations to be solved per Newton iteration. The simulation is conducted on a Cray T3E-900, a distributed-memory massively parallel computer. Simulation results indicate that the parallel computing technique, as implemented in the TOUGH2 code, is very efficient. The reliability and accuracy of the model results have been demonstrated by comparing them to those of small-scale (coarse-grid) models. These comparisons show that simulation results obtained with the refined grid provide more detailed predictions of the future flow conditions at the site, aiding in the assessment of proposed repository performance
Directory of Open Access Journals (Sweden)
Abdeslam Mehenni
2017-03-01
Full Text Available As our populations grow in a world of limited resources enterprise seek ways to lighten our load on the planet. The idea of modifying consumer behavior appears as a foundation for smart grids. Enterprise demonstrates the value available from deep analysis of electricity consummation histories, consumers’ messages, and outage alerts, etc. Enterprise mines massive structured and unstructured data. In a nutshell, smart grids result in a flood of data that needs to be analyzed, for better adjust to demand and give customers more ability to delve into their power consumption. Simply put, smart grids will increasingly have a flexible data warehouse attached to them. The key driver for the adoption of data management strategies is clearly the need to handle and analyze the large amounts of information utilities are now faced with. New approaches to data integration are nauseating moment; Hadoop is in fact now being used by the utility to help manage the huge growth in data whilst maintaining coherence of the Data Warehouse. In this paper we define a new Meter Data Management System Architecture repository that differ with three leaders MDMS, where we use MapReduce programming model for ETL and Parallel DBMS in Query statements(Massive Parallel Processing MPP.
On the adequacy of message-passing parallel supercomputers for solving neutron transport problems
International Nuclear Information System (INIS)
Azmy, Y.Y.
1990-01-01
A coarse-grained, static-scheduling parallelization of the standard iterative scheme used for solving the discrete-ordinates approximation of the neutron transport equation is described. The parallel algorithm is based on a decomposition of the angular domain along the discrete ordinates, thus naturally producing a set of completely uncoupled systems of equations in each iteration. Implementation of the parallel code on Intcl's iPSC/2 hypercube, and solutions to test problems are presented as evidence of the high speedup and efficiency of the parallel code. The performance of the parallel code on the iPSC/2 is analyzed, and a model for the CPU time as a function of the problem size (order of angular quadrature) and the number of participating processors is developed and validated against measured CPU times. The performance model is used to speculate on the potential of massively parallel computers for significantly speeding up real-life transport calculations at acceptable efficiencies. We conclude that parallel computers with a few hundred processors are capable of producing large speedups at very high efficiencies in very large three-dimensional problems. 10 refs., 8 figs
Araujo, Luiz H.; Timmers, Cynthia; Bell, Erica Hlavin; Shilo, Konstantin; Lammers, Philip E.; Zhao, Weiqiang; Natarajan, Thanemozhi G.; Miller, Clinton J.; Zhang, Jianying; Yilmaz, Ayse S.; Liu, Tom; Coombes, Kevin; Amann, Joseph; Carbone, David P.
2015-01-01
Purpose Technologic advances have enabled the comprehensive analysis of genetic perturbations in non–small-cell lung cancer (NSCLC); however, African Americans have often been underrepresented in these studies. This ethnic group has higher lung cancer incidence and mortality rates, and some studies have suggested a lower incidence of epidermal growth factor receptor mutations. Herein, we report the most in-depth molecular profile of NSCLC in African Americans to date. Methods A custom panel was designed to cover the coding regions of 81 NSCLC-related genes and 40 ancestry-informative markers. Clinical samples were sequenced on a massively parallel sequencing instrument, and anaplastic lymphoma kinase translocation was evaluated by fluorescent in situ hybridization. Results The study cohort included 99 patients (61% males, 94% smokers) comprising 31 squamous and 68 nonsquamous cell carcinomas. We detected 227 nonsilent variants in the coding sequence, including 24 samples with nonoverlapping, classic driver alterations. The frequency of driver mutations was not significantly different from that of whites, and no association was found between genetic ancestry and the presence of somatic mutations. Copy number alteration analysis disclosed distinguishable amplifications in the 3q chromosome arm in squamous cell carcinomas and pointed toward a handful of targetable alterations. We also found frequent SMARCA4 mutations and protein loss, mostly in driver-negative tumors. Conclusion Our data suggest that African American ancestry may not be significantly different from European/white background for the presence of somatic driver mutations in NSCLC. Furthermore, we demonstrated that using a comprehensive genotyping approach could identify numerous targetable alterations, with potential impact on therapeutic decisions. PMID:25918285
Newman, Gregory A.; Commer, Michael
2009-07-01
Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/L supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.
International Nuclear Information System (INIS)
Newman, Gregory A; Commer, Michael
2009-01-01
Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/L supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.
International Nuclear Information System (INIS)
Moravie, Philippe
1997-01-01
Today, in the digitized satellite image domain, the needs for high dimension increase considerably. To transmit or to stock such images (more than 6000 by 6000 pixels), we need to reduce their data volume and so we have to use real-time image compression techniques. The large amount of computations required by image compression algorithms prohibits the use of common sequential processors, for the benefits of parallel computers. The study presented here deals with parallelization of a very efficient image compression scheme, based on three techniques: Wavelets Transform (WT), Vector Quantization (VQ) and Entropic Coding (EC). First, we studied and implemented the parallelism of each algorithm, in order to determine the architectural characteristics needed for real-time image compression. Then, we defined eight parallel architectures: 3 for Mallat algorithm (WT), 3 for Tree-Structured Vector Quantization (VQ) and 2 for Huffman Coding (EC). As our system has to be multi-purpose, we chose 3 global architectures between all of the 3x3x2 systems available. Because, for technological reasons, real-time is not reached at anytime (for all the compression parameter combinations), we also defined and evaluated two algorithmic optimizations: fix point precision and merging entropic coding in vector quantization. As a result, we defined a new multi-purpose multi-SMIMD parallel machine, able to compress digitized satellite image in real-time. The definition of the best suited architecture for real-time image compression was answered by presenting 3 parallel machines among which one multi-purpose, embedded and which might be used for other applications on board. (author) [fr
Massively parallel simulator of optical coherence tomography of inhomogeneous turbid media.
Malektaji, Siavash; Lima, Ivan T; Escobar I, Mauricio R; Sherif, Sherif S
2017-10-01
An accurate and practical simulator for Optical Coherence Tomography (OCT) could be an important tool to study the underlying physical phenomena in OCT such as multiple light scattering. Recently, many researchers have investigated simulation of OCT of turbid media, e.g., tissue, using Monte Carlo methods. The main drawback of these earlier simulators is the long computational time required to produce accurate results. We developed a massively parallel simulator of OCT of inhomogeneous turbid media that obtains both Class I diffusive reflectivity, due to ballistic and quasi-ballistic scattered photons, and Class II diffusive reflectivity due to multiply scattered photons. This Monte Carlo-based simulator is implemented on graphic processing units (GPUs), using the Compute Unified Device Architecture (CUDA) platform and programming model, to exploit the parallel nature of propagation of photons in tissue. It models an arbitrary shaped sample medium as a tetrahedron-based mesh and uses an advanced importance sampling scheme. This new simulator speeds up simulations of OCT of inhomogeneous turbid media by about two orders of magnitude. To demonstrate this result, we have compared the computation times of our new parallel simulator and its serial counterpart using two samples of inhomogeneous turbid media. We have shown that our parallel implementation reduced simulation time of OCT of the first sample medium from 407 min to 92 min by using a single GPU card, to 12 min by using 8 GPU cards and to 7 min by using 16 GPU cards. For the second sample medium, the OCT simulation time was reduced from 209 h to 35.6 h by using a single GPU card, and to 4.65 h by using 8 GPU cards, and to only 2 h by using 16 GPU cards. Therefore our new parallel simulator is considerably more practical to use than its central processing unit (CPU)-based counterpart. Our new parallel OCT simulator could be a practical tool to study the different physical phenomena underlying OCT
Parallel Implicit Algorithms for CFD
Keyes, David E.
1998-01-01
The main goal of this project was efficient distributed parallel and workstation cluster implementations of Newton-Krylov-Schwarz (NKS) solvers for implicit Computational Fluid Dynamics (CFD.) "Newton" refers to a quadratically convergent nonlinear iteration using gradient information based on the true residual, "Krylov" to an inner linear iteration that accesses the Jacobian matrix only through highly parallelizable sparse matrix-vector products, and "Schwarz" to a domain decomposition form of preconditioning the inner Krylov iterations with primarily neighbor-only exchange of data between the processors. Prior experience has established that Newton-Krylov methods are competitive solvers in the CFD context and that Krylov-Schwarz methods port well to distributed memory computers. The combination of the techniques into Newton-Krylov-Schwarz was implemented on 2D and 3D unstructured Euler codes on the parallel testbeds that used to be at LaRC and on several other parallel computers operated by other agencies or made available by the vendors. Early implementations were made directly in Massively Parallel Integration (MPI) with parallel solvers we adapted from legacy NASA codes and enhanced for full NKS functionality. Later implementations were made in the framework of the PETSC library from Argonne National Laboratory, which now includes pseudo-transient continuation Newton-Krylov-Schwarz solver capability (as a result of demands we made upon PETSC during our early porting experiences). A secondary project pursued with funding from this contract was parallel implicit solvers in acoustics, specifically in the Helmholtz formulation. A 2D acoustic inverse problem has been solved in parallel within the PETSC framework.
Directory of Open Access Journals (Sweden)
Paolo Cazzaniga
2014-01-01
high-performance computing solutions is motivated by the need of performing large numbers of in silico analysis to study the behavior of biological systems in different conditions, which necessitate a computing power that usually overtakes the capability of standard desktop computers. In this work we present coagSODA, a CUDA-powered computational tool that was purposely developed for the analysis of a large mechanistic model of the blood coagulation cascade (BCC, defined according to both mass-action kinetics and Hill functions. coagSODA allows the execution of parallel simulations of the dynamics of the BCC by automatically deriving the system of ordinary differential equations and then exploiting the numerical integration algorithm LSODA. We present the biological results achieved with a massive exploration of perturbed conditions of the BCC, carried out with one-dimensional and bi-dimensional parameter sweep analysis, and show that GPU-accelerated parallel simulations of this model can increase the computational performances up to a 181× speedup compared to the corresponding sequential simulations.
Beam dynamics calculations and particle tracking using massively parallel processors
International Nuclear Information System (INIS)
Ryne, R.D.; Habib, S.
1995-01-01
During the past decade massively parallel processors (MPPs) have slowly gained acceptance within the scientific community. At present these machines typically contain a few hundred to one thousand off-the-shelf microprocessors and a total memory of up to 32 GBytes. The potential performance of these machines is illustrated by the fact that a month long job on a high end workstation might require only a few hours on an MPP. The acceptance of MPPs has been slow for a variety of reasons. For example, some algorithms are not easily parallelizable. Also, in the past these machines were difficult to program. But in recent years the development of Fortran-like languages such as CM Fortran and High Performance Fortran have made MPPs much easier to use. In the following we will describe how MPPs can be used for beam dynamics calculations and long term particle tracking
Object-Oriented Parallel Particle-in-Cell Code for Beam Dynamics Simulation in Linear Accelerators
International Nuclear Information System (INIS)
Qiang, J.; Ryne, R.D.; Habib, S.; Decky, V.
1999-01-01
In this paper, we present an object-oriented three-dimensional parallel particle-in-cell code for beam dynamics simulation in linear accelerators. A two-dimensional parallel domain decomposition approach is employed within a message passing programming paradigm along with a dynamic load balancing. Implementing object-oriented software design provides the code with better maintainability, reusability, and extensibility compared with conventional structure based code. This also helps to encapsulate the details of communications syntax. Performance tests on SGI/Cray T3E-900 and SGI Origin 2000 machines show good scalability of the object-oriented code. Some important features of this code also include employing symplectic integration with linear maps of external focusing elements and using z as the independent variable, typical in accelerators. A successful application was done to simulate beam transport through three superconducting sections in the APT linac design
Guermond, Jean-Luc; Minev, Peter D.; Salgado, Abner J.
2012-01-01
We provide a convergence analysis for a new fractional timestepping technique for the incompressible Navier-Stokes equations based on direction splitting. This new technique is of linear complexity, unconditionally stable and convergent, and suitable for massive parallelization. © 2012 American Mathematical Society.
Development Of A Parallel Performance Model For The THOR Neutral Particle Transport Code
Energy Technology Data Exchange (ETDEWEB)
Yessayan, Raffi; Azmy, Yousry; Schunert, Sebastian
2017-02-01
The THOR neutral particle transport code enables simulation of complex geometries for various problems from reactor simulations to nuclear non-proliferation. It is undergoing a thorough V&V requiring computational efficiency. This has motivated various improvements including angular parallelization, outer iteration acceleration, and development of peripheral tools. For guiding future improvements to the code’s efficiency, better characterization of its parallel performance is useful. A parallel performance model (PPM) can be used to evaluate the benefits of modifications and to identify performance bottlenecks. Using INL’s Falcon HPC, the PPM development incorporates an evaluation of network communication behavior over heterogeneous links and a functional characterization of the per-cell/angle/group runtime of each major code component. After evaluating several possible sources of variability, this resulted in a communication model and a parallel portion model. The former’s accuracy is bounded by the variability of communication on Falcon while the latter has an error on the order of 1%.
A scalable approach to modeling groundwater flow on massively parallel computers
International Nuclear Information System (INIS)
Ashby, S.F.; Falgout, R.D.; Tompson, A.F.B.
1995-12-01
We describe a fully scalable approach to the simulation of groundwater flow on a hierarchy of computing platforms, ranging from workstations to massively parallel computers. Specifically, we advocate the use of scalable conceptual models in which the subsurface model is defined independently of the computational grid on which the simulation takes place. We also describe a scalable multigrid algorithm for computing the groundwater flow velocities. We axe thus able to leverage both the engineer's time spent developing the conceptual model and the computing resources used in the numerical simulation. We have successfully employed this approach at the LLNL site, where we have run simulations ranging in size from just a few thousand spatial zones (on workstations) to more than eight million spatial zones (on the CRAY T3D)-all using the same conceptual model
Tiling as a Durable Abstraction for Parallelism and Data Locality
Energy Technology Data Exchange (ETDEWEB)
Unat, Didem [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Chan, Cy P. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Zhang, Weiqun [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Bell, John [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Shalf, John [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
2013-11-18
Tiling is a useful loop transformation for expressing parallelism and data locality. Automated tiling transformations that preserve data-locality are increasingly important due to hardware trends towards massive parallelism and the increasing costs of data movement relative to the cost of computing. We propose TiDA as a durable tiling abstraction that centralizes parameterized tiling information within array data types with minimal changes to the source code. The data layout information can be used by the compiler and runtime to automatically manage parallelism, optimize data locality, and schedule tasks intelligently. In this study, we present the design features and early interface of TiDA along with some preliminary results.
DEFF Research Database (Denmark)
Eduardoff, M; Gross, T E; Santos, C
2016-01-01
Seq™ PCR primers was designed for the Global AIM-SNPs to perform massively parallel sequencing using the Ion PGM™ system. This study assessed individual SNP genotyping precision using the Ion PGM™, the forensic sensitivity of the multiplex using dilution series, degraded DNA plus simple mixtures...
DEFF Research Database (Denmark)
Kielpinski, Lukasz J; Boyd, Mette; Sandelin, Albin
2013-01-01
Detection of reverse transcriptase termination sites is important in many different applications, such as structural probing of RNAs, rapid amplification of cDNA 5' ends (5' RACE), cap analysis of gene expression, and detection of RNA modifications and protein-RNA cross-links. The throughput...... of these methods can be increased by applying massive parallel sequencing technologies.Here, we describe a versatile method for detection of reverse transcriptase termination sites based on ligation of an adapter to the 3' end of cDNA with bacteriophage TS2126 RNA ligase (CircLigase™). In the following PCR...
Energy Technology Data Exchange (ETDEWEB)
Kostin, Mikhail [Michigan State Univ., East Lansing, MI (United States); Mokhov, Nikolai [Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States); Niita, Koji [Research Organization for Information Science and Technology, Ibaraki-ken (Japan)
2013-09-25
A parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran77, Fortran 90 or C. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA and MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
Recent Improvements to the IMPACT-T Parallel Particle Tracking Code
International Nuclear Information System (INIS)
Qiang, J.; Pogorelov, I.V.; Ryne, R.
2006-01-01
The IMPACT-T code is a parallel three-dimensional quasi-static beam dynamics code for modeling high brightness beams in photoinjectors and RF linacs. Developed under the US DOE Scientific Discovery through Advanced Computing (SciDAC) program, it includes several key features including a self-consistent calculation of 3D space-charge forces using a shifted and integrated Green function method, multiple energy bins for beams with large energy spread, and models for treating RF standing wave and traveling wave structures. In this paper, we report on recent improvements to the IMPACT-T code including modeling traveling wave structures, short-range transverse and longitudinal wakefields, and longitudinal coherent synchrotron radiation through bending magnets
Energy Technology Data Exchange (ETDEWEB)
Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarria-Miranda, Daniel
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centralityof massive small-world networks. With minor changes to the data structures, ouralgorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
I - Template Metaprogramming for Massively Parallel Scientific Computing - Expression Templates
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels ranging from the fomulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tend to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present he basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show to to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show to to apply these methods i...
Parallel treatment of simulation particles in particle-in-cell codes on SUPRENUM
International Nuclear Information System (INIS)
Seldner, D.
1990-02-01
This report contains the program documentation and description of the program package 2D-PLAS, which has been developed at the Nuclear Research Center Karlsruhe in the Institute for Data Processing in Technology (IDT) under the auspices of the BMFT. 2D-PLAS is a parallel program version of the treatment of the simulation particles of the two-dimensional stationary particle-in-cell code BFCPIC which has been developed at the Nuclear Research Center Karlsruhe. This parallel version has been designed for the parallel computer SUPRENUM. (orig.) [de
High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures
Directory of Open Access Journals (Sweden)
H. Y. Su
2012-04-01
Full Text Available This article presents two high-efficient parallel realizations of the context-based adaptive variable length coding (CAVLC based on heterogeneous multicore processors. By optimizing the architecture of the CAVLC encoder, three kinds of dependences are eliminated or weaken, including the context-based data dependence, the memory accessing dependence and the control dependence. The CAVLC pipeline is divided into three stages: two scans, coding, and lag packing, and be implemented on two typical heterogeneous multicore architectures. One is a block-based SIMD parallel CAVLC encoder on multicore stream processor STORM. The other is a component-oriented SIMT parallel encoder on massively parallel architecture GPU. Both of them exploited rich data-level parallelism. Experiments results show that compared with the CPU version, more than 70 times of speedup can be obtained for STORM and over 50 times for GPU. The implementation of encoder on STORM can make a real-time processing for 1080p @30fps and GPU-based version can satisfy the requirements for 720p real-time encoding. The throughput of the presented CAVLC encoders is more than 10 times higher than that of published software encoders on DSP and multicore platforms.
A parallel solution for high resolution histological image analysis.
Bueno, G; González, R; Déniz, O; García-Rojo, M; González-García, J; Fernández-Carrobles, M M; Vállez, N; Salido, J
2012-10-01
This paper describes a general methodology for developing parallel image processing algorithms based on message passing for high resolution images (on the order of several Gigabytes). These algorithms have been applied to histological images and must be executed on massively parallel processing architectures. Advances in new technologies for complete slide digitalization in pathology have been combined with developments in biomedical informatics. However, the efficient use of these digital slide systems is still a challenge. The image processing that these slides are subject to is still limited both in terms of data processed and processing methods. The work presented here focuses on the need to design and develop parallel image processing tools capable of obtaining and analyzing the entire gamut of information included in digital slides. Tools have been developed to assist pathologists in image analysis and diagnosis, and they cover low and high-level image processing methods applied to histological images. Code portability, reusability and scalability have been tested by using the following parallel computing architectures: distributed memory with massive parallel processors and two networks, INFINIBAND and Myrinet, composed of 17 and 1024 nodes respectively. The parallel framework proposed is flexible, high performance solution and it shows that the efficient processing of digital microscopic images is possible and may offer important benefits to pathology laboratories. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Baumgärtel, M.; Ghanem, K.; Kiani, A.; Koch, E.; Pavarini, E.; Sims, H.; Zhang, G.
2017-07-01
We discuss the efficient implementation of general impurity solvers for dynamical mean-field theory. We show that both Lanczos and quantum Monte Carlo in different flavors (Hirsch-Fye, continuous-time hybridization- and interaction-expansion) exhibit excellent scaling on massively parallel supercomputers. We apply these algorithms to simulate realistic model Hamiltonians including the full Coulomb vertex, crystal-field splitting, and spin-orbit interaction. We discuss how to remove the sign problem in the presence of non-diagonal crystal-field and hybridization matrices. We show how to extract the physically observable quantities from imaginary time data, in particular correlation functions and susceptibilities. Finally, we present benchmarks and applications for representative correlated systems.
Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array
Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul
2008-04-01
This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
Coupling methods for parallel running RELAPSim codes in nuclear power plant simulation
Energy Technology Data Exchange (ETDEWEB)
Li, Yankai; Lin, Meng, E-mail: linmeng@sjtu.edu.cn; Yang, Yanhua
2016-02-15
When the plant is modeled detailedly for high precision, it is hard to achieve real-time calculation for one single RELAP5 in a large-scale simulation. To improve the speed and ensure the precision of simulation at the same time, coupling methods for parallel running RELAPSim codes were proposed in this study. Explicit coupling method via coupling boundaries was realized based on a data-exchange and procedure-control environment. Compromise of synchronization frequency was well considered to improve the precision of simulation and guarantee the real-time simulation at the same time. The coupling methods were assessed using both single-phase flow models and two-phase flow models and good agreements were obtained between the splitting–coupling models and the integrated model. The mitigation of SGTR was performed as an integral application of the coupling models. A large-scope NPP simulator was developed adopting six splitting–coupling models of RELAPSim and other simulation codes. The coupling models could improve the speed of simulation significantly and make it possible for real-time calculation. In this paper, the coupling of the models in the engineering simulator is taken as an example to expound the coupling methods, i.e., coupling between parallel running RELAPSim codes, and coupling between RELAPSim code and other types of simulation codes. However, the coupling methods are also referable in other simulator, for example, a simulator employing ATHLETE instead of RELAP5, other logic code instead of SIMULINK. It is believed the coupling method is commonly used for NPP simulator regardless of the specific codes chosen in this paper.
Directory of Open Access Journals (Sweden)
Rovini Massimo
2009-01-01
Full Text Available This is a reply to the comments by Gunnam et al. "Comments on 'Techniques and architectures for hazard-free semi-parallel decoding of LDPC codes'", EURASIP Journal on Embedded Systems, vol. 2009, Article ID 704174 on our recent work "Techniques and architectures for hazard-free semi-parallel decoding of LDPC codes", EURASIP Journal on Embedded Systems, vol. 2009, Article ID 723465.
A massively parallel strategy for STR marker development, capture, and genotyping.
Kistler, Logan; Johnson, Stephen M; Irwin, Mitchell T; Louis, Edward E; Ratan, Aakrosh; Perry, George H
2017-09-06
Short tandem repeat (STR) variants are highly polymorphic markers that facilitate powerful population genetic analyses. STRs are especially valuable in conservation and ecological genetic research, yielding detailed information on population structure and short-term demographic fluctuations. Massively parallel sequencing has not previously been leveraged for scalable, efficient STR recovery. Here, we present a pipeline for developing STR markers directly from high-throughput shotgun sequencing data without a reference genome, and an approach for highly parallel target STR recovery. We employed our approach to capture a panel of 5000 STRs from a test group of diademed sifakas (Propithecus diadema, n = 3), endangered Malagasy rainforest lemurs, and we report extremely efficient recovery of targeted loci-97.3-99.6% of STRs characterized with ≥10x non-redundant sequence coverage. We then tested our STR capture strategy on P. diadema fecal DNA, and report robust initial results and suggestions for future implementations. In addition to STR targets, this approach also generates large, genome-wide single nucleotide polymorphism (SNP) panels from flanking regions. Our method provides a cost-effective and scalable solution for rapid recovery of large STR and SNP datasets in any species without needing a reference genome, and can be used even with suboptimal DNA more easily acquired in conservation and ecological studies. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.
The vector and parallel processing of MORSE code on Monte Carlo Machine
International Nuclear Information System (INIS)
Hasegawa, Yukihiro; Higuchi, Kenji.
1995-11-01
Multi-group Monte Carlo Code for particle transport, MORSE is modified for high performance computing on Monte Carlo Machine Monte-4. The method and the results are described. Monte-4 was specially developed to realize high performance computing of Monte Carlo codes for particle transport, which have been difficult to obtain high performance in vector processing on conventional vector processors. Monte-4 has four vector processor units with the special hardware called Monte Carlo pipelines. The vectorization and parallelization of MORSE code and the performance evaluation on Monte-4 are described. (author)
Recent development for the ITS code system: Parallel processing and visualization
International Nuclear Information System (INIS)
Fan, W.C.; Turner, C.D.; Halbleib, J.A. Sr.; Kensek, R.P.
1996-01-01
A brief overview is given for two software developments related to the ITS code system. These developments provide parallel processing and visualization capabilities and thus allow users to perform ITS calculations more efficiently. Timing results and a graphical example are presented to demonstrate these capabilities
Portable, parallel, reusable Krylov space codes
Energy Technology Data Exchange (ETDEWEB)
Smith, B.; Gropp, W. [Argonne National Lab., IL (United States)
1994-12-31
Krylov space accelerators are an important component of many algorithms for the iterative solution of linear systems. Each Krylov space method has it`s own particular advantages and disadvantages, therefore it is desirable to have a variety of them available all with an identical, easy to use, interface. A common complaint application programmers have with available software libraries for the iterative solution of linear systems is that they require the programmer to use the data structures provided by the library. The library is not able to work with the data structures of the application code. Hence, application programmers find themselves constantly recoding the Krlov space algorithms. The Krylov space package (KSP) is a data-structure-neutral implementation of a variety of Krylov space methods including preconditioned conjugate gradient, GMRES, BiCG-Stab, transpose free QMR and CGS. Unlike all other software libraries for linear systems that the authors are aware of, KSP will work with any application codes data structures, in Fortran or C. Due to it`s data-structure-neutral design KSP runs unchanged on both sequential and parallel machines. KSP has been tested on workstations, the Intel i860 and Paragon, Thinking Machines CM-5 and the IBM SP1.
Ocean circulation code on machine connection
International Nuclear Information System (INIS)
Vitart, F.
1993-01-01
This work is part of a development of a global climate model based on a coupling between an ocean model and an atmosphere model. The objective was to develop this global model on a massively parallel machine (CM2). The author presents the OPA7 code (equations, boundary conditions, equation system resolution) and parallelization on the CM2 machine. CM2 data structure is briefly evoked, and two tests are reported (on a flat bottom basin, and a topography with eight islands). The author then gives an overview of studies aimed at improving the ocean circulation code: use of a new state equation, use of a formulation of surface pressure, use of a new mesh. He reports the study of the use of multi-block domains on CM2 through advection tests, and two-block tests
Differences Between Distributed and Parallel Systems
Energy Technology Data Exchange (ETDEWEB)
Brightwell, R.; Maccabe, A.B.; Rissen, R.
1998-10-01
Distributed systems have been studied for twenty years and are now coming into wider use as fast networks and powerful workstations become more readily available. In many respects a massively parallel computer resembles a network of workstations and it is tempting to port a distributed operating system to such a machine. However, there are significant differences between these two environments and a parallel operating system is needed to get the best performance out of a massively parallel system. This report characterizes the differences between distributed systems, networks of workstations, and massively parallel systems and analyzes the impact of these differences on operating system design. In the second part of the report, we introduce Puma, an operating system specifically developed for massively parallel systems. We describe Puma portals, the basic building blocks for message passing paradigms implemented on top of Puma, and show how the differences observed in the first part of the report have influenced the design and implementation of Puma.
Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes
Directory of Open Access Journals (Sweden)
Rovini Massimo
2009-01-01
Full Text Available The layered decoding algorithm has recently been proposed as an efficient means for the decoding of low-density parity-check (LDPC codes, thanks to the remarkable improvement in the convergence speed (2x of the decoding process. However, pipelined semi-parallel decoders suffer from violations or "hazards" between consecutive updates, which not only violate the layered principle but also enforce the loops in the code, thus spoiling the error correction performance. This paper describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of "idle" cycles, to prevent the hazards of the pipeline mechanism. Also, different semi-parallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Then, taking the LDPC codes for the wireless local area network (IEEE 802.11n as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.
Efficient numerical methods for fluid- and electrodynamics on massively parallel systems
Energy Technology Data Exchange (ETDEWEB)
Zudrop, Jens
2016-07-01
In the last decade, computer technology has evolved rapidly. Modern high performance computing systems offer a tremendous amount of computing power in the range of a few peta floating point operations per second. In contrast, numerical software development is much slower and most existing simulation codes cannot exploit the full computing power of these systems. Partially, this is due to the numerical methods themselves and partially it is related to bottlenecks within the parallelization concept and its data structures. The goal of the thesis is the development of numerical algorithms and corresponding data structures to remedy both kinds of parallelization bottlenecks. The approach is based on a co-design of the numerical schemes (including numerical analysis) and their realizations in algorithms and software. Various kinds of applications, from multicomponent flows (Lattice Boltzmann Method) to electrodynamics (Discontinuous Galerkin Method) to embedded geometries (Octree), are considered and efficiency of the developed approaches is demonstrated for large scale simulations.
Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David
1987-01-01
The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.
International Nuclear Information System (INIS)
Bean, J.E.; Sanchez, M.; Arguello, J.G.
2012-01-01
Document available in extended abstract form only. Because, until recently, U.S. efforts had been focused on the volcanic tuff site at Yucca Mountain, radioactive waste disposal in U.S. clay/shale formations has not been considered for many years. However, advances in multi-physics computational modeling and research into clay mineralogy continue to improve the scientific basis for assessing nuclear waste repository performance in such formations. Disposal of high-level radioactive waste (HLW) in suitable clay/shale formations is attractive because the material is essentially impermeable and self-sealing, conditions are chemically reducing, and sorption tends to prevent radionuclide transport. Vertically and laterally extensive shale and clay formations exist in multiple locations in the contiguous 48 states. This paper describes an emerging massively parallel (MP) high performance computing (HPC) capability - SIERRA Mechanics - that is applicable to the simulation of coupled-physics processes occurring within a potential clay/shale repository for disposal of HLW within the U.S. The SIERRA Mechanics code development project has been underway at Sandia National Laboratories for approximately the past decade under the auspices of the U.S. Department of Energy's Advanced Scientific Computing (ASC) program. SIERRA Mechanics was designed and developed from its inception to run on the latest and most sophisticated massively parallel computing hardware, with the capability to span the hardware range from single workstations to systems with thousands of processors. The foundation of SIERRA Mechanics is the SIERRA tool-kit, which provides finite element application-code services such as: (1) mesh and field data management, both parallel and distributed; (2) transfer operators for mapping field variables from one mechanics application to another; (3) a solution controller for code coupling; and (4) included third party libraries (e.g., solver libraries, communications
A parallel code named NEPTUNE for 3D fully electromagnetic and pic simulations
International Nuclear Information System (INIS)
Dong Ye; Yang Wenyuan; Chen Jun; Zhao Qiang; Xia Fang; Ma Yan; Xiao Li; Sun Huifang; Chen Hong; Zhou Haijing; Mao Zeyao; Dong Zhiwei
2010-01-01
A parallel code named NEPTUNE for 3D fully electromagnetic and particle-in-cell (PIC) simulations is introduced, which could run on the Linux system with hundreds to thousand CPUs. NEPTUNE is suitable to simulate entire 3D HPM devices; many HPM devices are simulated and designed by using it. In NEPTUNE code, the electromagnetic fields are updated by using the finite-difference in time domain (FDTD) method of solving Maxwell equations and the particles are moved by using Buneman-Boris advance method of solving relativistic Newton-Lorentz equation. Electromagnetic fields and particles are coupled by using liner weighing interpolation PIC method, and the electric filed components are corrected by using Boris method of solve Poisson equation in order to ensure charge-conservation. NEPTUNE code could construct many complicated geometric structures, such as arbitrary axial-symmetric structures, plane transforming structures, slow-wave-structures, coupling holes, foils, and so on. The boundary conditions used in NEPTUNE code are introduced in brief, including perfectly electric conductor boundary, external wave boundary, and particle boundary. Finally, some typical HPM devices are simulated and test by using NEPTUNE code, including MILO, RBWO, VCO, and RKA. The simulation results are with correct and credible physical images, and the parallel efficiencies are also given. (authors)
Keppenne, Christian L.; Rienecker, Michele; Borovikov, Anna Y.; Suarez, Max
1999-01-01
A massively parallel ensemble Kalman filter (EnKF)is used to assimilate temperature data from the TOGA/TAO array and altimetry from TOPEX/POSEIDON into a Pacific basin version of the NASA Seasonal to Interannual Prediction Project (NSIPP)ls quasi-isopycnal ocean general circulation model. The EnKF is an approximate Kalman filter in which the error-covariance propagation step is modeled by the integration of multiple instances of a numerical model. An estimate of the true error covariances is then inferred from the distribution of the ensemble of model state vectors. This inplementation of the filter takes advantage of the inherent parallelism in the EnKF algorithm by running all the model instances concurrently. The Kalman filter update step also occurs in parallel by having each processor process the observations that occur in the region of physical space for which it is responsible. The massively parallel data assimilation system is validated by withholding some of the data and then quantifying the extent to which the withheld information can be inferred from the assimilation of the remaining data. The distributions of the forecast and analysis error covariances predicted by the ENKF are also examined.
High performance parallel computing of flows in complex geometries: I. Methods
International Nuclear Information System (INIS)
Gourdain, N; Gicquel, L; Montagnac, M; Vermorel, O; Staffelbach, G; Garcia, M; Boussuge, J-F; Gazaix, M; Poinsot, T
2009-01-01
Efficient numerical tools coupled with high-performance computers, have become a key element of the design process in the fields of energy supply and transportation. However flow phenomena that occur in complex systems such as gas turbines and aircrafts are still not understood mainly because of the models that are needed. In fact, most computational fluid dynamics (CFD) predictions as found today in industry focus on a reduced or simplified version of the real system (such as a periodic sector) and are usually solved with a steady-state assumption. This paper shows how to overcome such barriers and how such a new challenge can be addressed by developing flow solvers running on high-end computing platforms, using thousands of computing cores. Parallel strategies used by modern flow solvers are discussed with particular emphases on mesh-partitioning, load balancing and communication. Two examples are used to illustrate these concepts: a multi-block structured code and an unstructured code. Parallel computing strategies used with both flow solvers are detailed and compared. This comparison indicates that mesh-partitioning and load balancing are more straightforward with unstructured grids than with multi-block structured meshes. However, the mesh-partitioning stage can be challenging for unstructured grids, mainly due to memory limitations of the newly developed massively parallel architectures. Finally, detailed investigations show that the impact of mesh-partitioning on the numerical CFD solutions, due to rounding errors and block splitting, may be of importance and should be accurately addressed before qualifying massively parallel CFD tools for a routine industrial use.
Moon, Hongsik
What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited by the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared with the performance using benchmark software and the metric was FLoting-point Operations Per Seconds (FLOPS) which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore system? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPs and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the
Speed up of MCACE, a Monte Carlo code for evaluation of shielding safety, by parallel computer, (3)
International Nuclear Information System (INIS)
Takano, Makoto; Masukawa, Fumihiro; Naito, Yoshitaka; Onodera, Emi; Imawaka, Tsuneyuki; Yoda, Yoshihisa.
1993-07-01
The parallel computing of the MCACE code has been studied on two platforms; 1) Shared Memory Type Vector-Parallel Computer Monte-4 and 2) Networked Several Workstations. On the Monte-4, a disk-file has been allocated to collect all results computed by 4 CPUs in parallel, executing the copy of the MCACE code on each CPU. On the workstations under network environment, two parallel models have been evaluated; 1) a host-node model and 2) the model used on the Monte-4 where no software for parallelization has been employed but only standard FORTRAN language. The measurement of computing times has showed that speed up of about 3 times has been achieved by using 4 CPUs of the Monte-4. Further, connecting 4 workstations by network, the computing speed by parallelization has achieved faster than our scalar main frame computer, FACOM M-780. (author)
Directory of Open Access Journals (Sweden)
Sonja Hall-Mendelin
Full Text Available Human disease incidence attributed to arbovirus infection is increasing throughout the world, with effective control interventions limited by issues of sustainability, insecticide resistance and the lack of effective vaccines. Several promising control strategies are currently under development, such as the release of mosquitoes trans-infected with virus-blocking Wolbachia bacteria. Implementation of any control program is dependent on effective virus surveillance and a thorough understanding of virus-vector interactions. Massively parallel sequencing has enormous potential for providing comprehensive genomic information that can be used to assess many aspects of arbovirus ecology, as well as to evaluate novel control strategies. To demonstrate proof-of-principle, we analyzed Aedes aegypti or Aedes albopictus experimentally infected with dengue, yellow fever or chikungunya viruses. Random amplification was used to prepare sufficient template for sequencing on the Personal Genome Machine. Viral sequences were present in all infected mosquitoes. In addition, in most cases, we were also able to identify the mosquito species and mosquito micro-organisms, including the bacterial endosymbiont Wolbachia. Importantly, naturally occurring Wolbachia strains could be differentiated from strains that had been trans-infected into the mosquito. The method allowed us to assemble near full-length viral genomes and detect other micro-organisms without prior sequence knowledge, in a single reaction. This is a step toward the application of massively parallel sequencing as an arbovirus surveillance tool. It has the potential to provide insight into virus transmission dynamics, and has applicability to the post-release monitoring of Wolbachia in mosquito populations.
Computations on the massively parallel processor at the Goddard Space Flight Center
Strong, James P.
1991-01-01
Described are four significant algorithms implemented on the massively parallel processor (MPP) at the Goddard Space Flight Center. Two are in the area of image analysis. Of the other two, one is a mathematical simulation experiment and the other deals with the efficient transfer of data between distantly separated processors in the MPP array. The first algorithm presented is the automatic determination of elevations from stereo pairs. The second algorithm solves mathematical logistic equations capable of producing both ordered and chaotic (or random) solutions. This work can potentially lead to the simulation of artificial life processes. The third algorithm is the automatic segmentation of images into reasonable regions based on some similarity criterion, while the fourth is an implementation of a bitonic sort of data which significantly overcomes the nearest neighbor interconnection constraints on the MPP for transferring data between distant processors.
An FPGA-Based Massively Parallel Neuromorphic Cortex Simulator.
Wang, Runchun M; Thakur, Chetan S; van Schaik, André
2018-01-01
This paper presents a massively parallel and scalable neuromorphic cortex simulator designed for simulating large and structurally connected spiking neural networks, such as complex models of various areas of the cortex. The main novelty of this work is the abstraction of a neuromorphic architecture into clusters represented by minicolumns and hypercolumns, analogously to the fundamental structural units observed in neurobiology. Without this approach, simulating large-scale fully connected networks needs prohibitively large memory to store look-up tables for point-to-point connections. Instead, we use a novel architecture, based on the structural connectivity in the neocortex, such that all the required parameters and connections can be stored in on-chip memory. The cortex simulator can be easily reconfigured for simulating different neural networks without any change in hardware structure by programming the memory. A hierarchical communication scheme allows one neuron to have a fan-out of up to 200 k neurons. As a proof-of-concept, an implementation on one Altera Stratix V FPGA was able to simulate 20 million to 2.6 billion leaky-integrate-and-fire (LIF) neurons in real time. We verified the system by emulating a simplified auditory cortex (with 100 million neurons). This cortex simulator achieved a low power dissipation of 1.62 μW per neuron. With the advent of commercially available FPGA boards, our system offers an accessible and scalable tool for the design, real-time simulation, and analysis of large-scale spiking neural networks.
An FPGA-Based Massively Parallel Neuromorphic Cortex Simulator
Directory of Open Access Journals (Sweden)
Runchun M. Wang
2018-04-01
Full Text Available This paper presents a massively parallel and scalable neuromorphic cortex simulator designed for simulating large and structurally connected spiking neural networks, such as complex models of various areas of the cortex. The main novelty of this work is the abstraction of a neuromorphic architecture into clusters represented by minicolumns and hypercolumns, analogously to the fundamental structural units observed in neurobiology. Without this approach, simulating large-scale fully connected networks needs prohibitively large memory to store look-up tables for point-to-point connections. Instead, we use a novel architecture, based on the structural connectivity in the neocortex, such that all the required parameters and connections can be stored in on-chip memory. The cortex simulator can be easily reconfigured for simulating different neural networks without any change in hardware structure by programming the memory. A hierarchical communication scheme allows one neuron to have a fan-out of up to 200 k neurons. As a proof-of-concept, an implementation on one Altera Stratix V FPGA was able to simulate 20 million to 2.6 billion leaky-integrate-and-fire (LIF neurons in real time. We verified the system by emulating a simplified auditory cortex (with 100 million neurons. This cortex simulator achieved a low power dissipation of 1.62 μW per neuron. With the advent of commercially available FPGA boards, our system offers an accessible and scalable tool for the design, real-time simulation, and analysis of large-scale spiking neural networks.
International Nuclear Information System (INIS)
Lee, Jin Pyo; Joo, Han Gyu
2010-01-01
In the thermo-fluid analysis code named CUPID, the linear system of pressure equations must be solved in each iteration step. The time for repeatedly solving the linear system can be quite significant because large sparse matrices of Rank more than 50,000 are involved and the diagonal dominance of the system is hardly hold. Therefore parallelization of the linear system solver is essential to reduce the computing time. Meanwhile, Graphics Processing Units (GPU) have been developed as highly parallel, multi-core processors for the global demand of high quality 3D graphics. If a suitable interface is provided, parallelization using GPU can be available to engineering computing. NVIDIA provides a Software Development Kit(SDK) named CUDA(Compute Unified Device Architecture) to code developers so that they can manage GPUs for parallelization using the C language. In this research, we implement parallel routines for the linear system solver using CUDA, and examine the performance of the parallelization. In the next section, we will describe the method of CUDA parallelization for the CUPID code, and then the performance of the CUDA parallelization will be discussed
Shared Memory Parallelization of an Implicit ADI-type CFD Code
Hauser, Th.; Huang, P. G.
1999-01-01
A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.
MHD code using multi graphical processing units: SMAUG+
Gyenge, N.; Griffiths, M. K.; Erdélyi, R.
2018-01-01
This paper introduces the Sheffield Magnetohydrodynamics Algorithm Using GPUs (SMAUG+), an advanced numerical code for solving magnetohydrodynamic (MHD) problems, using multi-GPU systems. Multi-GPU systems facilitate the development of accelerated codes and enable us to investigate larger model sizes and/or more detailed computational domain resolutions. This is a significant advancement over the parent single-GPU MHD code, SMAUG (Griffiths et al., 2015). Here, we demonstrate the validity of the SMAUG + code, describe the parallelisation techniques and investigate performance benchmarks. The initial configuration of the Orszag-Tang vortex simulations are distributed among 4, 16, 64 and 100 GPUs. Furthermore, different simulation box resolutions are applied: 1000 × 1000, 2044 × 2044, 4000 × 4000 and 8000 × 8000 . We also tested the code with the Brio-Wu shock tube simulations with model size of 800 employing up to 10 GPUs. Based on the test results, we observed speed ups and slow downs, depending on the granularity and the communication overhead of certain parallel tasks. The main aim of the code development is to provide massively parallel code without the memory limitation of a single GPU. By using our code, the applied model size could be significantly increased. We demonstrate that we are able to successfully compute numerically valid and large 2D MHD problems.
FISH: A THREE-DIMENSIONAL PARALLEL MAGNETOHYDRODYNAMICS CODE FOR ASTROPHYSICAL APPLICATIONS
International Nuclear Information System (INIS)
Kaeppeli, R.; Whitehouse, S. C.; Scheidegger, S.; Liebendoerfer, M.; Pen, U.-L.
2011-01-01
FISH is a fast and simple ideal magnetohydrodynamics code that scales to ∼10,000 processes for a Cartesian computational domain of ∼1000 3 cells. The simplicity of FISH has been achieved by the rigorous application of the operator splitting technique, while second-order accuracy is maintained by the symmetric ordering of the operators. Between directional sweeps, the three-dimensional data are rotated in memory so that the sweep is always performed in a cache-efficient way along the direction of contiguous memory. Hence, the code only requires a one-dimensional description of the conservation equations to be solved. This approach also enables an elegant novel parallelization of the code that is based on persistent communications with MPI for cubic domain decomposition on machines with distributed memory. This scheme is then combined with an additional OpenMP parallelization of different sweeps that can take advantage of clusters of shared memory. We document the detailed implementation of a second-order total variation diminishing advection scheme based on flux reconstruction. The magnetic fields are evolved by a constrained transport scheme. We show that the subtraction of a simple estimate of the hydrostatic gradient from the total gradients can significantly reduce the dissipation of the advection scheme in simulations of gravitationally bound hydrostatic objects. Through its simplicity and efficiency, FISH is as well suited for hydrodynamics classes as for large-scale astrophysical simulations on high-performance computer clusters. In preparation for the release of a public version, we demonstrate the performance of FISH in a suite of astrophysically orientated test cases.
International Nuclear Information System (INIS)
Salko, Robert K.; Schmidt, Rodney C.; Avramova, Maria N.
2015-01-01
Highlights: • COBRA-TF was adopted by the Consortium for Advanced Simulation of LWRs. • We have improved code performance to support running large-scale LWR simulations. • Code optimization has led to reductions in execution time and memory usage. • An MPI parallelization has reduced full-core simulation time from days to minutes. - Abstract: This paper describes major improvements to the computational infrastructure of the CTF subchannel code so that full-core, pincell-resolved (i.e., one computational subchannel per real bundle flow channel) simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations. These improvements support the goals of the Department Of Energy Consortium for Advanced Simulation of Light Water Reactors (CASL) Energy Innovation Hub to develop high fidelity multi-physics simulation tools for nuclear energy design and analysis. A set of serial code optimizations—including fixing computational inefficiencies, optimizing the numerical approach, and making smarter data storage choices—are first described and shown to reduce both execution time and memory usage by about a factor of ten. Next, a “single program multiple data” parallelization strategy targeting distributed memory “multiple instruction multiple data” platforms utilizing domain decomposition is presented. In this approach, data communication between processors is accomplished by inserting standard Message-Passing Interface (MPI) calls at strategic points in the code. The domain decomposition approach implemented assigns one MPI process to each fuel assembly, with each domain being represented by its own CTF input file. The creation of CTF input files, both for serial and parallel runs, is also fully automated through use of a pressurized water reactor (PWR) pre-processor utility that uses a greatly simplified set of user input compared with the traditional CTF input. To run CTF in
Cloud identification using genetic algorithms and massively parallel computation
Buckles, Bill P.; Petry, Frederick E.
1996-01-01
As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkedly consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMS) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user
Energy Technology Data Exchange (ETDEWEB)
Xiong, Yi [Colorado School of Mines, Golden, CO (United States); Fakcharoenphol, Perapon [Colorado School of Mines, Golden, CO (United States); Wang, Shihao [Colorado School of Mines, Golden, CO (United States); Winterfeld, Philip H. [Colorado School of Mines, Golden, CO (United States); Zhang, Keni [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Wu, Yu-Shu [Colorado School of Mines, Golden, CO (United States)
2013-12-01
TOUGH2-EGS-MP is a parallel numerical simulation program coupling geomechanics with fluid and heat flow in fractured and porous media, and is applicable for simulation of enhanced geothermal systems (EGS). TOUGH2-EGS-MP is based on the TOUGH2-MP code, the massively parallel version of TOUGH2. In TOUGH2-EGS-MP, the fully-coupled flow-geomechanics model is developed from linear elastic theory for thermo-poro-elastic systems and is formulated in terms of mean normal stress as well as pore pressure and temperature. Reservoir rock properties such as porosity and permeability depend on rock deformation, and the relationships between these two, obtained from poro-elasticity theories and empirical correlations, are incorporated into the simulation. This report provides the user with detailed information on the TOUGH2-EGS-MP mathematical model and instructions for using it for Thermal-Hydrological-Mechanical (THM) simulations. The mathematical model includes the fluid and heat flow equations, geomechanical equation, and discretization of those equations. In addition, the parallel aspects of the code, such as domain partitioning and communication between processors, are also included. Although TOUGH2-EGS-MP has the capability for simulating fluid and heat flows coupled with geomechanical effects, it is up to the user to select the specific coupling process, such as THM or only TH, in a simulation. There are several example problems illustrating applications of this program. These example problems are described in detail and their input data are presented. Their results demonstrate that this program can be used for field-scale geothermal reservoir simulation in porous and fractured media with fluid and heat flow coupled with geomechanical effects.
Energy Technology Data Exchange (ETDEWEB)
1991-10-23
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Application of parallel computing to seismic damage process simulation of an arch dam
International Nuclear Information System (INIS)
Zhong Hong; Lin Gao; Li Jianbo
2010-01-01
The simulation of damage process of high arch dam subjected to strong earthquake shocks is significant to the evaluation of its performance and seismic safety, considering the catastrophic effect of dam failure. However, such numerical simulation requires rigorous computational capacity. Conventional serial computing falls short of that and parallel computing is a fairly promising solution to this problem. The parallel finite element code PDPAD was developed for the damage prediction of arch dams utilizing the damage model with inheterogeneity of concrete considered. Developed with programming language Fortran, the code uses a master/slave mode for programming, domain decomposition method for allocation of tasks, MPI (Message Passing Interface) for communication and solvers from AZTEC library for solution of large-scale equations. Speedup test showed that the performance of PDPAD was quite satisfactory. The code was employed to study the damage process of a being-built arch dam on a 4-node PC Cluster, with more than one million degrees of freedom considered. The obtained damage mode was quite similar to that of shaking table test, indicating that the proposed procedure and parallel code PDPAD has a good potential in simulating seismic damage mode of arch dams. With the rapidly growing need for massive computation emerged from engineering problems, parallel computing will find more and more applications in pertinent areas.
Morse, H Stephen
1994-01-01
Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment.Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi
Algorithms and data structures for massively parallel generic adaptive finite element codes
Bangerth, Wolfgang
2011-12-01
Today\\'s largest supercomputers have 100,000s of processor cores and offer the potential to solve partial differential equations discretized by billions of unknowns. However, the complexity of scaling to such large machines and problem sizes has so far prevented the emergence of generic software libraries that support such computations, although these would lower the threshold of entry and enable many more applications to benefit from large-scale computing. We are concerned with providing this functionality for mesh-adaptive finite element computations. We assume the existence of an "oracle" that implements the generation and modification of an adaptive mesh distributed across many processors, and that responds to queries about its structure. Based on querying the oracle, we develop scalable algorithms and data structures for generic finite element methods. Specifically, we consider the parallel distribution of mesh data, global enumeration of degrees of freedom, constraints, and postprocessing. Our algorithms remove the bottlenecks that typically limit large-scale adaptive finite element analyses. We demonstrate scalability of complete finite element workflows on up to 16,384 processors. An implementation of the proposed algorithms, based on the open source software p4est as mesh oracle, is provided under an open source license through the widely used deal.II finite element software library. © 2011 ACM 0098-3500/2011/12-ART10 $10.00.
International Nuclear Information System (INIS)
Gwo, Jin-Ping; Yeh, Gour-Tsyh
1997-01-01
The objectives of this study are (1) to parallelize a 3-dimensional hydrogeochemistry code and (2) to apply the parallel code to a proposed waste disposal site at the Oak Ridge National Laboratory (ORNL). The 2-dimensional hydrogeochemistry code HYDROGEOCHEM, developed at the Pennsylvania State University for coupled subsurface solute transport and chemical equilibrium processes, was first modified to accommodate 3-dimensional problem domains. A bi-conjugate gradient stabilized linear matrix solver was then incorporated to solve the matrix equation. We chose to parallelize the 3-dimensional code on the Intel Paragons at ORNL by using an HPF (high performance FORTRAN) compiler developed at PGI. The data- and task-parallel algorithms available in the HPF compiler proved to be highly efficient for the geochemistry calculation. This calculation can be easily implemented in HPF formats and is perfectly parallel because the chemical speciation on one finite-element node is virtually independent of those on the others. The parallel code was applied to a subwatershed of the Melton Branch at ORNL. Chemical heterogeneity, in addition to physical heterogeneities of the geological formations, has been identified as one of the major factors that affect the fate and transport of contaminants at ORNL. This study demonstrated an application of the 3-dimensional hydrogeochemistry code on the Melton Branch site. A uranium tailing problem that involved in aqueous complexation and precipitation-dissolution was tested. Performance statistics was collected on the Intel Paragons at ORNL. Implications of these results on the further optimization of the code were discussed
Schnack, D. D.; Glasser, A. H.
1996-11-01
NIMROD is a new code system that is being developed for the analysis of modern fusion experiments. It is being designed from the beginning to make the maximum use of massively parallel computer architectures and computer graphics. The NIMROD physics kernel solves the three-dimensional, time-dependent two-fluid equations with neo-classical effects in toroidal geometry of arbitrary poloidal cross section. The NIMROD system also includes a pre-processor, a grid generator, and a post processor. User interaction with NIMROD is facilitated by a modern graphical user interface (GUI). The NIMROD project is using Quality Function Deployment (QFD) team management techniques to minimize re-engineering and reduce code development time. This paper gives an overview of the NIMROD project. Operation of the GUI is demonstrated, and the first results from the physics kernel are given.
A 3D gyrokinetic particle-in-cell simulation of fusion plasma microturbulence on parallel computers
Williams, T. J.
1992-12-01
One of the grand challenge problems now supported by HPCC is the Numerical Tokamak Project. A goal of this project is the study of low-frequency micro-instabilities in tokamak plasmas, which are believed to cause energy loss via turbulent thermal transport across the magnetic field lines. An important tool in this study is gyrokinetic particle-in-cell (PIC) simulation. Gyrokinetic, as opposed to fully-kinetic, methods are particularly well suited to the task because they are optimized to study the frequency and wavelength domain of the microinstabilities. Furthermore, many researchers now employ low-noise delta(f) methods to greatly reduce statistical noise by modelling only the perturbation of the gyrokinetic distribution function from a fixed background, not the entire distribution function. In spite of the increased efficiency of these improved algorithms over conventional PIC algorithms, gyrokinetic PIC simulations of tokamak micro-turbulence are still highly demanding of computer power--even fully-vectorized codes on vector supercomputers. For this reason, we have worked for several years to redevelop these codes on massively parallel computers. We have developed 3D gyrokinetic PIC simulation codes for SIMD and MIMD parallel processors, using control-parallel, data-parallel, and domain-decomposition message-passing (DDMP) programming paradigms. This poster summarizes our earlier work on codes for the Connection Machine and BBN TC2000 and our development of a generic DDMP code for distributed-memory parallel machines. We discuss the memory-access issues which are of key importance in writing parallel PIC codes, with special emphasis on issues peculiar to gyrokinetic PIC. We outline the domain decompositions in our new DDMP code and discuss the interplay of different domain decompositions suited for the particle-pushing and field-solution components of the PIC algorithm.
International Nuclear Information System (INIS)
Satake, Shinsuke; Okamoto, Masao; Nakajima, Noriyoshi; Takamaru, Hisanori
2005-11-01
A neoclassical transport simulation code (FORTEC-3D) applicable to three-dimensional configurations has been developed using High Performance Fortran (HPF). Adoption of computing techniques for parallelization and a hybrid simulation model to the δf Monte-Carlo method transport simulation, including non-local transport effects in three-dimensional configurations, makes it possible to simulate the dynamism of global, non-local transport phenomena with a self-consistent radial electric field within a reasonable computation time. In this paper, development of the transport code using HPF is reported. Optimization techniques in order to achieve both high vectorization and parallelization efficiency, adoption of a parallel random number generator, and also benchmark results, are shown. (author)
Energy Technology Data Exchange (ETDEWEB)
Moryakov, A. V., E-mail: sailor@orc.ru [National Research Centre Kurchatov Institute (Russian Federation)
2016-12-15
An algorithm for solving the time-dependent transport equation in the P{sub m}S{sub n} group approximation with the use of parallel computations is presented. The algorithm is implemented in the LUCKY-TD code for supercomputers employing the MPI standard for the data exchange between parallel processes.
CERN. Geneva
2016-01-01
Large scale scientific computing raises questions on different levels ranging from the fomulation of the problems to the choice of the best algorithms and their implementation for a specific platform. There are similarities in these different topics that can be exploited by modern-style C++ template metaprogramming techniques to produce readable, maintainable and generic code. Traditional low-level code tend to be fast but platform-dependent, and it obfuscates the meaning of the algorithm. On the other hand, object-oriented approach is nice to read, but may come with an inherent performance penalty. These lectures aim to present he basics of the Expression Template (ET) idiom which allows us to keep the object-oriented approach without sacrificing performance. We will in particular show to to enhance ET to include SIMD vectorization. We will then introduce techniques for abstracting iteration, and introduce thread-level parallelism for use in heavy data-centric loads. We will show to to apply these methods i...
Narayanan, Kiran
2012-07-17
A hybrid parallelization method composed of a coarse-grained genetic algorithm (GA) and fine-grained objective function evaluations is implemented on a heterogeneous computational resource consisting of 16 IBM Blue Gene/P racks, a single x86 cluster node and a high-performance file system. The GA iterator is coupled with a finite-element (FE) analysis code developed in house to facilitate computational steering in order to calculate the optimal impact velocities of a projectile colliding with a polyurea/structural steel composite plate. The FE code is capable of capturing adiabatic shear bands and strain localization, which are typically observed in high-velocity impact applications, and it includes several constitutive models of plasticity, viscoelasticity and viscoplasticity for metals and soft materials, which allow simulation of ductile fracture by void growth. A strong scaling study of the FE code was conducted to determine the optimum number of processes run in parallel. The relative efficiency of the hybrid, multi-level parallelization method is studied in order to determine the parameters for the parallelization. Optimal impact velocities of the projectile calculated using the proposed approach, are reported. © The Author(s) 2012.
Vanaverbeke, S.; Keppens, R.; Poedts, S.; Boffin, H.
2009-01-01
We describe the algorithms implemented in the first version of GRADSPH, a parallel, tree-based, smoothed particle hydrodynamics code for simulating self-gravitating astrophysical systems written in FORTRAN 90. The paper presents details on the implementation of the Smoothed Particle Hydro (SPH)
Laurie, Matthew T; Bertout, Jessica A; Taylor, Sean D; Burton, Joshua N; Shendure, Jay A; Bielas, Jason H
2013-08-01
Due to the high cost of failed runs and suboptimal data yields, quantification and determination of fragment size range are crucial steps in the library preparation process for massively parallel sequencing (or next-generation sequencing). Current library quality control methods commonly involve quantification using real-time quantitative PCR and size determination using gel or capillary electrophoresis. These methods are laborious and subject to a number of significant limitations that can make library calibration unreliable. Herein, we propose and test an alternative method for quality control of sequencing libraries using droplet digital PCR (ddPCR). By exploiting a correlation we have discovered between droplet fluorescence and amplicon size, we achieve the joint quantification and size determination of target DNA with a single ddPCR assay. We demonstrate the accuracy and precision of applying this method to the preparation of sequencing libraries.
PENTACLE: Parallelized particle-particle particle-tree code for planet formation
Iwasawa, Masaki; Oshino, Shoichi; Fujii, Michiko S.; Hori, Yasunori
2017-10-01
We have newly developed a parallelized particle-particle particle-tree code for planet formation, PENTACLE, which is a parallelized hybrid N-body integrator executed on a CPU-based (super)computer. PENTACLE uses a fourth-order Hermite algorithm to calculate gravitational interactions between particles within a cut-off radius and a Barnes-Hut tree method for gravity from particles beyond. It also implements an open-source library designed for full automatic parallelization of particle simulations, FDPS (Framework for Developing Particle Simulator), to parallelize a Barnes-Hut tree algorithm for a memory-distributed supercomputer. These allow us to handle 1-10 million particles in a high-resolution N-body simulation on CPU clusters for collisional dynamics, including physical collisions in a planetesimal disc. In this paper, we show the performance and the accuracy of PENTACLE in terms of \\tilde{R}_cut and a time-step Δt. It turns out that the accuracy of a hybrid N-body simulation is controlled through Δ t / \\tilde{R}_cut and Δ t / \\tilde{R}_cut ˜ 0.1 is necessary to simulate accurately the accretion process of a planet for ≥106 yr. For all those interested in large-scale particle simulations, PENTACLE, customized for planet formation, will be freely available from https://github.com/PENTACLE-Team/PENTACLE under the MIT licence.
Parallel Programming with Intel Parallel Studio XE
Blair-Chappell , Stephen
2012-01-01
Optimize code for multi-core processors with Intel's Parallel Studio Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the
Fox, Geoffrey C; Messina, Guiseppe C
2014-01-01
A clear illustration of how parallel computers can be successfully appliedto large-scale scientific computations. This book demonstrates how avariety of applications in physics, biology, mathematics and other scienceswere implemented on real parallel computers to produce new scientificresults. It investigates issues of fine-grained parallelism relevant forfuture supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configuredifferent massively parallel machines, design and implement basic systemsoftware, and develop
Grid-based Parallel Data Streaming Implemented for the Gyrokinetic Toroidal Code
International Nuclear Information System (INIS)
Klasky, S.; Ethier, S.; Lin, Z.; Martins, K.; McCune, D.; Samtaney, R.
2003-01-01
We have developed a threaded parallel data streaming approach using Globus to transfer multi-terabyte simulation data from a remote supercomputer to the scientist's home analysis/visualization cluster, as the simulation executes, with negligible overhead. Data transfer experiments show that this concurrent data transfer approach is more favorable compared with writing to local disk and then transferring this data to be post-processed. The present approach is conducive to using the grid to pipeline the simulation with post-processing and visualization. We have applied this method to the Gyrokinetic Toroidal Code (GTC), a 3-dimensional particle-in-cell code used to study microturbulence in magnetic confinement fusion from first principles plasma theory
Statistical method to compare massive parallel sequencing pipelines.
Elsensohn, M H; Leblay, N; Dimassi, S; Campan-Fournier, A; Labalme, A; Roucher-Boulez, F; Sanlaville, D; Lesca, G; Bardel, C; Roy, P
2017-03-01
Today, sequencing is frequently carried out by Massive Parallel Sequencing (MPS) that cuts drastically sequencing time and expenses. Nevertheless, Sanger sequencing remains the main validation method to confirm the presence of variants. The analysis of MPS data involves the development of several bioinformatic tools, academic or commercial. We present here a statistical method to compare MPS pipelines and test it in a comparison between an academic (BWA-GATK) and a commercial pipeline (TMAP-NextGENe®), with and without reference to a gold standard (here, Sanger sequencing), on a panel of 41 genes in 43 epileptic patients. This method used the number of variants to fit log-linear models for pairwise agreements between pipelines. To assess the heterogeneity of the margins and the odds ratios of agreement, four log-linear models were used: a full model, a homogeneous-margin model, a model with single odds ratio for all patients, and a model with single intercept. Then a log-linear mixed model was fitted considering the biological variability as a random effect. Among the 390,339 base-pairs sequenced, TMAP-NextGENe® and BWA-GATK found, on average, 2253.49 and 1857.14 variants (single nucleotide variants and indels), respectively. Against the gold standard, the pipelines had similar sensitivities (63.47% vs. 63.42%) and close but significantly different specificities (99.57% vs. 99.65%; p < 0.001). Same-trend results were obtained when only single nucleotide variants were considered (99.98% specificity and 76.81% sensitivity for both pipelines). The method allows thus pipeline comparison and selection. It is generalizable to all types of MPS data and all pipelines.
Schnek: A C++ library for the development of parallel simulation codes on regular grids
Schmitz, Holger
2018-05-01
A large number of algorithms across the field of computational physics are formulated on grids with a regular topology. We present Schnek, a library that enables fast development of parallel simulations on regular grids. Schnek contains a number of easy-to-use modules that greatly reduce the amount of administrative code for large-scale simulation codes. The library provides an interface for reading simulation setup files with a hierarchical structure. The structure of the setup file is translated into a hierarchy of simulation modules that the developer can specify. The reader parses and evaluates mathematical expressions and initialises variables or grid data. This enables developers to write modular and flexible simulation codes with minimal effort. Regular grids of arbitrary dimension are defined as well as mechanisms for defining physical domain sizes, grid staggering, and ghost cells on these grids. Ghost cells can be exchanged between neighbouring processes using MPI with a simple interface. The grid data can easily be written into HDF5 files using serial or parallel I/O.
Comprehensive microRNA profiling in B-cells of human centenarians by massively parallel sequencing
Directory of Open Access Journals (Sweden)
Gombar Saurabh
2012-07-01
Full Text Available Abstract Background MicroRNAs (miRNAs are small, non-coding RNAs that regulate gene expression and play a critical role in development, homeostasis, and disease. Despite their demonstrated roles in age-associated pathologies, little is known about the role of miRNAs in human aging and longevity. Results We employed massively parallel sequencing technology to identify miRNAs expressed in B-cells from Ashkenazi Jewish centenarians, i.e., those living to a hundred and a human model of exceptional longevity, and younger controls without a family history of longevity. With data from 26.7 million reads comprising 9.4 × 108 bp from 3 centenarian and 3 control individuals, we discovered a total of 276 known miRNAs and 8 unknown miRNAs ranging several orders of magnitude in expression levels, a typical characteristics of saturated miRNA-sequencing. A total of 22 miRNAs were found to be significantly upregulated, with only 2 miRNAs downregulated, in centenarians as compared to controls. Gene Ontology analysis of the predicted and validated targets of the 24 differentially expressed miRNAs indicated enrichment of functional pathways involved in cell metabolism, cell cycle, cell signaling, and cell differentiation. A cross sectional expression analysis of the differentially expressed miRNAs in B-cells from Ashkenazi Jewish individuals between the 50th and 100th years of age indicated that expression levels of miR-363* declined significantly with age. Centenarians, however, maintained the youthful expression level. This result suggests that miR-363* may be a candidate longevity-associated miRNA. Conclusion Our comprehensive miRNA data provide a resource for further studies to identify genetic pathways associated with aging and longevity in humans.
Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank
Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.
2014-05-01
Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependant far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration. Such increase in bandwidth is
ABINIT: a computer code for matter; Abinit: un code au service de la matiere
Energy Technology Data Exchange (ETDEWEB)
Amadon, B.; Bottin, F.; Bouchet, J.; Dewaele, A.; Jollet, F.; Jomard, G.; Loubeyre, P.; Mazevet, S.; Recoules, V.; Torrent, M.; Zerah, G. [CEA Bruyeres-le-Chatel, 91 (France)
2008-07-01
The PAW (Projector Augmented Wave) method has been implemented in the ABINIT Code that computes electronic structures in atoms. This method relies on the simultaneous use of a set of auxiliary functions (in plane waves) and a sphere around each atom. This method allows the computation of systems including many atoms and gives the expression of energy, forces, stress... in terms of the auxiliary function only. We have generated atomic data for iron at very high pressure (over 200 GPa). We get a bcc-hcp transition around 10 GPa and the magnetic order disappears around 50 GPa. This method has been validated on a series of metals. The development of the PAW method has required a great effort for the massive parallelization of the ABINIT code. (A.C.)
Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F
2002-02-01
This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
Directory of Open Access Journals (Sweden)
Asker Noomi
2009-07-01
Full Text Available Abstract Background The teleost Zoarces viviparus (eelpout lives along the coasts of Northern Europe and has long been an established model organism for marine ecology and environmental monitoring. The scarce information about this species genome has however restrained the use of efficient molecular-level assays, such as gene expression microarrays. Results In the present study we present the first comprehensive characterization of the Zoarces viviparus liver transcriptome. From 400,000 reads generated by massively parallel pyrosequencing, more than 50,000 pieces of putative transcripts were assembled, annotated and functionally classified. The data was estimated to cover roughly 40% of the total transcriptome and homologues for about half of the genes of Gasterosteus aculeatus (stickleback were identified. The sequence data was consequently used to design an oligonucleotide microarray for large-scale gene expression analysis. Conclusion Our results show that one run using a Genome Sequencer FLX from 454 Life Science/Roche generates enough genomic information for adequate de novo assembly of a large number of genes in a higher vertebrate. The generated sequence data, including the validated microarray probes, are publicly available to promote genome-wide research in Zoarces viviparus.
Pattern-Driven Automatic Parallelization
Directory of Open Access Journals (Sweden)
Christoph W. Kessler
1996-01-01
Full Text Available This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.
Eduardoff, M; Gross, T E; Santos, C; de la Puente, M; Ballard, D; Strobl, C; Børsting, C; Morling, N; Fusco, L; Hussing, C; Egyed, B; Souto, L; Uacyisrael, J; Syndercombe Court, D; Carracedo, Á; Lareu, M V; Schneider, P M; Parson, W; Phillips, C; Parson, W; Phillips, C
2016-07-01
The EUROFORGEN Global ancestry-informative SNP (AIM-SNPs) panel is a forensic multiplex of 128 markers designed to differentiate an individual's ancestry from amongst the five continental population groups of Africa, Europe, East Asia, Native America, and Oceania. A custom multiplex of AmpliSeq™ PCR primers was designed for the Global AIM-SNPs to perform massively parallel sequencing using the Ion PGM™ system. This study assessed individual SNP genotyping precision using the Ion PGM™, the forensic sensitivity of the multiplex using dilution series, degraded DNA plus simple mixtures, and the ancestry differentiation power of the final panel design, which required substitution of three original ancestry-informative SNPs with alternatives. Fourteen populations that had not been previously analyzed were genotyped using the custom multiplex and these studies allowed assessment of genotyping performance by comparison of data across five laboratories. Results indicate a low level of genotyping error can still occur from sequence misalignment caused by homopolymeric tracts close to the target SNP, despite careful scrutiny of candidate SNPs at the design stage. Such sequence misalignment required the exclusion of component SNP rs2080161 from the Global AIM-SNPs panel. However, the overall genotyping precision and sensitivity of this custom multiplex indicates the Ion PGM™ assay for the Global AIM-SNPs is highly suitable for forensic ancestry analysis with massively parallel sequencing. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Learning Joint-Sparse Codes for Calibration-Free Parallel MR Imaging.
Wang, Shanshan; Tan, Sha; Gao, Yuan; Liu, Qiegen; Ying, Leslie; Xiao, Taohui; Liu, Yuanyuan; Liu, Xin; Zheng, Hairong; Liang, Dong
2018-01-01
The integration of compressed sensing and parallel imaging (CS-PI) has shown an increased popularity in recent years to accelerate magnetic resonance (MR) imaging. Among them, calibration-free techniques have presented encouraging performances due to its capability in robustly handling the sensitivity information. Unfortunately, existing calibration-free methods have only explored joint-sparsity with direct analysis transform projections. To further exploit joint-sparsity and improve reconstruction accuracy, this paper proposes to Learn joINt-sparse coDes for caliBration-free parallEl mR imaGing (LINDBERG) by modeling the parallel MR imaging problem as an - - minimization objective with an norm constraining data fidelity, Frobenius norm enforcing sparse representation error and the mixed norm triggering joint sparsity across multichannels. A corresponding algorithm has been developed to alternatively update the sparse representation, sensitivity encoded images and K-space data. Then, the final image is produced as the square root of sum of squares of all channel images. Experimental results on both physical phantom and in vivo data sets show that the proposed method is comparable and even superior to state-of-the-art CS-PI reconstruction approaches. Specifically, LINDBERG has presented strong capability in suppressing noise and artifacts while reconstructing MR images from highly undersampled multichannel measurements.
Parallel coding of conjunctions in visual search.
Found, A
1998-10-01
Two experiments investigated whether the conjunctive nature of nontarget items influenced search for a conjunction target. Each experiment consisted of two conditions. In both conditions, the target item was a red bar tilted to the right, among white tilted bars and vertical red bars. As well as color and orientation, display items also differed in terms of size. Size was irrelevant to search in that the size of the target varied randomly from trial to trial. In one condition, the size of items correlated with the other attributes of display items (e.g., all red items were big and all white items were small). In the other condition, the size of items varied randomly (i.e., some red items were small and some were big, and some white items were big and some were small). Search was more efficient in the size-correlated condition, consistent with the parallel coding of conjunctions in visual search.
Parallel Implementation of the Multi-Dimensional Spectral Code SPECT3D on large 3D grids.
Golovkin, Igor E.; Macfarlane, Joseph J.; Woodruff, Pamela R.; Pereyra, Nicolas A.
2006-10-01
The multi-dimensional collisional-radiative, spectral analysis code SPECT3D can be used to study radiation from complex plasmas. SPECT3D can generate instantaneous and time-gated images and spectra, space-resolved and streaked spectra, which makes it a valuable tool for post-processing hydrodynamics calculations and direct comparison between simulations and experimental data. On large three dimensional grids, transporting radiation along lines of sight (LOS) requires substantial memory and CPU resources. Currently, the parallel option in SPECT3D is based on parallelization over photon frequencies and allows for a nearly linear speed-up for a variety of problems. In addition, we are introducing a new parallel mechanism that will greatly reduce memory requirements. In the new implementation, spatial domain decomposition will be utilized allowing transport along a LOS to be performed only on the mesh cells the LOS crosses. The ability to operate on a fraction of the grid is crucial for post-processing the results of large-scale three-dimensional hydrodynamics simulations. We will present a parallel implementation of the code and provide a scalability study performed on a Linux cluster.
Energy Technology Data Exchange (ETDEWEB)
Azmy, Yousry
2014-06-10
We employ the Integral Transport Matrix Method (ITMM) as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells' fluxes and between the cells' and boundary surfaces' fluxes. The main goals of this work are to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and parallel performance of the developed methods with increasing number of processes, P. The fastest observed parallel solution method, Parallel Gauss-Seidel (PGS), was used in a weak scaling comparison with the PARTISN transport code, which uses the source iteration (SI) scheme parallelized with the Koch-baker-Alcouffe (KBA) method. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method- even without acceleration/preconditioning-is completitive for optically thick problems as P is increased to the tens of thousands range. For the most optically thick cells tested, PGS reduced execution time by an approximate factor of three for problems with more than 130 million computational cells on P = 32,768. Moreover, the SI-DSA execution times's trend rises generally more steeply with increasing P than the PGS trend. Furthermore, the PGS method outperforms SI for the periodic heterogeneous layers (PHL) configuration problems. The PGS method outperforms SI and SI-DSA on as few as P = 16 for PHL problems and reduces execution time by a factor of ten or more for all problems considered with more than 2 million computational cells on P = 4.096.
Gunnels, John; Lee, Jon; Margulies, Susan
2010-01-01
We provide a first demonstration of the idea that matrix-based algorithms for nonlinear combinatorial optimization problems can be efficiently implemented. Such algorithms were mainly conceived by theoretical computer scientists for proving efficiency. We are able to demonstrate the practicality of our approach by developing an implementation on a massively parallel architecture, and exploiting scalable and efficient parallel implementations of algorithms for ultra high-precision linear algebra. Additionally, we have delineated and implemented the necessary algorithmic and coding changes required in order to address problems several orders of magnitude larger, dealing with the limits of scalability from memory footprint, computational efficiency, reliability, and interconnect perspectives. © Springer and Mathematical Programming Society 2010.
Gunnels, John
2010-06-01
We provide a first demonstration of the idea that matrix-based algorithms for nonlinear combinatorial optimization problems can be efficiently implemented. Such algorithms were mainly conceived by theoretical computer scientists for proving efficiency. We are able to demonstrate the practicality of our approach by developing an implementation on a massively parallel architecture, and exploiting scalable and efficient parallel implementations of algorithms for ultra high-precision linear algebra. Additionally, we have delineated and implemented the necessary algorithmic and coding changes required in order to address problems several orders of magnitude larger, dealing with the limits of scalability from memory footprint, computational efficiency, reliability, and interconnect perspectives. © Springer and Mathematical Programming Society 2010.
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J
2004-09-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
International Nuclear Information System (INIS)
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J.
2004-01-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1x10 8 or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8x10 8 histories. For a smaller number of histories (1x10 8 ) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1x10 8 histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy
Directory of Open Access Journals (Sweden)
Chen-Chi Wu
Full Text Available Despite the clinical utility of genetic diagnosis to address idiopathic sensorineural hearing impairment (SNHI, the current strategy for screening mutations via Sanger sequencing suffers from the limitation that only a limited number of DNA fragments associated with common deafness mutations can be genotyped. Consequently, a definitive genetic diagnosis cannot be achieved in many families with discernible family history. To investigate the diagnostic utility of massively parallel sequencing (MPS, we applied the MPS technique to 12 multiplex families with idiopathic SNHI in which common deafness mutations had previously been ruled out. NimbleGen sequence capture array was designed to target all protein coding sequences (CDSs and 100 bp of the flanking sequence of 80 common deafness genes. We performed MPS on the Illumina HiSeq2000, and applied BWA, SAMtools, Picard, GATK, Variant Tools, ANNOVAR, and IGV for bioinformatics analyses. Initial data filtering with allele frequencies (0.95 prioritized 5 indels (insertions/deletions and 36 missense variants in the 12 multiplex families. After further validation by Sanger sequencing, segregation pattern, and evolutionary conservation of amino acid residues, we identified 4 variants in 4 different genes, which might lead to SNHI in 4 families compatible with autosomal dominant inheritance. These included GJB2 p.R75Q, MYO7A p.T381M, KCNQ4 p.S680F, and MYH9 p.E1256K. Among them, KCNQ4 p.S680F and MYH9 p.E1256K were novel. In conclusion, MPS allows genetic diagnosis in multiplex families with idiopathic SNHI by detecting mutations in relatively uncommon deafness genes.
Scalable parallel prefix solvers for discrete ordinates transport
International Nuclear Information System (INIS)
Pautz, S.; Pandya, T.; Adams, M.
2009-01-01
The well-known 'sweep' algorithm for inverting the streaming-plus-collision term in first-order deterministic radiation transport calculations has some desirable numerical properties. However, it suffers from parallel scaling issues caused by a lack of concurrency. The maximum degree of concurrency, and thus the maximum parallelism, grows more slowly than the problem size for sweeps-based solvers. We investigate a new class of parallel algorithms that involves recasting the streaming-plus-collision problem in prefix form and solving via cyclic reduction. This method, although computationally more expensive at low levels of parallelism than the sweep algorithm, offers better theoretical scalability properties. Previous work has demonstrated this approach for one-dimensional calculations; we show how to extend it to multidimensional calculations. Notably, for multiple dimensions it appears that this approach is limited to long-characteristics discretizations; other discretizations cannot be cast in prefix form. We implement two variants of the algorithm within the radlib/SCEPTRE transport code library at Sandia National Laboratories and show results on two different massively parallel systems. Both the 'forward' and 'symmetric' solvers behave similarly, scaling well to larger degrees of parallelism then sweeps-based solvers. We do observe some issues at the highest levels of parallelism (relative to the system size) and discuss possible causes. We conclude that this approach shows good potential for future parallel systems, but the parallel scalability will depend heavily on the architecture of the communication networks of these systems. (authors)
A Parallel Two-fluid Code for Global Magnetic Reconnection Studies
International Nuclear Information System (INIS)
Breslau, J.A.; Jardin, S.C.
2001-01-01
This paper describes a new algorithm for the computation of two-dimensional resistive magnetohydrodynamic (MHD) and two-fluid studies of magnetic reconnection in plasmas. It has been implemented on several parallel platforms and shows good scalability up to 32 CPUs for reasonable problem sizes. A fixed, nonuniform rectangular mesh is used to resolve the different spatial scales in the reconnection problem. The resistive MHD version of the code uses an implicit/explicit hybrid method, while the two-fluid version uses an alternating-direction implicit (ADI) method. The technique has proven useful for comparing several different theories of collisional and collisionless reconnection
International Nuclear Information System (INIS)
Candel, A.; Kabel, A.; Ko, K.; Lee, L.; Li, Z.; Limborg, C.; Ng, C.; Prudencio, E.; Schussman, G.; Uplenchwar, R.
2007-01-01
Over the past years, SLAC's Advanced Computations Department (ACD) has developed the parallel finite element (FE) particle-in-cell code Pic3P (Pic2P) for simulations of beam-cavity interactions dominated by space-charge effects. As opposed to standard space-charge dominated beam transport codes, which are based on the electrostatic approximation, Pic3P (Pic2P) includes space-charge, retardation and boundary effects as it self-consistently solves the complete set of Maxwell-Lorentz equations using higher-order FE methods on conformal meshes. Use of efficient, large-scale parallel processing allows for the modeling of photoinjectors with unprecedented accuracy, aiding the design and operation of the next-generation of accelerator facilities. Applications to the Linac Coherent Light Source (LCLS) RF gun are presented
A massively parallel GPU-accelerated model for analysis of fully nonlinear free surface waves
DEFF Research Database (Denmark)
Engsig-Karup, Allan Peter; Madsen, Morten G.; Glimberg, Stefan Lemvig
2011-01-01
-storage flexible-order accurate finite difference method that is known to be efficient and scalable on a CPU core (single thread). To achieve parallel performance of the relatively complex numerical model, we investigate a new trend in high-performance computing where many-core GPUs are utilized as high......-throughput co-processors to the CPU. We describe and demonstrate how this approach makes it possible to do fast desktop computations for large nonlinear wave problems in numerical wave tanks (NWTs) with close to 50/100 million total grid points in double/ single precision with 4 GB global device memory...... available. A new code base has been developed in C++ and compute unified device architecture C and is found to improve the runtime more than an order in magnitude in double precision arithmetic for the same accuracy over an existing CPU (single thread) Fortran 90 code when executed on a single modern GPU...
Directory of Open Access Journals (Sweden)
Jian Jiao
2017-11-01
Full Text Available In this paper, we propose a rate-compatible (RC parallel concatenated punctured polar (PCPP codes for incremental redundancy hybrid automatic repeat request (IR-HARQ transmission schemes, which can transmit multiple data blocks over a time-varying channel. The PCPP coding scheme can provide RC polar coding blocks in order to adapt to channel variations. First, we investigate an improved random puncturing (IRP pattern for the PCPP coding scheme due to the code-rate and block length limitations of conventional polar codes. The proposed IRP algorithm only select puncturing bits from the frozen bits set and keep the information bits unchanged during puncturing, which can improve 0.2–1 dB decoding performance more than the existing random puncturing (RP algorithm. Then, we develop a RC IR-HARQ transmission scheme based on PCPP codes. By analyzing the overhead of the previous successful decoded PCPP coding block in our IR-HARQ scheme, the optimal initial code-rate can be determined for each new PCPP coding block over time-varying channels. Simulation results show that the average number of transmissions is about 1.8 times for each PCPP coding block in our RC IR-HARQ scheme with a 2-level PCPP encoding construction, which can reduce half of the average number of transmissions than the existing RC polar coding schemes.
International Nuclear Information System (INIS)
1997-01-01
Research on plane-parallel ionization chambers since the IAEA code of practice (TRS-277) was published in 1987 has explained our knowledge on perturbation and other correction factors in ionization chamber, and also constructional details of these chambers have been shown to be important. Different countries have published, or are in the process of publishing, dosimetry recommendations which include specific procedures for the use of plan parallel ionization chambers. An international working group was formed under the auspieces of the IAEA, first to review the status and the actual validity of the code of practice and second to develop an international code of practice of the use of plane parallel ionization chambers in high energy electron and photon beams used in radiotherapy. This document fulfills the second taste. 153 refs, 21 figs, 18 tabs
International Nuclear Information System (INIS)
McGhee, J.M.; Roberts, R.M.; Morel, J.E.
1997-01-01
A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner for scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated
Monte Carlo simulations of quantum systems on massively parallel supercomputers
International Nuclear Information System (INIS)
Ding, H.Q.
1993-01-01
A large class of quantum physics applications uses operator representations that are discrete integers by nature. This class includes magnetic properties of solids, interacting bosons modeling superfluids and Cooper pairs in superconductors, and Hubbard models for strongly correlated electrons systems. This kind of application typically uses integer data representations and the resulting algorithms are dominated entirely by integer operations. The authors implemented an efficient algorithm for one such application on the Intel Touchstone Delta and iPSC/860. The algorithm uses a multispin coding technique which allows significant data compactification and efficient vectorization of Monte Carlo updates. The algorithm regularly switches between two data decompositions, corresponding naturally to different Monte Carlo updating processes and observable measurements such that only nearest-neighbor communications are needed within a given decomposition. On 128 nodes of Intel Delta, this algorithm updates 183 million spins per second (compared to 21 million on CM-2 and 6.2 million on a Cray Y-MP). A systematic performance analysis shows a better than 90% efficiency in the parallel implementation
MC++: A parallel, portable, Monte Carlo neutron transport code in C++
International Nuclear Information System (INIS)
Lee, S.R.; Cummings, J.C.; Nolen, S.D.
1997-01-01
MC++ is an implicit multi-group Monte Carlo neutron transport code written in C++ and based on the Parallel Object-Oriented Methods and Applications (POOMA) class library. MC++ runs in parallel on and is portable to a wide variety of platforms, including MPPs, SMPs, and clusters of UNIX workstations. MC++ is being developed to provide transport capabilities to the Accelerated Strategic Computing Initiative (ASCI). It is also intended to form the basis of the first transport physics framework (TPF), which is a C++ class library containing appropriate abstractions, objects, and methods for the particle transport problem. The transport problem is briefly described, as well as the current status and algorithms in MC++ for solving the transport equation. The alpha version of the POOMA class library is also discussed, along with the implementation of the transport solution algorithms using POOMA. Finally, a simple test problem is defined and performance and physics results from this problem are discussed on a variety of platforms
Compiler Technology for Parallel Scientific Computation
Directory of Open Access Journals (Sweden)
Can Özturan
1994-01-01
Full Text Available There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL. Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.
Massively parallel computing and the search for jets and black holes at the LHC
Energy Technology Data Exchange (ETDEWEB)
Halyo, V., E-mail: vhalyo@gmail.com; LeGresley, P.; Lujan, P.
2014-04-21
Massively parallel computing at the LHC could be the next leap necessary to reach an era of new discoveries at the LHC after the Higgs discovery. Scientific computing is a critical component of the LHC experiment, including operation, trigger, LHC computing GRID, simulation, and analysis. One way to improve the physics reach of the LHC is to take advantage of the flexibility of the trigger system by integrating coprocessors based on Graphics Processing Units (GPUs) or the Many Integrated Core (MIC) architecture into its server farm. This cutting edge technology provides not only the means to accelerate existing algorithms, but also the opportunity to develop new algorithms that select events in the trigger that previously would have evaded detection. In this paper we describe new algorithms that would allow us to select in the trigger new topological signatures that include non-prompt jet and black hole-like objects in the silicon tracker.
Coding for Parallel Links to Maximize the Expected Value of Decodable Messages
Klimesh, Matthew A.; Chang, Christopher S.
2011-01-01
When multiple parallel communication links are available, it is useful to consider link-utilization strategies that provide tradeoffs between reliability and throughput. Interesting cases arise when there are three or more available links. Under the model considered, the links have known probabilities of being in working order, and each link has a known capacity. The sender has a number of messages to send to the receiver. Each message has a size and a value (i.e., a worth or priority). Messages may be divided into pieces arbitrarily, and the value of each piece is proportional to its size. The goal is to choose combinations of messages to send on the links so that the expected value of the messages decodable by the receiver is maximized. There are three parts to the innovation: (1) Applying coding to parallel links under the model; (2) Linear programming formulation for finding the optimal combinations of messages to send on the links; and (3) Algorithms for assisting in finding feasible combinations of messages, as support for the linear programming formulation. There are similarities between this innovation and methods developed in the field of network coding. However, network coding has generally been concerned with either maximizing throughput in a fixed network, or robust communication of a fixed volume of data. In contrast, under this model, the throughput is expected to vary depending on the state of the network. Examples of error-correcting codes that are useful under this model but which are not needed under previous models have been found. This model can represent either a one-shot communication attempt, or a stream of communications. Under the one-shot model, message sizes and link capacities are quantities of information (e.g., measured in bits), while under the communications stream model, message sizes and link capacities are information rates (e.g., measured in bits/second). This work has the potential to increase the value of data returned from
Programming a real code in a functional language (part 1)
Energy Technology Data Exchange (ETDEWEB)
Hendrickson, C.P.
1991-09-10
For some, functional languages hold the promise of allowing ease of programming massively parallel computers that imperative languages such as Fortran and C do not offer. At LLNL, we have initiated a project to write the physics of a major production code in Sisal, a functional language developed at LLNL in collaboration with researchers throughout the world. We are investigating the expressibility of Sisal, as well as its performance on a shared-memory multiprocessor, the Y-MP. An interesting aspect of the project is that Sisal modules can call Fortran modules, and are callable by them. This eliminates the rewriting of 80% of the production code that would not benefit from parallel execution. Preliminary results indicate that the restrictive nature of the language does not cause problems in expressing the algorithms we have chosen. Some interesting aspects of programming in a mixed functional-imperative environment have surfaced, but can be managed. 8 refs.
A task parallel implementation of fast multipole methods
Taura, Kenjiro; Nakashima, Jun; Yokota, Rio; Maruyama, Naoya
2012-01-01
This paper describes a task parallel implementation of ExaFMM, an open source implementation of fast multipole methods (FMM), using a lightweight task parallel library MassiveThreads. Although there have been many attempts on parallelizing FMM
Vectorization of KENO IV code and an estimate of vector-parallel processing
International Nuclear Information System (INIS)
Asai, Kiyoshi; Higuchi, Kenji; Katakura, Jun-ichi; Kurita, Yutaka.
1986-10-01
The multi-group criticality safety code KENO IV has been vectorized and tested on FACOM VP-100 vector processor. At first the vectorized KENO IV on a scalar processor became slower than the original one by a factor of 1.4 because of the overhead introduced by the vectorization. Making modifications of algorithms and techniques for vectorization, the vectorized version has become faster than the original one by a factor of 1.4 and 3.0 on the vector processor for sample problems of complex and simple geometries, respectively. For further speedup of the code, some improvements on compiler and hardware, especially on addition of Monte Carlo pipelines to the vector processor, are discussed. Finally a pipelined parallel processor system is proposed and its performance is estimated. (author)
Rivier, Leonard Gilles
Using an efficient parallel code solving the primitive equations of atmospheric dynamics, the jet structure of a Jupiter like atmosphere is modeled. In the first part of this thesis, a parallel spectral code solving both the shallow water equations and the multi-level primitive equations of atmospheric dynamics is built. The implementation of this code called BOB is done so that it runs effectively on an inexpensive cluster of workstations. A one dimensional decomposition and transposition method insuring load balancing among processes is used. The Legendre transform is cache-blocked. A "compute on the fly" of the Legendre polynomials used in the spectral method produces a lower memory footprint and enables high resolution runs on relatively small memory machines. Performance studies are done using a cluster of workstations located at the National Center for Atmospheric Research (NCAR). BOB performances are compared to the parallel benchmark code PSTSWM and the dynamical core of NCAR's CCM3.6.6. In both cases, the comparison favors BOB. In the second part of this thesis, the primitive equation version of the code described in part I is used to study the formation of organized zonal jets and equatorial superrotation in a planetary atmosphere where the parameters are chosen to best model the upper atmosphere of Jupiter. Two levels are used in the vertical and only large scale forcing is present. The model is forced towards a baroclinically unstable flow, so that eddies are generated by baroclinic instability. We consider several types of forcing, acting on either the temperature or the momentum field. We show that only under very specific parametric conditions, zonally elongated structures form and persist resembling the jet structure observed near the cloud level top (1 bar) on Jupiter. We also study the effect of an equatorial heat source, meant to be a crude representation of the effect of the deep convective planetary interior onto the outer atmospheric layer. We
Marx, Alain; Lütjens, Hinrich
2017-03-01
A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows to perform simulations with higher resolutions, previously forbidden because of memory limitations.
Directory of Open Access Journals (Sweden)
Mihai-Victor PRICOP
2010-09-01
Full Text Available The present paper introduces a numerical approach of static linear elasticity equations for anisotropic materials. The domain and boundary conditions are simple, to enhance an easy implementation of the finite difference scheme. SOR and gradient are used to solve the resulting linear system. The simplicity of the geometry is also useful for MPI parallelization of the code.
3D streamers simulation in a pin to plane configuration using massively parallel computing
Plewa, J.-M.; Eichwald, O.; Ducasse, O.; Dessante, P.; Jacobs, C.; Renon, N.; Yousfi, M.
2018-03-01
This paper concerns the 3D simulation of corona discharge using high performance computing (HPC) managed with the message passing interface (MPI) library. In the field of finite volume methods applied on non-adaptive mesh grids and in the case of a specific 3D dynamic benchmark test devoted to streamer studies, the great efficiency of the iterative R&B SOR and BiCGSTAB methods versus the direct MUMPS method was clearly demonstrated in solving the Poisson equation using HPC resources. The optimization of the parallelization and the resulting scalability was undertaken as a function of the HPC architecture for a number of mesh cells ranging from 8 to 512 million and a number of cores ranging from 20 to 1600. The R&B SOR method remains at least about four times faster than the BiCGSTAB method and requires significantly less memory for all tested situations. The R&B SOR method was then implemented in a 3D MPI parallelized code that solves the classical first order model of an atmospheric pressure corona discharge in air. The 3D code capabilities were tested by following the development of one, two and four coplanar streamers generated by initial plasma spots for 6 ns. The preliminary results obtained allowed us to follow in detail the formation of the tree structure of a corona discharge and the effects of the mutual interactions between the streamers in terms of streamer velocity, trajectory and diameter. The computing time for 64 million of mesh cells distributed over 1000 cores using the MPI procedures is about 30 min ns-1, regardless of the number of streamers.
Cardenas, Erick; Wu, Wei-Min; Leigh, Mary Beth; Carley, Jack; Carroll, Sue; Gentry, Terry; Luo, Jian; Watson, David; Gu, Baohua; Ginder-Vogel, Matthew; Kitanidis, Peter K.; Jardine, Philip M.; Zhou, Jizhong; Criddle, Craig S.; Marsh, Terence L.
2010-01-01
Massively parallel sequencing has provided a more affordable and high-throughput method to study microbial communities, although it has mostly been used in an exploratory fashion. We combined pyrosequencing with a strict indicator species statistical analysis to test if bacteria specifically responded to ethanol injection that successfully promoted dissimilatory uranium(VI) reduction in the subsurface of a uranium contamination plume at the Oak Ridge Field Research Center in Tennessee. Remedi...
Massively parallel diffuse optical tomography
Energy Technology Data Exchange (ETDEWEB)
Sandusky, John V.; Pitts, Todd A.
2017-09-05
Diffuse optical tomography systems and methods are described herein. In a general embodiment, the diffuse optical tomography system comprises a plurality of sensor heads, the plurality of sensor heads comprising respective optical emitter systems and respective sensor systems. A sensor head in the plurality of sensors heads is caused to act as an illuminator, such that its optical emitter system transmits a transillumination beam towards a portion of a sample. Other sensor heads in the plurality of sensor heads act as observers, detecting portions of the transillumination beam that radiate from the sample in the fields of view of the respective sensory systems of the other sensor heads. Thus, sensor heads in the plurality of sensors heads generate sensor data in parallel.
Recent progress in 3D EM/EM-PIC simulation with ARGUS and parallel ARGUS
International Nuclear Information System (INIS)
Mankofsky, A.; Petillo, J.; Krueger, W.; Mondelli, A.; McNamara, B.; Philp, R.
1994-01-01
ARGUS is an integrated, 3-D, volumetric simulation model for systems involving electric and magnetic fields and charged particles, including materials embedded in the simulation region. The code offers the capability to carry out time domain and frequency domain electromagnetic simulations of complex physical systems. ARGUS offers a boolean solid model structure input capability that can include essentially arbitrary structures on the computational domain, and a modular architecture that allows multiple physics packages to access the same data structure and to share common code utilities. Physics modules are in place to compute electrostatic and electromagnetic fields, the normal modes of RF structures, and self-consistent particle-in-cell (PIC) simulation in either a time dependent mode or a steady state mode. The PIC modules include multiple particle species, the Lorentz equations of motion, and algorithms for the creation of particles by emission from material surfaces, injection onto the grid, and ionization. In this paper, we present an updated overview of ARGUS, with particular emphasis given in recent algorithmic and computational advances. These include a completely rewritten frequency domain solver which efficiently treats lossy materials and periodic structures, a parallel version of ARGUS with support for both shared memory parallel vector (i.e. CRAY) machines and distributed memory massively parallel MIMD systems, and numerous new applications of the code
Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.
2017-12-01
As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
International Nuclear Information System (INIS)
Bieder, U.; Calvin, C.
2003-01-01
This paper reports the preliminary studies carried out with the CFD (computational fluid dynamics) code Trio U to study the natural gas circulation that may flow in the primary circuit of a pressurized water reactor during a high-pressure severe accident scenario. Two types of 3-dimensional simulations have been performed on one loop using a LES (large eddy simulations) approach. In the first type of calculations, the gas flow in the hot leg has been investigated with a simplified representation of the reactor vessel and the Steam Generator (SG) tubes. Structured and unstructured meshing have been tested on the full-scale geometry with and without radiative heat transfer modelling between walls and gas. The second type of calculations deals with the gas circulation in the SG. The first results show a good agreement with the available experimental data and provide some confidence in the Trio U code to simulate complex natural flows. (authors)
Fortran code for SU(3) lattice gauge theory with and without MPI checkerboard parallelization
Berg, Bernd A.; Wu, Hao
2012-10-01
We document plain Fortran and Fortran MPI checkerboard code for Markov chain Monte Carlo simulations of pure SU(3) lattice gauge theory with the Wilson action in D dimensions. The Fortran code uses periodic boundary conditions and is suitable for pedagogical purposes and small scale simulations. For the Fortran MPI code two geometries are covered: the usual torus with periodic boundary conditions and the double-layered torus as defined in the paper. Parallel computing is performed on checkerboards of sublattices, which partition the full lattice in one, two, and so on, up to D directions (depending on the parameters set). For updating, the Cabibbo-Marinari heatbath algorithm is used. We present validations and test runs of the code. Performance is reported for a number of currently used Fortran compilers and, when applicable, MPI versions. For the parallelized code, performance is studied as a function of the number of processors. Program summary Program title: STMC2LSU3MPI Catalogue identifier: AEMJ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMJ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 26666 No. of bytes in distributed program, including test data, etc.: 233126 Distribution format: tar.gz Programming language: Fortran 77 compatible with the use of Fortran 90/95 compilers, in part with MPI extensions. Computer: Any capable of compiling and executing Fortran 77 or Fortran 90/95, when needed with MPI extensions. Operating system: Red Hat Enterprise Linux Server 6.1 with OpenMPI + pgf77 11.8-0, Centos 5.3 with OpenMPI + gfortran 4.1.2, Cray XT4 with MPICH2 + pgf90 11.2-0. Has the code been vectorised or parallelized?: Yes, parallelized using MPI extensions. Number of processors used: 2 to 11664 RAM: 200 Mega bytes per process. Classification: 11
A Parallel Decoding Algorithm for Short Polar Codes Based on Error Checking and Correcting
Pan, Xiaofei; Pan, Kegang; Ye, Zhan; Gong, Chao
2014-01-01
We propose a parallel decoding algorithm based on error checking and correcting to improve the performance of the short polar codes. In order to enhance the error-correcting capacity of the decoding algorithm, we first derive the error-checking equations generated on the basis of the frozen nodes, and then we introduce the method to check the errors in the input nodes of the decoder by the solutions of these equations. In order to further correct those checked errors, we adopt the method of modifying the probability messages of the error nodes with constant values according to the maximization principle. Due to the existence of multiple solutions of the error-checking equations, we formulate a CRC-aided optimization problem of finding the optimal solution with three different target functions, so as to improve the accuracy of error checking. Besides, in order to increase the throughput of decoding, we use a parallel method based on the decoding tree to calculate probability messages of all the nodes in the decoder. Numerical results show that the proposed decoding algorithm achieves better performance than that of some existing decoding algorithms with the same code length. PMID:25540813
Energy Technology Data Exchange (ETDEWEB)
Fahey, Mark R. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Candy, Jeff [General Atomics, San Diego, CA (United States)
2013-11-07
This project initiated the development of TGYRO - a steady-state Gyrokinetic transport code (SSGKT) that integrates micro-scale GYRO turbulence simulations into a framework for practical multi-scale simulation of conventional tokamaks as well as future reactors. Using a lightweight master transport code, multiple independent (each massively parallel) gyrokinetic simulations are coordinated. The capability to evolve profiles using the TGLF model was also added to TGYRO and represents a more typical use-case for TGYRO. The goal of the project was to develop a steady-state Gyrokinetic transport code (SSGKT) that integrates micro-scale gyrokinetic turbulence simulations into a framework for practical multi-scale simulation of a burning plasma core ? the International Thermonuclear Experimental Reactor (ITER) in particular. This multi-scale simulation capability will be used to predict the performance (the fusion energy gain, Q) given the H-mode pedestal temperature and density. At present, projections of this type rely on transport models like GLF23, which are based on rather approximate fits to the results of linear and nonlinear simulations. Our goal is to make these performance projections with precise nonlinear gyrokinetic simulations. The method of approach is to use a lightweight master transport code to coordinate multiple independent (each massively parallel) gyrokinetic simulations using the GYRO code. This project targets the practical multi-scale simulation of a reactor core plasma in order to predict the core temperature and density profiles given the H-mode pedestal temperature and density. A master transport code will provide feedback to O(16) independent gyrokinetic simulations (each massively parallel). A successful feedback scheme offers a novel approach to predictive modeling of an important national and international problem. Success in this area of fusion simulations will allow US scientists to direct the research path of ITER over the next two
Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model
Hamam, Alwaleed A.
2017-03-13
Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it\\'s time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.
Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model
Hamam, Alwaleed A.; Khan, Ayaz H.
2017-01-01
Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it's time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.
Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton
2014-11-11
We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonification, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
Reactor Dosimetry Applications Using RAPTOR-M3G:. a New Parallel 3-D Radiation Transport Code
Longoni, Gianluca; Anderson, Stanwood L.
2009-08-01
The numerical solution of the Linearized Boltzmann Equation (LBE) via the Discrete Ordinates method (SN) requires extensive computational resources for large 3-D neutron and gamma transport applications due to the concurrent discretization of the angular, spatial, and energy domains. This paper will discuss the development RAPTOR-M3G (RApid Parallel Transport Of Radiation - Multiple 3D Geometries), a new 3-D parallel radiation transport code, and its application to the calculation of ex-vessel neutron dosimetry responses in the cavity of a commercial 2-loop Pressurized Water Reactor (PWR). RAPTOR-M3G is based domain decomposition algorithms, where the spatial and angular domains are allocated and processed on multi-processor computer architectures. As compared to traditional single-processor applications, this approach reduces the computational load as well as the memory requirement per processor, yielding an efficient solution methodology for large 3-D problems. Measured neutron dosimetry responses in the reactor cavity air gap will be compared to the RAPTOR-M3G predictions. This paper is organized as follows: Section 1 discusses the RAPTOR-M3G methodology; Section 2 describes the 2-loop PWR model and the numerical results obtained. Section 3 addresses the parallel performance of the code, and Section 4 concludes this paper with final remarks and future work.
A parallel 3D particle-in-cell code with dynamic load balancing
International Nuclear Information System (INIS)
Wolfheimer, Felix; Gjonaj, Erion; Weiland, Thomas
2006-01-01
A parallel 3D electrostatic Particle-In-Cell (PIC) code including an algorithm for modelling Space Charge Limited (SCL) emission [E. Gjonaj, T. Weiland, 3D-modeling of space-charge-limited electron emission. A charge conserving algorithm, Proceedings of the 11th Biennial IEEE Conference on Electromagnetic Field Computation, 2004] is presented. A domain decomposition technique based on orthogonal recursive bisection is used to parallelize the computation on a distributed memory environment of clustered workstations. For problems with a highly nonuniform and time dependent distribution of particles, e.g., bunch dynamics, a dynamic load balancing between the processes is needed to preserve the parallel performance. The algorithm for the detection of a load imbalance and the redistribution of the tasks among the processes is based on a weight function criterion, where the weight of a cell measures the computational load associated with it. The algorithm is studied with two examples. In the first example, multiple electron bunches as occurring in the S-DALINAC [A. Richter, Operational experience at the S-DALINAC, Proceedings of the Fifth European Particle Accelerator Conference, 1996] accelerator are simulated in the absence of space charge fields. In the second example, the SCL emission and electron trajectories in an electron gun are simulated
A parallel 3D particle-in-cell code with dynamic load balancing
Energy Technology Data Exchange (ETDEWEB)
Wolfheimer, Felix [Technische Universitaet Darmstadt, Institut fuer Theorie Elektromagnetischer Felder, Schlossgartenstr.8, 64283 Darmstadt (Germany)]. E-mail: wolfheimer@temf.de; Gjonaj, Erion [Technische Universitaet Darmstadt, Institut fuer Theorie Elektromagnetischer Felder, Schlossgartenstr.8, 64283 Darmstadt (Germany); Weiland, Thomas [Technische Universitaet Darmstadt, Institut fuer Theorie Elektromagnetischer Felder, Schlossgartenstr.8, 64283 Darmstadt (Germany)
2006-03-01
A parallel 3D electrostatic Particle-In-Cell (PIC) code including an algorithm for modelling Space Charge Limited (SCL) emission [E. Gjonaj, T. Weiland, 3D-modeling of space-charge-limited electron emission. A charge conserving algorithm, Proceedings of the 11th Biennial IEEE Conference on Electromagnetic Field Computation, 2004] is presented. A domain decomposition technique based on orthogonal recursive bisection is used to parallelize the computation on a distributed memory environment of clustered workstations. For problems with a highly nonuniform and time dependent distribution of particles, e.g., bunch dynamics, a dynamic load balancing between the processes is needed to preserve the parallel performance. The algorithm for the detection of a load imbalance and the redistribution of the tasks among the processes is based on a weight function criterion, where the weight of a cell measures the computational load associated with it. The algorithm is studied with two examples. In the first example, multiple electron bunches as occurring in the S-DALINAC [A. Richter, Operational experience at the S-DALINAC, Proceedings of the Fifth European Particle Accelerator Conference, 1996] accelerator are simulated in the absence of space charge fields. In the second example, the SCL emission and electron trajectories in an electron gun are simulated.
Grech, Mickael; Derouillat, J.; Beck, A.; Chiaramello, M.; Grassi, A.; Niel, F.; Perez, F.; Vinci, T.; Fle, M.; Aunai, N.; Dargent, J.; Plotnikov, I.; Bouchard, G.; Savoini, P.; Riconda, C.
2016-10-01
Over the last decades, Particle-In-Cell (PIC) codes have been central tools for plasma simulations. Today, new trends in High-Performance Computing (HPC) are emerging, dramatically changing HPC-relevant software design and putting some - if not most - legacy codes far beyond the level of performance expected on the new and future massively-parallel super computers. SMILEI is a new open-source PIC code co-developed by both plasma physicists and HPC specialists, and applied to a wide range of physics-related studies: from laser-plasma interaction to astrophysical plasmas. It benefits from an innovative parallelization strategy that relies on a super-domain-decomposition allowing for enhanced cache-use and efficient dynamic load balancing. Beyond these HPC-related developments, SMILEI also benefits from additional physics modules allowing to deal with binary collisions, field and collisional ionization and radiation back-reaction. This poster presents the SMILEI project, its HPC capabilities and illustrates some of the physics problems tackled with SMILEI.
Los Alamos neutral particle transport codes: New and enhanced capabilities
International Nuclear Information System (INIS)
Alcouffe, R.E.; Baker, R.S.; Brinkley, F.W.; Clark, B.A.; Koch, K.R.; Marr, D.R.
1992-01-01
We present new developments in Los Alamos discrete-ordinates transport codes and introduce THREEDANT, the latest in the series of Los Alamos discrete ordinates transport codes. THREEDANT solves the multigroup, neutral-particle transport equation in X-Y-Z and R-Θ-Z geometries. THREEDANT uses computationally efficient algorithms: Diffusion Synthetic Acceleration (DSA) is used to accelerate the convergence of transport iterations, the DSA solution is accelerated using the multigrid technique. THREEDANT runs on a wide range of computers, from scientific workstations to CRAY supercomputers. The algorithms are highly vectorized on CRAY computers. Recently, the THREEDANT transport algorithm was implemented on the massively parallel CM-2 computer, with performance that is comparable to a single-processor CRAY-YMP We present the results of THREEDANT analysis of test problems
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Massively parallel Monte Carlo. Experiences running nuclear simulations on a large condor cluster
International Nuclear Information System (INIS)
Tickner, James; O'Dwyer, Joel; Roach, Greg; Uher, Josef; Hitchen, Greg
2010-01-01
The trivially-parallel nature of Monte Carlo (MC) simulations make them ideally suited for running on a distributed, heterogeneous computing environment. We report on the setup and operation of a large, cycle-harvesting Condor computer cluster, used to run MC simulations of nuclear instruments ('jobs') on approximately 4,500 desktop PCs. Successful operation must balance the competing goals of maximizing the availability of machines for running jobs whilst minimizing the impact on users' PC performance. This requires classification of jobs according to anticipated run-time and priority and careful optimization of the parameters used to control job allocation to host machines. To maximize use of a large Condor cluster, we have created a powerful suite of tools to handle job submission and analysis, as the manual creation, submission and evaluation of large numbers (hundred to thousands) of jobs would be too arduous. We describe some of the key aspects of this suite, which has been interfaced to the well-known MCNP and EGSnrc nuclear codes and our in-house PHOTON optical MC code. We report on our practical experiences of operating our Condor cluster and present examples of several large-scale instrument design problems that have been solved using this tool. (author)
Farris, M Heath; Scott, Andrew R; Texter, Pamela A; Bartlett, Marta; Coleman, Patricia; Masters, David
2018-04-11
Single nucleotide polymorphisms (SNPs) located within the human genome have been shown to have utility as markers of identity in the differentiation of DNA from individual contributors. Massively parallel DNA sequencing (MPS) technologies and human genome SNP databases allow for the design of suites of identity-linked target regions, amenable to sequencing in a multiplexed and massively parallel manner. Therefore, tools are needed for leveraging the genotypic information found within SNP databases for the discovery of genomic targets that can be evaluated on MPS platforms. The SNP island target identification algorithm (TIA) was developed as a user-tunable system to leverage SNP information within databases. Using data within the 1000 Genomes Project SNP database, human genome regions were identified that contain globally ubiquitous identity-linked SNPs and that were responsive to targeted resequencing on MPS platforms. Algorithmic filters were used to exclude target regions that did not conform to user-tunable SNP island target characteristics. To validate the accuracy of TIA for discovering these identity-linked SNP islands within the human genome, SNP island target regions were amplified from 70 contributor genomic DNA samples using the polymerase chain reaction. Multiplexed amplicons were sequenced using the Illumina MiSeq platform, and the resulting sequences were analyzed for SNP variations. 166 putative identity-linked SNPs were targeted in the identified genomic regions. Of the 309 SNPs that provided discerning power across individual SNP profiles, 74 previously undefined SNPs were identified during evaluation of targets from individual genomes. Overall, DNA samples of 70 individuals were uniquely identified using a subset of the suite of identity-linked SNP islands. TIA offers a tunable genome search tool for the discovery of targeted genomic regions that are scalable in the population frequency and numbers of SNPs contained within the SNP island regions
SEJITS: embedded specializers to turn patterns-based designs into optimized parallel code
CERN. Geneva
2012-01-01
All software should be parallel software. This is natural result of the transition to a many core world. For a small fraction of the world's programmers (efficiency programmers), this is not a problem. They enjoy mapping algorithms onto the details of a particular system and are well served by low level languages and OpenMP, MPI, or OpenCL. Most programmers, however, are "domain specialists" who write code. They are too busy working in their domain of choice (such as physics) to master the intricacies of each computer they use. How do we make these programmers productive without giving up performance? We have been working with a team at UC Berkeley's ParLab to address this problem. The key is a clear software architecture expressed in terms of design patterns that exposes the concurrency in a problem. The resulting code is written using a patterns-based framework within a high level, productivity language (such as Python). Then a separate system is used by a small group o...
Massive-Star Magnetospheres: Now in 3-D!
Townsend, Richard
Magnetic fields are unexpected in massive stars, due to the absence of a dynamo convection zone beneath their surface layers. Nevertheless, kilogauss-strength, ordered fields were detected in a small subset of these stars over three decades ago, and the intervening years have witnessed the steady expansion of this subset. A distinctive feature of magnetic massive stars is that they harbor magnetospheres --- circumstellar environments where the magnetic field interacts strongly with the star's radiation-driven wind, confining it and channelling it into energetic shocks. A wide range of observational signatures are associated with these magnetospheres, in diagnostics ranging from X-rays all the way through to radio emission. Moreover, these magnetospheres can play an important role in massive-star evolution, by amplifying angular momentum loss in the wind. Recent progress in understanding massive-star magnetospheres has largely been driven by magnetohydrodynamical (MHD) simulations. However, these have been restricted to two- dimensional axisymmetric configurations, with three-dimensional configurations possible only in certain special cases. These restrictions are limiting further progress; we therefore propose to develop completely general three-dimensional models for the magnetospheres of massive stars, on the one hand to understand their observational properties and exploit them as plasma-physics laboratories, and on the other to gain a comprehensive understanding of how they influence the evolution of their host star. For weak- and intermediate-field stars, the models will be based on 3-D MHD simulations using a modified version of the ZEUS-MP code. For strong-field stars, we will extend our existing Rigid Field Hydrodynamics (RFHD) code to handle completely arbitrary field topologies. To explore a putative 'photoionization-moderated mass loss' mechanism for massive-star magnetospheres, we will also further develop a photoionization code we have recently
Zerr, Robert Joseph
2011-12-01
The integral transport matrix method (ITMM) has been used as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells and between the cells and boundary surfaces. The main goals of this work were to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and performance of the developed methods for increasing number of processes. This project compares the effectiveness of the ITMM with the SI scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. The primary parallel solution method involves a decomposition of the domain into smaller spatial sub-domains, each with their own transport matrices, and coupled together via interface boundary angular fluxes. Each sub-domain has its own set of ITMM operators and represents an independent transport problem. Multiple iterative parallel solution methods have investigated, including parallel block Jacobi (PBJ), parallel red/black Gauss-Seidel (PGS), and parallel GMRES (PGMRES). The fastest observed parallel solution method, PGS, was used in a weak scaling comparison with the PARTISN code. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method without acceleration/preconditioning is not competitive for any problem parameters considered. The best comparisons occur for problems that are difficult for SI DSA, namely highly scattering and optically thick. SI DSA execution time curves are generally steeper than the PGS ones. However, until further testing is performed it cannot be concluded that SI DSA does not outperform the ITMM with PGS even on several thousand or tens of
International Nuclear Information System (INIS)
Woodruff, S.B.
1992-01-01
The Transient Reactor Analysis Code (TRAC), which features a two- fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers algorithms, such as a neural net representation, that do not exhibit algorithms, such as a neural net representation, that do not exhibit load-balancing problems
Sakr, Rita A; Schizas, Michail; Carniello, Jose V Scarpa; Ng, Charlotte K Y; Piscuoglio, Salvatore; Giri, Dilip; Andrade, Victor P; De Brot, Marina; Lim, Raymond S; Towers, Russell; Weigelt, Britta; Reis-Filho, Jorge S; King, Tari A
2016-02-01
Lobular carcinoma in situ (LCIS) has been proposed as a non-obligate precursor of invasive lobular carcinoma (ILC). Here we sought to define the repertoire of somatic genetic alterations in pure LCIS and in synchronous LCIS and ILC using targeted massively parallel sequencing. DNA samples extracted from microdissected LCIS, ILC and matched normal breast tissue or peripheral blood from 30 patients were subjected to massively parallel sequencing targeting all exons of 273 genes, including the genes most frequently mutated in breast cancer and DNA repair-related genes. Single nucleotide variants and insertions and deletions were identified using state-of-the-art bioinformatics approaches. The constellation of somatic mutations found in LCIS (n = 34) and ILC (n = 21) were similar, with the most frequently mutated genes being CDH1 (56% and 66%, respectively), PIK3CA (41% and 52%, respectively) and CBFB (12% and 19%, respectively). Among 19 LCIS and ILC synchronous pairs, 14 (74%) had at least one identical mutation in common, including identical PIK3CA and CDH1 mutations. Paired analysis of independent foci of LCIS from 3 breasts revealed at least one common mutation in each of the 3 pairs (CDH1, PIK3CA, CBFB and PKHD1L1). LCIS and ILC have a similar repertoire of somatic mutations, with PIK3CA and CDH1 being the most frequently mutated genes. The presence of identical mutations between LCIS-LCIS and LCIS-ILC pairs demonstrates that LCIS is a clonal neoplastic lesion, and provides additional evidence that at least some LCIS are non-obligate precursors of ILC. Copyright © 2015 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved.
A massive parallel sequencing workflow for diagnostic genetic testing of mismatch repair genes
Hansen, Maren F; Neckmann, Ulrike; Lavik, Liss A S; Vold, Trine; Gilde, Bodil; Toft, Ragnhild K; Sjursen, Wenche
2014-01-01
The purpose of this study was to develop a massive parallel sequencing (MPS) workflow for diagnostic analysis of mismatch repair (MMR) genes using the GS Junior system (Roche). A pathogenic variant in one of four MMR genes, (MLH1, PMS2, MSH6, and MSH2), is the cause of Lynch Syndrome (LS), which mainly predispose to colorectal cancer. We used an amplicon-based sequencing method allowing specific and preferential amplification of the MMR genes including PMS2, of which several pseudogenes exist. The amplicons were pooled at different ratios to obtain coverage uniformity and maximize the throughput of a single-GS Junior run. In total, 60 previously identified and distinct variants (substitutions and indels), were sequenced by MPS and successfully detected. The heterozygote detection range was from 19% to 63% and dependent on sequence context and coverage. We were able to distinguish between false-positive and true-positive calls in homopolymeric regions by cross-sample comparison and evaluation of flow signal distributions. In addition, we filtered variants according to a predefined status, which facilitated variant annotation. Our study shows that implementation of MPS in routine diagnostics of LS can accelerate sample throughput and reduce costs without compromising sensitivity, compared to Sanger sequencing. PMID:24689082
3D, parallel fluid-structure interaction code
CSIR Research Space (South Africa)
Oxtoby, Oliver F
2011-01-01
Full Text Available The authors describe the development of a 3D parallel Fluid–Structure–Interaction (FSI) solver and its application to benchmark problems. Fluid and solid domains are discretised using and edge-based finite-volume scheme for efficient parallel...
Computer-Aided Parallelizer and Optimizer
Jin, Haoqiang
2011-01-01
The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
NEW APPROACHES TO EFFICIENCY OF MASSIVE ONLINE COURSE
Directory of Open Access Journals (Sweden)
Liubov S. Lysitsina
2014-09-01
Full Text Available This paper is focused on efficiency of e-learning, in general, and massive online course in programming and information technology, in particular. Several innovative approaches and scenarios have been proposed, developed, implemented and verified by the authors, including 1 a new approach to organize and use automatic immediate feedback that significantly helps a learner to verify developed code and increases an efficiency of learning, 2 a new approach to construct learning interfaces – it is based on “develop a code – get a result – validate a code” technique, 3 three scenarios of visualization and verification of developed code, 4 a new multi-stage approach to solve complex programming assignments, 5 a new implementation of “perfectionism” game mechanics in a massive online course. Overall, due to implementation of proposed and developed approaches, the efficiency of massive online course has been considerably increased, particularly 1 the additional 27.9 % of students were able to complete successfully “Web design and development using HTML5 and CSS3” massive online course at ITMO University, and 2 based on feedback from 5588 students a “perfectionism” game mechanics noticeably improves students’ involvement into course activities and retention factor.
Automatic Loop Parallelization via Compiler Guided Refactoring
DEFF Research Database (Denmark)
Larsen, Per; Ladelsky, Razya; Lidman, Jacob
For many parallel applications, performance relies not on instruction-level parallelism, but on loop-level parallelism. Unfortunately, many modern applications are written in ways that obstruct automatic loop parallelization. Since we cannot identify sufficient parallelization opportunities...... for these codes in a static, off-line compiler, we developed an interactive compilation feedback system that guides the programmer in iteratively modifying application source, thereby improving the compiler’s ability to generate loop-parallel code. We use this compilation system to modify two sequential...... benchmarks, finding that the code parallelized in this way runs up to 8.3 times faster on an octo-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system. Benchmark performance varies significantly between the systems. This suggests that semi-automatic parallelization should...
Russkova, Tatiana V.
2017-11-01
One tool to improve the performance of Monte Carlo methods for numerical simulation of light transport in the Earth's atmosphere is the parallel technology. A new algorithm oriented to parallel execution on the CUDA-enabled NVIDIA graphics processor is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both a vertically homogeneous and inhomogeneous models of the atmosphere. The results of testing the new code under various atmospheric conditions including continuous singlelayered and multilayered clouds, and selective molecular absorption are presented. The results of testing the code using video cards with different compute capability are analyzed. It is shown that the changeover of computing from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.
PHoToNs–A parallel heterogeneous and threads oriented code for cosmological N-body simulation
Wang, Qiao; Cao, Zong-Yan; Gao, Liang; Chi, Xue-Bin; Meng, Chen; Wang, Jie; Wang, Long
2018-06-01
We introduce a new code for cosmological simulations, PHoToNs, which incorporates features for performing massive cosmological simulations on heterogeneous high performance computer (HPC) systems and threads oriented programming. PHoToNs adopts a hybrid scheme to compute gravitational force, with the conventional Particle-Mesh (PM) algorithm to compute the long-range force, the Tree algorithm to compute the short range force and the direct summation Particle-Particle (PP) algorithm to compute gravity from very close particles. A self-similar space filling a Peano-Hilbert curve is used to decompose the computing domain. Threads programming is advantageously used to more flexibly manage the domain communication, PM calculation and synchronization, as well as Dual Tree Traversal on the CPU+MIC platform. PHoToNs scales well and efficiency of the PP kernel achieves 68.6% of peak performance on MIC and 74.4% on CPU platforms. We also test the accuracy of the code against the much used Gadget-2 in the community and found excellent agreement.
A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC
Directory of Open Access Journals (Sweden)
Yun-gang Xue
2017-01-01
Full Text Available We propose a highly parallel and scalable motion estimation algorithm, named multilevel resolution motion estimation (MLRME for short, by combining the advantages of local full search and downsampling. By subsampling a video frame, a large amount of computation is saved. While using the local full-search method, it can exploit massive parallelism and make full use of the powerful modern many-core accelerators, such as GPU and Intel Xeon Phi. We implanted the proposed MLRME into HM12.0, and the experimental results showed that the encoding quality of the MLRME method is close to that of the fast motion estimation in HEVC, which declines by less than 1.5%. We also implemented the MLRME with CUDA, which obtained 30–60x speed-up compared to the serial algorithm on single CPU. Specifically, the parallel implementation of MLRME on a GTX 460 GPU can meet the real-time coding requirement with about 25 fps for the 2560×1600 video format, while, for 832×480, the performance is more than 100 fps.
Optimisation of a parallel ocean general circulation model
M. I. Beare; D. P. Stevens
1997-01-01
International audience; This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by...
International Nuclear Information System (INIS)
Heggarty, J.W.
1999-06-01
For almost thirty years, sequential R-matrix computation has been used by atomic physics research groups, from around the world, to model collision phenomena involving the scattering of electrons or positrons with atomic or molecular targets. As considerable progress has been made in the understanding of fundamental scattering processes, new data, obtained from more complex calculations, is of current interest to experimentalists. Performing such calculations, however, places considerable demands on the computational resources to be provided by the target machine, in terms of both processor speed and memory requirement. Indeed, in some instances the computational requirements are so great that the proposed R-matrix calculations are intractable, even when utilising contemporary classic supercomputers. Historically, increases in the computational requirements of R-matrix computation were accommodated by porting the problem codes to a more powerful classic supercomputer. Although this approach has been successful in the past, it is no longer considered to be a satisfactory solution due to the limitations of current (and future) Von Neumann machines. As a consequence, there has been considerable interest in the high performance multicomputers, that have emerged over the last decade which appear to offer the computational resources required by contemporary R-matrix research. Unfortunately, developing codes for these machines is not as simple a task as it was to develop codes for successive classic supercomputers. The difficulty arises from the considerable differences in the computing models that exist between the two types of machine and results in the programming of multicomputers to be widely acknowledged as a difficult, time consuming and error-prone task. Nevertheless, unless parallel R-matrix computation is realised, important theoretical and experimental atomic physics research will continue to be hindered. This thesis describes work that was undertaken in
Genotypic tropism testing by massively parallel sequencing: qualitative and quantitative analysis
Directory of Open Access Journals (Sweden)
Thiele Bernhard
2011-05-01
Full Text Available Abstract Background Inferring viral tropism from genotype is a fast and inexpensive alternative to phenotypic testing. While being highly predictive when performed on clonal samples, sensitivity of predicting CXCR4-using (X4 variants drops substantially in clinical isolates. This is mainly attributed to minor variants not detected by standard bulk-sequencing. Massively parallel sequencing (MPS detects single clones thereby being much more sensitive. Using this technology we wanted to improve genotypic prediction of coreceptor usage. Methods Plasma samples from 55 antiretroviral-treated patients tested for coreceptor usage with the Monogram Trofile Assay were sequenced with standard population-based approaches. Fourteen of these samples were selected for further analysis with MPS. Tropism was predicted from each sequence with geno2pheno[coreceptor]. Results Prediction based on bulk-sequencing yielded 59.1% sensitivity and 90.9% specificity compared to the trofile assay. With MPS, 7600 reads were generated on average per isolate. Minorities of sequences with high confidence in CXCR4-usage were found in all samples, irrespective of phenotype. When using the default false-positive-rate of geno2pheno[coreceptor] (10%, and defining a minority cutoff of 5%, the results were concordant in all but one isolate. Conclusions The combination of MPS and coreceptor usage prediction results in a fast and accurate alternative to phenotypic assays. The detection of X4-viruses in all isolates suggests that coreceptor usage as well as fitness of minorities is important for therapy outcome. The high sensitivity of this technology in combination with a quantitative description of the viral population may allow implementing meaningful cutoffs for predicting response to CCR5-antagonists in the presence of X4-minorities.
Genotypic tropism testing by massively parallel sequencing: qualitative and quantitative analysis.
Däumer, Martin; Kaiser, Rolf; Klein, Rolf; Lengauer, Thomas; Thiele, Bernhard; Thielen, Alexander
2011-05-13
Inferring viral tropism from genotype is a fast and inexpensive alternative to phenotypic testing. While being highly predictive when performed on clonal samples, sensitivity of predicting CXCR4-using (X4) variants drops substantially in clinical isolates. This is mainly attributed to minor variants not detected by standard bulk-sequencing. Massively parallel sequencing (MPS) detects single clones thereby being much more sensitive. Using this technology we wanted to improve genotypic prediction of coreceptor usage. Plasma samples from 55 antiretroviral-treated patients tested for coreceptor usage with the Monogram Trofile Assay were sequenced with standard population-based approaches. Fourteen of these samples were selected for further analysis with MPS. Tropism was predicted from each sequence with geno2pheno[coreceptor]. Prediction based on bulk-sequencing yielded 59.1% sensitivity and 90.9% specificity compared to the trofile assay. With MPS, 7600 reads were generated on average per isolate. Minorities of sequences with high confidence in CXCR4-usage were found in all samples, irrespective of phenotype. When using the default false-positive-rate of geno2pheno[coreceptor] (10%), and defining a minority cutoff of 5%, the results were concordant in all but one isolate. The combination of MPS and coreceptor usage prediction results in a fast and accurate alternative to phenotypic assays. The detection of X4-viruses in all isolates suggests that coreceptor usage as well as fitness of minorities is important for therapy outcome. The high sensitivity of this technology in combination with a quantitative description of the viral population may allow implementing meaningful cutoffs for predicting response to CCR5-antagonists in the presence of X4-minorities.
Implementation and performance of parallelized elegant
International Nuclear Information System (INIS)
Wang, Y.; Borland, M.
2008-01-01
The program elegant is widely used for design and modeling of linacs for free-electron lasers and energy recovery linacs, as well as storage rings and other applications. As part of a multi-year effort, we have parallelized many aspects of the code, including single-particle dynamics, wakefields, and coherent synchrotron radiation. We report on the approach used for gradual parallelization, which proved very beneficial in getting parallel features into the hands of users quickly. We also report details of parallelization of collective effects. Finally, we discuss performance of the parallelized code in various applications.
Implementation of GPU parallel equilibrium reconstruction for plasma control in EAST
Energy Technology Data Exchange (ETDEWEB)
Huang, Yao, E-mail: yaohuang@ipp.ac.cn [Institute of Plasma Physics, Chinese Academy of Sciences, Hefei (China); Xiao, B.J. [Institute of Plasma Physics, Chinese Academy of Sciences, Hefei (China); School of Nuclear Science & Technology, University of Science & Technology of China (China); Luo, Z.P.; Yuan, Q.P.; Pei, X.F. [Institute of Plasma Physics, Chinese Academy of Sciences, Hefei (China); Yue, X.N. [School of Nuclear Science & Technology, University of Science & Technology of China (China)
2016-11-15
Highlights: • We described parallel equilibrium reconstruction code P-EFIT running on GPU was integrated with EAST plasma control system. • Compared with RT-EFIT used in EAST, P-EFIT has better spatial resolution and full algorithm of EFIT per iteration. • With the data interface through RFM, 65 × 65 spatial grids P-EFIT can satisfy the accuracy and time feasibility requirements for plasma control. • Successful control using ISOFLUX/P-EFIT was established in the dedicated experiment during the EAST 2014 campaign. • This work is a stepping-stone towards versatile ISOFLUX/P-EFIT control, such as real-time equilibrium reconstruction with more diagnostics. - Abstract: Implementation of P-EFIT code for plasma control in EAST is described. P-EFIT is based on the EFIT framework, but built with the CUDA™ architecture to take advantage of massively parallel Graphical Processing Unit (GPU) cores to significantly accelerate the computation. 65 × 65 grid size P-EFIT can complete one reconstruction iteration in 300 μs, with one iteration strategy, it can satisfy the needs of real-time plasma shape control. Data interface between P-EFIT and PCS is realized and developed by transferring data through RFM. First application of P-EFIT to discharge control in EAST is described.
COSMOS: Python library for massively parallel workflows.
Gafni, Erik; Luquette, Lovelace J; Lancaster, Alex K; Hawkins, Jared B; Jung, Jae-Yoon; Souilmi, Yassine; Wall, Dennis P; Tonellato, Peter J
2014-10-15
Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services. Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu. dpwall@stanford.edu or peter_tonellato@hms.harvard.edu. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
Directory of Open Access Journals (Sweden)
Ashkan Tousimojarad
2013-12-01
Full Text Available We present the Glasgow Parallel Reduction Machine (GPRM, a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.
Parallel computing of a climate model on the dawn 1000 by domain decomposition method
Bi, Xunqiang
1997-12-01
In this paper the parallel computing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massive parallel computer made by National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallel computing. The potential ways to increase the speed-up ratio and exploit more resources of future massively parallel supercomputation are also discussed.
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations
Energy Technology Data Exchange (ETDEWEB)
Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu [Department of Mechanical Engineering (United States); University of California Santa Barbara, Santa Barbara, CA, 93106 (United States); Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu [Department of Mechanical Engineering (United States); University of California Santa Barbara, Santa Barbara, CA, 93106 (United States); Department of Computer Science (United States); Department of Mathematics (United States)
2016-10-01
The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations
International Nuclear Information System (INIS)
Detrixhe, Miles; Gibou, Frédéric
2016-01-01
The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
PIXIE3D: An efficient, fully implicit, parallel, 3D extended MHD code for fusion plasma modeling
International Nuclear Information System (INIS)
Chacon, L.
2007-01-01
PIXIE3D is a modern, parallel, state-of-the-art extended MHD code that employs fully implicit methods for efficiency and accuracy. It features a general geometry formulation, and is therefore suitable for the study of many magnetic fusion configurations of interest. PIXIE3D advances the state of the art in extended MHD modeling in two fundamental ways. Firstly, it employs a novel conservative finite volume scheme which is remarkably robust and stable, and demands very small physical and/or numerical dissipation. This is a fundamental requirement when one wants to study fusion plasmas with realistic conductivities. Secondly, PIXIE3D features fully-implicit time stepping, employing Newton-Krylov methods for inverting the associated nonlinear systems. These methods have been shown to be scalable and efficient when preconditioned properly. Novel preconditioned ideas (so-called physics based), which were prototypes in the context of reduced MHD, have been adapted for 3D primitive-variable resistive MHD in PIXIE3D, and are currently being extended to Hall MHD. PIXIE3D is fully parallel, employing PETSc for parallelism. PIXIE3D has been thoroughly benchmarked against linear theory and against other available extended MHD codes on nonlinear test problems (such as the GEM reconnection challenge). We are currently in the process of extending such comparisons to fusion-relevant problems in realistic geometries. In this talk, we will describe both the spatial discretization approach and the preconditioning strategy employed for extended MHD in PIXIE3D. We will report on recent benchmarking studies between PIXIE3D and other 3D extended MHD codes, and will demonstrate its usefulness in a variety of fusion-relevant configurations such as Tokamaks and Reversed Field Pinches. (Author)
Maeda, Takuto; Takemura, Shunsuke; Furumura, Takashi
2017-07-01
We have developed an open-source software package, Open-source Seismic Wave Propagation Code (OpenSWPC), for parallel numerical simulations of seismic wave propagation in 3D and 2D (P-SV and SH) viscoelastic media based on the finite difference method in local-to-regional scales. This code is equipped with a frequency-independent attenuation model based on the generalized Zener body and an efficient perfectly matched layer for absorbing boundary condition. A hybrid-style programming using OpenMP and the Message Passing Interface (MPI) is adopted for efficient parallel computation. OpenSWPC has wide applicability for seismological studies and great portability to allowing excellent performance from PC clusters to supercomputers. Without modifying the code, users can conduct seismic wave propagation simulations using their own velocity structure models and the necessary source representations by specifying them in an input parameter file. The code has various modes for different types of velocity structure model input and different source representations such as single force, moment tensor and plane-wave incidence, which can easily be selected via the input parameters. Widely used binary data formats, the Network Common Data Form (NetCDF) and the Seismic Analysis Code (SAC) are adopted for the input of the heterogeneous structure model and the outputs of the simulation results, so users can easily handle the input/output datasets. All codes are written in Fortran 2003 and are available with detailed documents in a public repository.[Figure not available: see fulltext.
International Nuclear Information System (INIS)
Jejcic, A.; Maillard, J.; Maurel, G.; Silva, J.; Wolff-Bacha, F.
1997-01-01
The work in the field of parallel processing has developed as research activities using several numerical Monte Carlo simulations related to basic or applied current problems of nuclear and particle physics. For the applications utilizing the GEANT code development or improvement works were done on parts simulating low energy physical phenomena like radiation, transport and interaction. The problem of actinide burning by means of accelerators was approached using a simulation with the GEANT code. A program of neutron tracking in the range of low energies up to the thermal region has been developed. It is coupled to the GEANT code and permits in a single pass the simulation of a hybrid reactor core receiving a proton burst. Other works in this field refers to simulations for nuclear medicine applications like, for instance, development of biological probes, evaluation and characterization of the gamma cameras (collimators, crystal thickness) as well as the method for dosimetric calculations. Particularly, these calculations are suited for a geometrical parallelization approach especially adapted to parallel machines of the TN310 type. Other works mentioned in the same field refer to simulation of the electron channelling in crystals and simulation of the beam-beam interaction effect in colliders. The GEANT code was also used to simulate the operation of germanium detectors designed for natural and artificial radioactivity monitoring of environment
Nishizawa, Hiroaki; Nishimura, Yoshifumi; Kobayashi, Masato; Irle, Stephan; Nakai, Hiromi
2016-08-05
The linear-scaling divide-and-conquer (DC) quantum chemical methodology is applied to the density-functional tight-binding (DFTB) theory to develop a massively parallel program that achieves on-the-fly molecular reaction dynamics simulations of huge systems from scratch. The functions to perform large scale geometry optimization and molecular dynamics with DC-DFTB potential energy surface are implemented to the program called DC-DFTB-K. A novel interpolation-based algorithm is developed for parallelizing the determination of the Fermi level in the DC method. The performance of the DC-DFTB-K program is assessed using a laboratory computer and the K computer. Numerical tests show the high efficiency of the DC-DFTB-K program, a single-point energy gradient calculation of a one-million-atom system is completed within 60 s using 7290 nodes of the K computer. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Frequency of Usher syndrome type 1 in deaf children by massively parallel DNA sequencing.
Yoshimura, Hidekane; Miyagawa, Maiko; Kumakawa, Kozo; Nishio, Shin-Ya; Usami, Shin-Ichi
2016-05-01
Usher syndrome type 1 (USH1) is the most severe of the three USH subtypes due to its profound hearing loss, absent vestibular response and retinitis pigmentosa appearing at a prepubescent age. Six causative genes have been identified for USH1, making early diagnosis and therapy possible through DNA testing. Targeted exon sequencing of selected genes using massively parallel DNA sequencing (MPS) technology enables clinicians to systematically tackle previously intractable monogenic disorders and improve molecular diagnosis. Using MPS along with direct sequence analysis, we screened 227 unrelated non-syndromic deaf children and detected recessive mutations in USH1 causative genes in five patients (2.2%): three patients harbored MYO7A mutations and one each carried CDH23 or PCDH15 mutations. As indicated by an earlier genotype-phenotype correlation study of the CDH23 and PCDH15 genes, we considered the latter two patients to have USH1. Based on clinical findings, it was also highly likely that one patient with MYO7A mutations possessed USH1 due to a late onset age of walking. This first report describing the frequency (1.3-2.2%) of USH1 among non-syndromic deaf children highlights the importance of comprehensive genetic testing for early disease diagnosis.
MassiveNuS: cosmological massive neutrino simulations
Liu, Jia; Bird, Simeon; Zorrilla Matilla, José Manuel; Hill, J. Colin; Haiman, Zoltán; Madhavacheril, Mathew S.; Petri, Andrea; Spergel, David N.
2018-03-01
The non-zero mass of neutrinos suppresses the growth of cosmic structure on small scales. Since the level of suppression depends on the sum of the masses of the three active neutrino species, the evolution of large-scale structure is a promising tool to constrain the total mass of neutrinos and possibly shed light on the mass hierarchy. In this work, we investigate these effects via a large suite of N-body simulations that include massive neutrinos using an analytic linear-response approximation: the Cosmological Massive Neutrino Simulations (MassiveNuS). The simulations include the effects of radiation on the background expansion, as well as the clustering of neutrinos in response to the nonlinear dark matter evolution. We allow three cosmological parameters to vary: the neutrino mass sum Mν in the range of 0–0.6 eV, the total matter density Ωm, and the primordial power spectrum amplitude As. The rms density fluctuation in spheres of 8 comoving Mpc/h (σ8) is a derived parameter as a result. Our data products include N-body snapshots, halo catalogues, merger trees, ray-traced galaxy lensing convergence maps for four source redshift planes between zs=1–2.5, and ray-traced cosmic microwave background lensing convergence maps. We describe the simulation procedures and code validation in this paper. The data are publicly available at http://columbialensing.org.
Hu, Peng; Fabyanic, Emily; Kwon, Deborah Y; Tang, Sheng; Zhou, Zhaolan; Wu, Hao
2017-12-07
Massively parallel single-cell RNA sequencing can precisely resolve cellular diversity in a high-throughput manner at low cost, but unbiased isolation of intact single cells from complex tissues such as adult mammalian brains is challenging. Here, we integrate sucrose-gradient-assisted purification of nuclei with droplet microfluidics to develop a highly scalable single-nucleus RNA-seq approach (sNucDrop-seq), which is free of enzymatic dissociation and nucleus sorting. By profiling ∼18,000 nuclei isolated from cortical tissues of adult mice, we demonstrate that sNucDrop-seq not only accurately reveals neuronal and non-neuronal subtype composition with high sensitivity but also enables in-depth analysis of transient transcriptional states driven by neuronal activity, at single-cell resolution, in vivo. Copyright © 2017 Elsevier Inc. All rights reserved.
Parallel pic plasma simulation through particle decomposition techniques
International Nuclear Information System (INIS)
Briguglio, S.; Vlad, G.; Di Martino, B.; Naples, Univ. 'Federico II'
1998-02-01
Particle-in-cell (PIC) codes are among the major candidates to yield a satisfactory description of the detail of kinetic effects, such as the resonant wave-particle interaction, relevant in determining the transport mechanism in magnetically confined plasmas. A significant improvement of the simulation performance of such codes con be expected from parallelization, e.g., by distributing the particle population among several parallel processors. Parallelization of a hybrid magnetohydrodynamic-gyrokinetic code has been accomplished within the High Performance Fortran (HPF) framework, and tested on the IBM SP2 parallel system, using a 'particle decomposition' technique. The adopted technique requires a moderate effort in porting the code in parallel form and results in intrinsic load balancing and modest inter processor communication. The performance tests obtained confirm the hypothesis of high effectiveness of the strategy, if targeted towards moderately parallel architectures. Optimal use of resources is also discussed with reference to a specific physics problem [it
Implementation of collisions on GPU architecture in the Vorpal code
Leddy, Jarrod; Averkin, Sergey; Cowan, Ben; Sides, Scott; Werner, Greg; Cary, John
2017-10-01
The Vorpal code contains a variety of collision operators allowing for the simulation of plasmas containing multiple charge species interacting with neutrals, background gas, and EM fields. These existing algorithms have been improved and reimplemented to take advantage of the massive parallelization allowed by GPU architecture. The use of GPUs is most effective when algorithms are single-instruction multiple-data, so particle collisions are an ideal candidate for this parallelization technique due to their nature as a series of independent processes with the same underlying operation. This refactoring required data memory reorganization and careful consideration of device/host data allocation to minimize memory access and data communication per operation. Successful implementation has resulted in an order of magnitude increase in simulation speed for a test-case involving multiple binary collisions using the null collision method. Work supported by DARPA under contract W31P4Q-16-C-0009.
Open-Source Development of the Petascale Reactive Flow and Transport Code PFLOTRAN
Hammond, G. E.; Andre, B.; Bisht, G.; Johnson, T.; Karra, S.; Lichtner, P. C.; Mills, R. T.
2013-12-01
Open-source software development has become increasingly popular in recent years. Open-source encourages collaborative and transparent software development and promotes unlimited free redistribution of source code to the public. Open-source development is good for science as it reveals implementation details that are critical to scientific reproducibility, but generally excluded from journal publications. In addition, research funds that would have been spent on licensing fees can be redirected to code development that benefits more scientists. In 2006, the developers of PFLOTRAN open-sourced their code under the U.S. Department of Energy SciDAC-II program. Since that time, the code has gained popularity among code developers and users from around the world seeking to employ PFLOTRAN to simulate thermal, hydraulic, mechanical and biogeochemical processes in the Earth's surface/subsurface environment. PFLOTRAN is a massively-parallel subsurface reactive multiphase flow and transport simulator designed from the ground up to run efficiently on computing platforms ranging from the laptop to leadership-class supercomputers, all from a single code base. The code employs domain decomposition for parallelism and is founded upon the well-established and open-source parallel PETSc and HDF5 frameworks. PFLOTRAN leverages modern Fortran (i.e. Fortran 2003-2008) in its extensible object-oriented design. The use of this progressive, yet domain-friendly programming language has greatly facilitated collaboration in the code's software development. Over the past year, PFLOTRAN's top-level data structures were refactored as Fortran classes (i.e. extendible derived types) to improve the flexibility of the code, ease the addition of new process models, and enable coupling to external simulators. For instance, PFLOTRAN has been coupled to the parallel electrical resistivity tomography code E4D to enable hydrogeophysical inversion while the same code base can be used as a third
Parallel Tensor Compression for Large-Scale Scientific Data.
Energy Technology Data Exchange (ETDEWEB)
Kolda, Tamara G. [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Ballard, Grey [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Austin, Woody Nathan [Univ. of Texas, Austin, TX (United States)
2015-10-01
As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data. By viewing the data as a dense five way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 10000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.
Design considerations for parallel graphics libraries
Crockett, Thomas W.
1994-01-01
Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.
Energy Technology Data Exchange (ETDEWEB)
Xiao, Jianjun, E-mail: jianjun.xiao@kit.edu [Institute of Nuclear and Energy Technologies, Karlsruhe Institute of Technology, P.O. Box 3640, 76021 Karlsruhe (Germany); Travis, John R., E-mail: jack_travis@comcast.com [Engineering and Scientific Software Inc., 3010 Old Pecos Trail, Santa Fe, NM 87505 (United States); Royl, Peter, E-mail: peter.royl@partner.kit.edu [Institute of Nuclear and Energy Technologies, Karlsruhe Institute of Technology, P.O. Box 3640, 76021 Karlsruhe (Germany); Necker, Gottfried, E-mail: gottfried.necker@partner.kit.edu [Institute of Nuclear and Energy Technologies, Karlsruhe Institute of Technology, P.O. Box 3640, 76021 Karlsruhe (Germany); Svishchev, Anatoly, E-mail: anatoly.svishchev@kit.edu [Institute of Nuclear and Energy Technologies, Karlsruhe Institute of Technology, P.O. Box 3640, 76021 Karlsruhe (Germany); Jordan, Thomas, E-mail: thomas.jordan@kit.edu [Institute of Nuclear and Energy Technologies, Karlsruhe Institute of Technology, P.O. Box 3640, 76021 Karlsruhe (Germany)
2016-05-15
Highlights: • 3-D scalable semi-implicit pressure-based CFD code for containment safety analysis. • Robust solution algorithm valid for all-speed flows. • Well validated and widely used CFD code for hydrogen safety analysis. • Code applied in various types of nuclear reactor containments. • Parallelization enables high-fidelity models in large scale containment simulations. - Abstract: GASFLOW is a three dimensional semi-implicit all-speed CFD code which can be used to predict fluid dynamics, chemical kinetics, heat and mass transfer, aerosol transportation and other related phenomena involved in postulated accidents in nuclear reactor containments. The main purpose of the paper is to give a brief review on recent GASFLOW code development, validations and applications in the field of nuclear safety. GASFLOW code has been well validated by international experimental benchmarks, and has been widely applied to hydrogen safety analysis in various types of nuclear power plants in European and Asian countries, which have been summarized in this paper. Furthermore, four benchmark tests of a lid-driven cavity flow, low Mach number jet flow, 1-D shock tube and supersonic flow over a forward-facing step are presented in order to demonstrate the accuracy and wide-ranging capability of ICE’d ALE solution algorithm for all-speed flows. GASFLOW has been successfully parallelized using the paradigms of Message Passing Interface (MPI) and domain decomposition. The parallel version, GASFLOW-MPI, adds great value to large scale containment simulations by enabling high-fidelity models, including more geometric details and more complex physics. It will be helpful for the nuclear safety engineers to better understand the hydrogen safety related physical phenomena during the severe accident, to optimize the design of the hydrogen risk mitigation systems and to fulfill the licensing requirements by the nuclear regulatory authorities. GASFLOW-MPI is targeting a high
DIMACS Workshop on Interconnection Networks and Mapping, and Scheduling Parallel Computations
Rosenberg, Arnold L; Sotteau, Dominique; NSF Science and Technology Center in Discrete Mathematics and Theoretical Computer Science; Interconnection networks and mapping and scheduling parallel computations
1995-01-01
The interconnection network is one of the most basic components of a massively parallel computer system. Such systems consist of hundreds or thousands of processors interconnected to work cooperatively on computations. One of the central problems in parallel computing is the task of mapping a collection of processes onto the processors and routing network of a parallel machine. Once this mapping is done, it is critical to schedule computations within and communication among processor from universities and laboratories, as well as practitioners involved in the design, implementation, and application of massively parallel systems. Focusing on interconnection networks of parallel architectures of today and of the near future , the book includes topics such as network topologies,network properties, message routing, network embeddings, network emulation, mappings, and efficient scheduling. inputs for a process are available where and when the process is scheduled to be computed. This book contains the refereed pro...
A Parallel Adaboost-Backpropagation Neural Network for Massive Image Dataset Classification
Cao, Jianfang; Chen, Lichao; Wang, Min; Shi, Hao; Tian, Yun
2016-01-01
Image classification uses computers to simulate human understanding and cognition of images by automatically categorizing images. This study proposes a faster image classification approach that parallelizes the traditional Adaboost-Backpropagation (BP) neural network using the MapReduce parallel programming model. First, we construct a strong classifier by assembling the outputs of 15 BP neural networks (which are individually regarded as weak classifiers) based on the Adaboost algorithm. Second, we design Map and Reduce tasks for both the parallel Adaboost-BP neural network and the feature extraction algorithm. Finally, we establish an automated classification model by building a Hadoop cluster. We use the Pascal VOC2007 and Caltech256 datasets to train and test the classification model. The results are superior to those obtained using traditional Adaboost-BP neural network or parallel BP neural network approaches. Our approach increased the average classification accuracy rate by approximately 14.5% and 26.0% compared to the traditional Adaboost-BP neural network and parallel BP neural network, respectively. Furthermore, the proposed approach requires less computation time and scales very well as evaluated by speedup, sizeup and scaleup. The proposed approach may provide a foundation for automated large-scale image classification and demonstrates practical value. PMID:27905520
Ordered particles versus ordered pointers in the hybrid ordered plasma simulation (HOPS) code
International Nuclear Information System (INIS)
Anderson, D.V.; Shumaker, D.E.
1993-01-01
From a computational standpoint, particle simulation calculations for plasmas have not adapted well to the transitions from scalar to vector processing nor from serial to parallel environments. They have suffered from inordinate and excessive accessing of computer memory and have been hobbled by relatively inefficient gather-scatter constructs resulting from the use of indirect indexing. Lastly, the many-to-one mapping characteristic of the deposition phase has made it difficult to perform this in parallel. The authors' code sorts and reorders the particles in a spatial order. This allows them to greatly reduce the memory references, to run in directly indexed vector mode, and to employ domain decomposition to achieve parallelization. The field model solves pre-maxwell equations by interatively implicit methods. The OSOP (Ordered Storage Ordered Processing) version of HOPS keeps the particle tables ordered by rebuilding them after each particle pushing phase. Alternatively, the RSOP (Random Storage Ordered Processing) version keeps a table of pointers ordered by rebuilding them. Although OSOP is somewhat faster than RSOP in tests on vector-parallel machines, it is not clear this advantage will carry over to massively parallel computers
Massively parallel self-consistent-field calculations
International Nuclear Information System (INIS)
Tilson, J.L.
1994-01-01
The advent of supercomputers with many computational nodes each with its own independent memory makes possible extremely fast computations. The author's work, as part of the US High Performance Computing and Communications Program (HPCCP), is focused on the development of electronic structure techniques for the solution of Grand Challenge-size molecules containing hundreds of atoms. Their efforts have resulted in a fully scalable Direct-SCF program that is portable and efficient. This code, named NWCHEM, is built around a distributed-data model. This distributed data is managed by a software package called Global Arrays developed within the HPCCP. They present performance results for Direct-SCF calculations of interest to the consortium
Supernovae from massive AGB stars
Poelarends, A.J.T.; Izzard, R.G.; Herwig, F.; Langer, N.; Heger, A.
2006-01-01
We present new computations of the final fate of massive AGB-stars. These stars form ONeMg cores after a phase of carbon burning and are called Super AGB stars (SAGB). Detailed stellar evolutionary models until the thermally pulsing AGB were computed using three di erent stellar evolution codes. The
Energy Technology Data Exchange (ETDEWEB)
Fajeau, M; Nguyen, L T; Saunier, J [Commissariat a l' Energie Atomique, Centre d' Etudes Nucleaires de Saclay, 91 - Gif-sur-Yvette (France)
1966-09-01
This code handles the following problems: -1) Analysis of thermal experiments on a water loop at high or low pressure; steady state or transient behavior; -2) Analysis of thermal and hydrodynamic behavior of water-cooled and moderated reactors, at either high or low pressure, with boiling permitted; fuel elements are assumed to be flat plates: - Flowrate in parallel channels coupled or not by conduction across plates, with conditions of pressure drops or flowrate, variable or not with respect to time is given; the power can be coupled to reactor kinetics calculation or supplied by the code user. The code, containing a schematic representation of safety rod behavior, is a one dimensional, multi-channel code, and has as its complement (FLID), a one-channel, two-dimensional code. (authors) [French] Ce code permet de traiter les problemes ci-dessous: 1. Depouillement d'essais thermiques sur boucle a eau, haute ou basse pression, en regime permanent ou transitoire; 2. Etudes thermiques et hydrauliques de reacteurs a eau, a plaques, a haute ou basse pression, ebullition permise: - repartition entre canaux paralleles, couples on non par conduction a travers plaques, pour des conditions de debit ou de pertes de charge imposees, variables ou non dans le temps; - la puissance peut etre couplee a la neutronique et une representation schematique des actions de securite est prevue. Ce code (Cactus) a une dimension d'espace et plusieurs canaux, a pour complement Flid qui traite l'etude d'un seul canal a deux dimensions. (auteurs)
Lemon : An MPI parallel I/O library for data encapsulation using LIME
Deuzeman, Albert; Reker, Siebren; Urbach, Carsten
We introduce Lemon, an MPI parallel I/O library that provides efficient parallel I/O of both binary and metadata on massively parallel architectures. Motivated by the demands of the lattice Quantum Chromodynamics community, the data is stored in the SciDAC Lattice QCD Interchange Message
Cellular automata a parallel model
Mazoyer, J
1999-01-01
Cellular automata can be viewed both as computational models and modelling systems of real processes. This volume emphasises the first aspect. In articles written by leading researchers, sophisticated massive parallel algorithms (firing squad, life, Fischer's primes recognition) are treated. Their computational power and the specific complexity classes they determine are surveyed, while some recent results in relation to chaos from a new dynamic systems point of view are also presented. Audience: This book will be of interest to specialists of theoretical computer science and the parallelism challenge.
Scudder, Nathan; McNevin, Dennis; Kelty, Sally F; Walsh, Simon J; Robertson, James
2018-03-01
Use of DNA in forensic science will be significantly influenced by new technology in coming years. Massively parallel sequencing and forensic genomics will hasten the broadening of forensic DNA analysis beyond short tandem repeats for identity towards a wider array of genetic markers, in applications as diverse as predictive phenotyping, ancestry assignment, and full mitochondrial genome analysis. With these new applications come a range of legal and policy implications, as forensic science touches on areas as diverse as 'big data', privacy and protected health information. Although these applications have the potential to make a more immediate and decisive forensic intelligence contribution to criminal investigations, they raise policy issues that will require detailed consideration if this potential is to be realised. The purpose of this paper is to identify the scope of the issues that will confront forensic and user communities. Copyright © 2017 The Chartered Society of Forensic Sciences. All rights reserved.
International Nuclear Information System (INIS)
Fischer, G.A.
2010-01-01
The PCA Benchmark is analyzed using RAPTOR-M3G, a parallel SN radiation transport code. A variety of mesh structures, angular quadrature sets, cross section treatments, and reactor dosimetry cross sections are presented. The results show that RAPTOR-M3G is generally suitable for PWR neutron dosimetry applications. (authors)
Multitasking TORT under UNICOS: Parallel performance models and measurements
International Nuclear Information System (INIS)
Barnett, A.; Azmy, Y.Y.
1999-01-01
The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead
A parallel algorithm for transient solid dynamics simulations with contact detection
International Nuclear Information System (INIS)
Attaway, S.; Hendrickson, B.; Plimpton, S.; Gardner, D.; Vaughan, C.; Heinstein, M.; Peery, J.
1996-01-01
Solid dynamics simulations with Lagrangian finite elements are used to model a wide variety of problems, such as the calculation of impact damage to shipping containers for nuclear waste and the analysis of vehicular crashes. Using parallel computers for these simulations has been hindered by the difficulty of searching efficiently for material surface contacts in parallel. A new parallel algorithm for calculation of arbitrary material contacts in finite element simulations has been developed and implemented in the PRONTO3D transient solid dynamics code. This paper will explore some of the issues involved in developing efficient, portable, parallel finite element models for nonlinear transient solid dynamics simulations. The contact-detection problem poses interesting challenges for efficient implementation of a solid dynamics simulation on a parallel computer. The finite element mesh is typically partitioned so that each processor owns a localized region of the finite element mesh. This mesh partitioning is optimal for the finite element portion of the calculation since each processor must communicate only with the few connected neighboring processors that share boundaries with the decomposed mesh. However, contacts can occur between surfaces that may be owned by any two arbitrary processors. Hence, a global search across all processors is required at every time step to search for these contacts. Load-imbalance can become a problem since the finite element decomposition divides the volumetric mesh evenly across processors but typically leaves the surface elements unevenly distributed. In practice, these complications have been limiting factors in the performance and scalability of transient solid dynamics on massively parallel computers. In this paper the authors present a new parallel algorithm for contact detection that overcomes many of these limitations
Directory of Open Access Journals (Sweden)
Kim De Leeneer
Full Text Available Despite improvements in terms of sequence quality and price per basepair, Sanger sequencing remains restricted to screening of individual disease genes. The development of massively parallel sequencing (MPS technologies heralded an era in which molecular diagnostics for multigenic disorders becomes reality. Here, we outline different PCR amplification based strategies for the screening of a multitude of genes in a patient cohort. We performed a thorough evaluation in terms of set-up, coverage and sequencing variants on the data of 10 GS-FLX experiments (over 200 patients. Crucially, we determined the actual coverage that is required for reliable diagnostic results using MPS, and provide a tool to calculate the number of patients that can be screened in a single run. Finally, we provide an overview of factors contributing to false negative or false positive mutation calls and suggest ways to maximize sensitivity and specificity, both important in a routine setting. By describing practical strategies for screening of multigenic disorders in a multitude of samples and providing answers to questions about minimum required coverage, the number of patients that can be screened in a single run and the factors that may affect sensitivity and specificity we hope to facilitate the implementation of MPS technology in molecular diagnostics.
Practical parallel programming
Bauer, Barr E
2014-01-01
This is the book that will teach programmers to write faster, more efficient code for parallel processors. The reader is introduced to a vast array of procedures and paradigms on which actual coding may be based. Examples and real-life simulations using these devices are presented in C and FORTRAN.
Directory of Open Access Journals (Sweden)
Kim Vancampenhout
Full Text Available The advent of massive parallel sequencing (MPS has revolutionized the field of human molecular genetics, including the diagnostic study of mitochondrial (mt DNA dysfunction. The analysis of the complete mitochondrial genome using MPS platforms is now common and will soon outrun conventional sequencing. However, the development of a robust and reliable protocol is rather challenging. A previous pilot study for the re-sequencing of human mtDNA revealed an uneven coverage, affecting predominantly part of the plus strand. In an attempt to address this problem, we undertook a comparative study of standard and modified protocols for the Ion Torrent PGM system. We could not improve strand representation by altering the recommended shearing methodology of the standard workflow or omitting the DNA polymerase amplification step from the library construction process. However, we were able to associate coverage bias of the plus strand with a specific sequence motif. Additionally, we compared coverage and variant calling across technologies. The same samples were also sequenced on a MiSeq device which showed that coverage and heteroplasmic variant calling were much improved.
Fernández-Caballero Rico, Jose Ángel; Chueca Porcuna, Natalia; Álvarez Estévez, Marta; Mosquera Gutiérrez, María Del Mar; Marcos Maeso, María Ángeles; García, Federico
2018-02-01
To show how to generate a consensus sequence from the information of massive parallel sequences data obtained from routine HIV anti-retroviral resistance studies, and that may be suitable for molecular epidemiology studies. Paired Sanger (Trugene-Siemens) and next-generation sequencing (NGS) (454 GSJunior-Roche) HIV RT and protease sequences from 62 patients were studied. NGS consensus sequences were generated using Mesquite, using 10%, 15%, and 20% thresholds. Molecular evolutionary genetics analysis (MEGA) was used for phylogenetic studies. At a 10% threshold, NGS-Sanger sequences from 17/62 patients were phylogenetically related, with a median bootstrap-value of 88% (IQR83.5-95.5). Association increased to 36/62 sequences, median bootstrap 94% (IQR85.5-98)], using a 15% threshold. Maximum association was at the 20% threshold, with 61/62 sequences associated, and a median bootstrap value of 99% (IQR98-100). A safe method is presented to generate consensus sequences from HIV-NGS data at 20% threshold, which will prove useful for molecular epidemiological studies. Copyright © 2016 Elsevier España, S.L.U. and Sociedad Española de Enfermedades Infecciosas y Microbiología Clínica. All rights reserved.
An inherently parallel method for solving discretized diffusion equations
International Nuclear Information System (INIS)
Eccleston, B.R.; Palmer, T.S.
1999-01-01
A Monte Carlo approach to solving linear systems of equations is being investigated in the context of the solution of discretized diffusion equations. While the technique was originally devised decades ago, changes in computer architectures (namely, massively parallel machines) have driven the authors to revisit this technique. There are a number of potential advantages to this approach: (1) Analog Monte Carlo techniques are inherently parallel; this is not necessarily true to today's more advanced linear equation solvers (multigrid, conjugate gradient, etc.); (2) Some forms of this technique are adaptive in that they allow the user to specify locations in the problem where resolution is of particular importance and to concentrate the work at those locations; and (3) These techniques permit the solution of very large systems of equations in that matrix elements need not be stored. The user could trade calculational speed for storage if elements of the matrix are calculated on the fly. The goal of this study is to compare the parallel performance of Monte Carlo linear solvers to that of a more traditional parallelized linear solver. The authors observe the linear speedup that they expect from the Monte Carlo algorithm, given that there is no domain decomposition to cause significant communication overhead. Overall, PETSc outperforms the Monte Carlo solver for the test problem. The PETSc parallel performance improves with larger numbers of unknowns for a given number of processors. Parallel performance of the Monte Carlo technique is independent of the size of the matrix and the number of processes. They are investigating modifications to the scheme to accommodate matrix problems with positive off-diagonal elements. They are also currently coding an on-the-fly version of the algorithm to investigate the solution of very large linear systems
Energy Technology Data Exchange (ETDEWEB)
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving`s meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia`s {open_quotes}tiling{close_quotes} dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
Malas, Tareq M.
2016-07-21
Understanding and optimizing the properties of solar cells is becoming a key issue in the search for alternatives to nuclear and fossil energy sources. A theoretical analysis via numerical simulations involves solving Maxwell\\'s Equations in discretized form and typically requires substantial computing effort. We start from a hybrid-parallel (MPI+OpenMP) production code that implements the Time Harmonic Inverse Iteration Method (THIIM) with Finite-Difference Frequency Domain (FDFD) discretization. Although this algorithm has the characteristics of a strongly bandwidth-bound stencil update scheme, it is significantly different from the popular stencil types that have been exhaustively studied in the high performance computing literature to date. We apply a recently developed stencil optimization technique, multicore wavefront diamond tiling with multi-dimensional cache block sharing, and describe in detail the peculiarities that need to be considered due to the special stencil structure. Concurrency in updating the components of the electric and magnetic fields provides an additional level of parallelism. The dependence of the cache size requirement of the optimized code on the blocking parameters is modeled accurately, and an auto-tuner searches for optimal configurations in the remaining parameter space. We were able to completely decouple the execution from the memory bandwidth bottleneck, accelerating the implementation by a factor of three to four compared to an optimal implementation with pure spatial blocking on an 18-core Intel Haswell CPU.
International Nuclear Information System (INIS)
Li Hanyu; Zhou Haijing; Dong Zhiwei; Liao Cheng; Chang Lei; Cao Xiaolin; Xiao Li
2010-01-01
A large-scale parallel electromagnetic field simulation program JEMS-FDTD(J Electromagnetic Solver-Finite Difference Time Domain) is designed and implemented on JASMIN (J parallel Adaptive Structured Mesh applications INfrastructure). This program can simulate propagation, radiation, couple of electromagnetic field by solving Maxwell equations on structured mesh explicitly with FDTD method. JEMS-FDTD is able to simulate billion-mesh-scale problems on thousands of processors. In this article, the program is verified by simulating the radiation of an electric dipole. A beam waveguide is simulated to demonstrate the capability of large scale parallel computation. A parallel performance test indicates that a high parallel efficiency is obtained. (authors)
Directory of Open Access Journals (Sweden)
Pietro Quaglio
2017-05-01
Full Text Available Repeated, precise sequences of spikes are largely considered a signature of activation of cell assemblies. These repeated sequences are commonly known under the name of spatio-temporal patterns (STPs. STPs are hypothesized to play a role in the communication of information in the computational process operated by the cerebral cortex. A variety of statistical methods for the detection of STPs have been developed and applied to electrophysiological recordings, but such methods scale poorly with the current size of available parallel spike train recordings (more than 100 neurons. In this work, we introduce a novel method capable of overcoming the computational and statistical limits of existing analysis techniques in detecting repeating STPs within massively parallel spike trains (MPST. We employ advanced data mining techniques to efficiently extract repeating sequences of spikes from the data. Then, we introduce and compare two alternative approaches to distinguish statistically significant patterns from chance sequences. The first approach uses a measure known as conceptual stability, of which we investigate a computationally cheap approximation for applications to such large data sets. The second approach is based on the evaluation of pattern statistical significance. In particular, we provide an extension to STPs of a method we recently introduced for the evaluation of statistical significance of synchronous spike patterns. The performance of the two approaches is evaluated in terms of computational load and statistical power on a variety of artificial data sets that replicate specific features of experimental data. Both methods provide an effective and robust procedure for detection of STPs in MPST data. The method based on significance evaluation shows the best overall performance, although at a higher computational cost. We name the novel procedure the spatio-temporal Spike PAttern Detection and Evaluation (SPADE analysis.
Allsopp, Nicholas
2012-04-01
We report scaling results on the world\\'s largest supercomputer of our recently developed Billions-Body Molecular Dynamics (BBMD) package, which was especially designed for massively parallel simulations of the short-range atomic dynamics in structural glasses and amorphous materials. The code was able to scale up to 72 racks of an IBM BlueGene/P, with a measured 89% efficiency for a system with 100 billion particles. The code speed, with 0.13. s per iteration in the case of 1 billion particles, paves the way to the study of billion-body structural glasses with a resolution increase of two orders of magnitude with respect to the largest simulation ever reported. We demonstrate the effectiveness of our code by studying the liquid-glass transition of an exceptionally large system made by a binary mixture of 1 billion particles. © 2012.
A parallel solver for huge dense linear systems
Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.
2011-11-01
HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system
Utilizing GPUs to Accelerate Turbomachinery CFD Codes
MacCalla, Weylin; Kulkarni, Sameer
2016-01-01
GPU computing has established itself as a way to accelerate parallel codes in the high performance computing world. This work focuses on speeding up APNASA, a legacy CFD code used at NASA Glenn Research Center, while also drawing conclusions about the nature of GPU computing and the requirements to make GPGPU worthwhile on legacy codes. Rewriting and restructuring of the source code was avoided to limit the introduction of new bugs. The code was profiled and investigated for parallelization potential, then OpenACC directives were used to indicate parallel parts of the code. The use of OpenACC directives was not able to reduce the runtime of APNASA on either the NVIDIA Tesla discrete graphics card, or the AMD accelerated processing unit. Additionally, it was found that in order to justify the use of GPGPU, the amount of parallel work being done within a kernel would have to greatly exceed the work being done by any one portion of the APNASA code. It was determined that in order for an application like APNASA to be accelerated on the GPU, it should not be modular in nature, and the parallel portions of the code must contain a large portion of the code's computation time.
Use of parallel counters for triggering
International Nuclear Information System (INIS)
Nikityuk, N.M.
1991-01-01
Results of investigation of using parallel counters, majority coincidence schemes, parallel compressors for triggering in multichannel high energy spectrometers are described. Concrete examples of methods of constructing fast and economic new devices used to determine multiplicity hits t>900 registered in a hodoscopic plane and a pixel detector are given. For this purpose the author uses the syndrome coding method and cellular arrays. In addition, an effective coding matrix has been created which can be used for light signal coding. For example, such signals are supplied from scintillators to photomultipliers. 23 refs.; 21 figs
Johnson, T.; Rucker, D. F.; Wellman, D.
2013-12-01
revealed the general footprint of vadose zone contamination beneath infiltration galleries. In 2011, the USDOE commissioned an effort to re-invert the B-Complex ERT data as a whole using a recently developed massively parallel 3D ERT inversion code. The computational mesh included approximately 1.085 million elements and closely honored the 37m of topographic relief as determined by LiDAR imaging. The water table and tank boundaries were also incorporated into the mesh to facilitate regularization disconnects, enabling sharp conductivity contrasts where they occur naturally without penalty. The data were inverted using 1024 processors, requiring 910 Gb of memory and 11.5 hours of computation time. The imaging results revealed previously unrealized detail concerning the distribution and behavior of contaminants migrating through the vadose zone, and are currently being used by site cleanup operators and regulators to understand the origin of a groundwater nitrate plume emerging from one of the infiltration galleries. The results overall demonstrate the utility of high performance computing, unstructured meshing, and custom regularization constraints for optimal processing of massive ERT data sets enabled by modern ERT survey hardware.
Overview of development and design of MPACT: Michigan parallel characteristics transport code
Energy Technology Data Exchange (ETDEWEB)
Kochunas, B.; Collins, B.; Jabaay, D.; Downar, T. J.; Martin, W. R. [Department of Nuclear Engineering and Radiological Sciences, University of Michigan, 2200 Bonisteel, Ann Arbor, MI 48109 (United States)
2013-07-01
MPACT (Michigan Parallel Characteristics Transport Code) is a new reactor analysis tool. It is being developed by students and research staff at the University of Michigan to be used for an advanced pin-resolved transport capability within VERA (Virtual Environment for Reactor Analysis). VERA is the end-user reactor simulation tool being produced by the Consortium for the Advanced Simulation of Light Water Reactors (CASL). The MPACT development project is itself unique for the way it is changing how students do research to achieve the instructional and research goals of an academic institution, while providing immediate value to industry. The MPACT code makes use of modern lean/agile software processes and extensive testing to maintain a level of productivity and quality required by CASL. MPACT's design relies heavily on object-oriented programming concepts and design patterns and is programmed in Fortran 2003. These designs are explained and illustrated as to how they can be readily extended to incorporate new capabilities and research ideas in support of academic research objectives. The transport methods currently implemented in MPACT include the 2-D and 3-D method of characteristics (MOC) and 2-D and 3-D method of collision direction probabilities (CDP). For the cross section resonance treatment, presently the subgroup method and the new embedded self-shielding method (ESSM) are implemented within MPACT. (authors)
Massive Predictive Modeling using Oracle R Enterprise
CERN. Geneva
2014-01-01
R is fast becoming the lingua franca for analyzing data via statistics, visualization, and predictive analytics. For enterprise-scale data, R users have three main concerns: scalability, performance, and production deployment. Oracle's R-based technologies - Oracle R Distribution, Oracle R Enterprise, Oracle R Connector for Hadoop, and the R package ROracle - address these concerns. In this talk, we introduce Oracle's R technologies, highlighting how each enables R users to achieve scalability and performance while making production deployment of R results a natural outcome of the data analyst/scientist efforts. The focus then turns to Oracle R Enterprise with code examples using the transparency layer and embedded R execution, targeting massive predictive modeling. One goal behind massive predictive modeling is to build models per entity, such as customers, zip codes, simulations, in an effort to understand behavior and tailor predictions at the entity level. Predictions...
Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing
Amooie, M. A.; Moortgat, J.
2017-12-01
We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with fast Quad Core 1.2GHz ARMv8 64bit processor, 1GB of RAM, and 32GB microSD card for local storage. Therefore, the cluster has a total RAM of 128GB that is distributed on the individual nodes and a flash capacity of 4TB with 512 processors, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between each node. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance-computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively-parallelized scalable code. We present benchmarking results for the computational performance across various number of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and a feasible learning platform for challenging engineering and scientific problems.
Monte Carlo calculations on a parallel computer using MORSE-C.G
International Nuclear Information System (INIS)
Wood, J.
1995-01-01
The general purpose particle transport Monte Carlo code, MORSE-C.G., is implemented on a parallel computing transputer-based system having MIMD architecture. Example problems are solved which are representative of the 3-principal types of problem that can be solved by the original serial code, namely, fixed source, eigenvalue (k-eff) and time-dependent. The results from the parallelized version of the code are compared in tables with the serial code run on a mainframe serial computer, and with an independent, deterministic transport code. The performance of the parallel computer as the number of processors is varied is shown graphically. For the parallel strategy used, the loss of efficiency as the number of processors is increased, is investigated. (author)
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB
Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.
2017-01-01
Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
Parallel-Architecture Simulator Development Using Hardware Transactional Memory
Armejach Sanosa, Adrià
2009-01-01
To address the need for a simpler parallel programming model, Transactional Memory (TM) has been developed and promises good parallel performance with easy-to-write parallel code. Unlike lock-based approaches, with TM, programmers do not need to explicitly specify and manage the synchronization among threads. However, programmers simply mark code segments as transactions, and the TM system manages the concurrency control for them. TM can be implemented either in software (STM) or hardware (HT...
Hoekstra, A.G.; Sloot, P.M.A.; Haan, M.J.; Hertzberger, L.O.; van Leeuwen, J.
1991-01-01
New developments in Computer Science, both hardware and software, offer researchers, such as physicists, unprecedented possibilities to solve their computational intensive problems.However, full exploitation of e.g. new massively parallel computers, parallel languages or runtime environments
Regional-scale calculation of the LS factor using parallel processing
Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong
2015-05-01
With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategy are designed according to the algorithm characters including the decomposition method for maintaining the integrity of the results, optimized workflow for reducing the time taken for exporting the unnecessary intermediate data and a buffer-communication-computation strategy for improving the communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.
AMIDST: Analysis of MassIve Data STreams
DEFF Research Database (Denmark)
Masegosa, Andres; Martinez, Ana Maria; Borchani, Hanen
2015-01-01
The Analysis of MassIve Data STreams (AMIDST) Java toolbox provides a collection of scalable and parallel algorithms for inference and learning of hybrid Bayesian networks from data streams. The toolbox, available at http://amidst.github.io/toolbox/ under the Apache Software License version 2.......0, also efficiently leverages existing functionalities and algorithms by interfacing to software tools such as HUGIN and MOA....
A Parallel Approach to Fractal Image Compression
Lubomir Dedera
2004-01-01
The paper deals with a parallel approach to coding and decoding algorithms in fractal image compressionand presents experimental results comparing sequential and parallel algorithms from the point of view of achieved bothcoding and decoding time and effectiveness of parallelization.
Lee, Li-Yu; Lin, Gigin; Chen, Shu-Jen; Lu, Yen-Jung; Huang, Huei-Jean; Yen, Chi-Feng; Han, Chien Min; Lee, Yun-Shien; Wang, Tzu-Hao; Chao, Angel
2017-01-01
Benign metastasizing leiomyoma (BML) is a rare disease entity typically presenting as multiple extrauterine leiomyomas associated with a uterine leiomyoma. It has been hypothesized that the extrauterine leiomyomata represent distant metastasis of the uterine leiomyoma. To date, the only molecular evidence supporting this hypothesis was derived from clonality analyses based on X-chromosome inactivation assays. Here, we sought to address this issue by examining paired specimens of synchronous pulmonary and uterine leiomyomata from three patients using targeted massively parallel sequencing and molecular inversion probe array analysis for detecting somatic mutations and copy number aberrations. We detected identical non-hot-spot somatic mutations and similar patterns of copy number aberrations (CNAs) in paired pulmonary and uterine leiomyomata from two patients, indicating the clonal relationship between pulmonary and uterine leiomyomata. In addition to loss of chromosome 22q found in the literature, we identified additional recurrent CNAs including losses of chromosome 3q and 11q. In conclusion, our findings of the clonal relationship between synchronous pulmonary and uterine leiomyomas support the hypothesis that BML represents a condition wherein a uterine leiomyoma disseminates to distant extrauterine locations. PMID:28533481
Wu, Ren-Chin; Chao, An-Shine; Lee, Li-Yu; Lin, Gigin; Chen, Shu-Jen; Lu, Yen-Jung; Huang, Huei-Jean; Yen, Chi-Feng; Han, Chien Min; Lee, Yun-Shien; Wang, Tzu-Hao; Chao, Angel
2017-07-18
Benign metastasizing leiomyoma (BML) is a rare disease entity typically presenting as multiple extrauterine leiomyomas associated with a uterine leiomyoma. It has been hypothesized that the extrauterine leiomyomata represent distant metastasis of the uterine leiomyoma. To date, the only molecular evidence supporting this hypothesis was derived from clonality analyses based on X-chromosome inactivation assays. Here, we sought to address this issue by examining paired specimens of synchronous pulmonary and uterine leiomyomata from three patients using targeted massively parallel sequencing and molecular inversion probe array analysis for detecting somatic mutations and copy number aberrations. We detected identical non-hot-spot somatic mutations and similar patterns of copy number aberrations (CNAs) in paired pulmonary and uterine leiomyomata from two patients, indicating the clonal relationship between pulmonary and uterine leiomyomata. In addition to loss of chromosome 22q found in the literature, we identified additional recurrent CNAs including losses of chromosome 3q and 11q. In conclusion, our findings of the clonal relationship between synchronous pulmonary and uterine leiomyomas support the hypothesis that BML represents a condition wherein a uterine leiomyoma disseminates to distant extrauterine locations.
Development of an object-oriented simulation code for repository performance assessment
International Nuclear Information System (INIS)
Tsujimoto, Keiichi; Ahn, J.
1999-01-01
As understanding for mechanisms of radioactivity confinement by a deep geologic repository improves at the individual process level, it has become imperative to evaluate consequences of individual processes to the performance of the whole repository system. For this goal, the authors have developed a model for radionuclide transport in, and release from, the repository region by incorporating multiple-member decay chains and multiple waste canisters. A computer code has been developed with C++, an object-oriented language. By utilizing the feature that a geologic repository consists of thousands of objects of the same kind, such as the waste canister, the repository region is divided into multiple compartments and objects for simulation of radionuclide transport. Massive computational tasks are distributed over, and executed by, multiple networked workstations, with the help of parallel virtual machine (PVM) technology. Temporal change of the mass distribution of 28 radionuclides in the repository region for the time period of 100 million yr has been successfully obtained by the code
Parallel adaptive simulations on unstructured meshes
International Nuclear Information System (INIS)
Shephard, M S; Jansen, K E; Sahni, O; Diachin, L A
2007-01-01
This paper discusses methods being developed by the ITAPS center to support the execution of parallel adaptive simulations on unstructured meshes. The paper first outlines the ITAPS approach to the development of interoperable mesh, geometry and field services to support the needs of SciDAC application in these areas. The paper then demonstrates the ability of unstructured adaptive meshing methods built on such interoperable services to effectively solve important physics problems. Attention is then focused on ITAPs' developing ability to solve adaptive unstructured mesh problems on massively parallel computers
International Nuclear Information System (INIS)
Nemoto, Toshiyuki; Watanabe, Hideo; Fujita, Toyozo; Kawai, Wataru; Harada, Hiroo; Gorai, Kazuo; Yamasaki, Kazuhiko; Shoji, Makoto; Fujii, Minoru.
1996-07-01
At Center for Promotion of Computational Science and Engineering, time consuming eight nuclear codes suggested by users have been vectorized, parallelized on the VPP500 computer system. In addition, two nuclear codes used on the VP2600 computer system were implemented on the VPP500 computer system. Neutron and photon transport calculation code MVP/GMVP and relativistic quantum molecular dynamics code QMDRELP have been parallelized. Extended quantum molecular dynamics code EQMD and adiabatic base calculation code HSABC have been parallelized and vectorized. Ballooning turbulence simulation code CURBAL, 3-D non-stationary compressible fluid dynamics code STREAM V3.1, operating plasma analysis code TOSCA and eddy current analysis code EDDYCAL have been vectorized. Reactor safety analysis code RELAP5/MOD2/C36-05 and RELAP5/MOD3 were implemented on the VPP500 computer system. (author)
Energy Technology Data Exchange (ETDEWEB)
Nemoto, Toshiyuki; Watanabe, Hideo; Fujita, Toyozo [Fujitsu Ltd., Tokyo (Japan); Kawai, Wataru; Harada, Hiroo; Gorai, Kazuo; Yamasaki, Kazuhiko; Shoji, Makoto; Fujii, Minoru
1996-06-01
At Center for Promotion of Computational Science and Engineering, time consuming eight nuclear codes suggested by users have been vectorized, parallelized on the VPP500 computer system. In addition, two nuclear codes used on the VP2600 computer system were implemented on the VPP500 computer system. Neutron and photon transport calculation code MVP/GMVP and relativistic quantum molecular dynamics code QMDRELP have been parallelized. Extended quantum molecular dynamics code EQMD and adiabatic base calculation code HSABC have been parallelized and vectorized. Ballooning turbulence simulation code CURBAL, 3-D non-stationary compressible fluid dynamics code STREAM V3.1, operating plasma analysis code TOSCA and eddy current analysis code EDDYCAL have been vectorized. Reactor safety analysis code RELAP5/MOD2/C36-05 and RELAP5/MOD3 were implemented on the VPP500 computer system. (author)
Defining the best parallelization strategy for a diphasic compressible fluid mechanics code
International Nuclear Information System (INIS)
Berthou, Jean-Yves; Fayolle, Eric; Faucher, Eric; Scliffet, Laurent
2000-01-01
parallelization strategy we recommend for codes comparable to ECOSS. (author)
Defining the best parallelization strategy for a diphasic compressible fluid mechanics code
Energy Technology Data Exchange (ETDEWEB)
Berthou, Jean-Yves; Fayolle, Eric [Electricite de France, Research and Development division, Modeling and Information Technologies Department, CLAMART CEDEX (France); Faucher, Eric; Scliffet, Laurent [Electricite de France, Research and Development Division, Mechanics and Component Technology Branch Department, Moret sur Loing (France)
2000-09-01
parallelization strategy we recommend for codes comparable to ECOSS. (author)
Porting of serial molecular dynamics code on MIMD platforms
International Nuclear Information System (INIS)
Celino, M.
1995-05-01
A molecular Dynamics (MD) code, utilized for the study of atomistic models of metallic systems has been parallelized for MIMD (Multiple Instructions Multiple Data) parallel platforms by means of the Parallel Virtual Machine (PVM) message passing library. Since the parallelization implies modifications of the sequential algorithms, these are described from the point of view of the Statistical Mechanics theory. Furthermore, techniques and parallelization strategies utilized and the MD parallel code are described in detail. Benchmarks on several MIMD platforms (IBM SP1 and SP2, Cray T3D, Cluster of workstations) allow performances evaluation of the code versus the different characteristics of the parallel platforms
Implementing Shared Memory Parallelism in MCBEND
Directory of Open Access Journals (Sweden)
Bird Adam
2017-01-01
Full Text Available MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheelers’s ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.
A Parallel Approach to Fractal Image Compression
Directory of Open Access Journals (Sweden)
Lubomir Dedera
2004-01-01
Full Text Available The paper deals with a parallel approach to coding and decoding algorithms in fractal image compressionand presents experimental results comparing sequential and parallel algorithms from the point of view of achieved bothcoding and decoding time and effectiveness of parallelization.
Smoldyn on graphics processing units: massively parallel Brownian dynamics simulations.
Dematté, Lorenzo
2012-01-01
Space is a very important aspect in the simulation of biochemical systems; recently, the need for simulation algorithms able to cope with space is becoming more and more compelling. Complex and detailed models of biochemical systems need to deal with the movement of single molecules and particles, taking into consideration localized fluctuations, transportation phenomena, and diffusion. A common drawback of spatial models lies in their complexity: models can become very large, and their simulation could be time consuming, especially if we want to capture the systems behavior in a reliable way using stochastic methods in conjunction with a high spatial resolution. In order to deliver the promise done by systems biology to be able to understand a system as whole, we need to scale up the size of models we are able to simulate, moving from sequential to parallel simulation algorithms. In this paper, we analyze Smoldyn, a widely diffused algorithm for stochastic simulation of chemical reactions with spatial resolution and single molecule detail, and we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs). The implementation executes the most computational demanding steps (computation of diffusion, unimolecular, and bimolecular reaction, as well as the most common cases of molecule-surface interaction) on the GPU, computing them in parallel on each molecule of the system. The implementation offers good speed-ups and real time, high quality graphics output
Particle In Cell Codes on Highly Parallel Architectures
Tableman, Adam
2014-10-01
We describe strategies and examples of Particle-In-Cell Codes running on Nvidia GPU and Intel Phi architectures. This includes basic implementations in skeletons codes and full-scale development versions (encompassing 1D, 2D, and 3D codes) in Osiris. Both the similarities and differences between Intel's and Nvidia's hardware will be examined. Work supported by grants NSF ACI 1339893, DOE DE SC 000849, DOE DE SC 0008316, DOE DE NA 0001833, and DOE DE FC02 04ER 54780.
Kim, M.-H.; Cho, J. H.; Park, S.-J.; Eden, J. G.
2017-08-01
Plasmachemical systems based on the production of a specific molecule (O3) in literally thousands of microchannel plasmas simultaneously have been demonstrated, developed and engineered over the past seven years, and commercialized. At the heart of this new plasma technology is the plasma chip, a flat aluminum strip fabricated by photolithographic and wet chemical processes and comprising 24-48 channels, micromachined into nanoporous aluminum oxide, with embedded electrodes. By integrating 4-6 chips into a module, the mass output of an ozone microplasma system is scaled linearly with the number of modules operating in parallel. A 115 g/hr (2.7 kg/day) ozone system, for example, is realized by the combined output of 18 modules comprising 72 chips and 1,800 microchannels. The implications of this plasma processing architecture for scaling ozone production capability, and reducing capital and service costs when introducing redundancy into the system, are profound. In contrast to conventional ozone generator technology, microplasma systems operate reliably (albeit with reduced output) in ambient air and humidity levels up to 90%, a characteristic attributable to the water adsorption/desorption properties and electrical breakdown strength of nanoporous alumina. Extensive testing has documented chip and system lifetimes (MTBF) beyond 5,000 hours, and efficiencies >130 g/kWh when oxygen is the feedstock gas. Furthermore, the weight and volume of microplasma systems are a factor of 3-10 lower than those for conventional ozone systems of comparable output. Massively-parallel plasmachemical processing offers functionality, performance, and commercial value beyond that afforded by conventional technology, and is currently in operation in more than 30 countries worldwide.
Malloy, Matt; Thiel, Brad; Bunday, Benjamin D.; Wurm, Stefan; Jindal, Vibhu; Mukhtar, Maseeh; Quoi, Kathy; Kemen, Thomas; Zeidler, Dirk; Eberle, Anna Lena; Garbowski, Tomasz; Dellemann, Gregor; Peters, Jan Hendrik
2015-09-01
The new device architectures and materials being introduced for sub-10nm manufacturing, combined with the complexity of multiple patterning and the need for improved hotspot detection strategies, have pushed current wafer inspection technologies to their limits. In parallel, gaps in mask inspection capability are growing as new generations of mask technologies are developed to support these sub-10nm wafer manufacturing requirements. In particular, the challenges associated with nanoimprint and extreme ultraviolet (EUV) mask inspection require new strategies that enable fast inspection at high sensitivity. The tradeoffs between sensitivity and throughput for optical and e-beam inspection are well understood. Optical inspection offers the highest throughput and is the current workhorse of the industry for both wafer and mask inspection. E-beam inspection offers the highest sensitivity but has historically lacked the throughput required for widespread adoption in the manufacturing environment. It is unlikely that continued incremental improvements to either technology will meet tomorrow's requirements, and therefore a new inspection technology approach is required; one that combines the high-throughput performance of optical with the high-sensitivity capabilities of e-beam inspection. To support the industry in meeting these challenges SUNY Poly SEMATECH has evaluated disruptive technologies that can meet the requirements for high volume manufacturing (HVM), for both the wafer fab [1] and the mask shop. Highspeed massively parallel e-beam defect inspection has been identified as the leading candidate for addressing the key gaps limiting today's patterned defect inspection techniques. As of late 2014 SUNY Poly SEMATECH completed a review, system analysis, and proof of concept evaluation of multiple e-beam technologies for defect inspection. A champion approach has been identified based on a multibeam technology from Carl Zeiss. This paper includes a discussion on the
Directory of Open Access Journals (Sweden)
Zhang Wei
2005-01-01
Full Text Available The optimum and many suboptimum iterative soft-input soft-output (SISO multiuser detectors require a priori information about the multiuser system, such as the users' transmitted signature waveforms, relative delays, as well as the channel impulse response. In this paper, we employ adaptive algorithms in the SISO multiuser detector in order to avoid the need for this a priori information. First, we derive the optimum SISO parallel decision-feedback detector for asynchronous coded DS-CDMA systems. Then, we propose two adaptive versions of this SISO detector, which are based on the normalized least mean square (NLMS and recursive least squares (RLS algorithms. Our SISO adaptive detectors effectively exploit the a priori information of coded symbols, whose soft inputs are obtained from a bank of single-user decoders. Furthermore, we consider how to select practical finite feedforward and feedback filter lengths to obtain a good tradeoff between the performance and computational complexity of the receiver.
An optimization of a GPU-based parallel wind field module
International Nuclear Information System (INIS)
Pinheiro, André L.S.; Shirru, Roberto
2017-01-01
Atmospheric radionuclide dispersion systems (ARDS) are important tools to predict the impact of radioactive releases from Nuclear Power Plants and guide people evacuation from affected areas. Four modules comprise ARDS: Source Term, Wind Field, Plume Dispersion and Doses Calculations. The slowest is the Wind Field Module that was previously parallelized using the CUDA C language. The statement purpose of this work is to show the speedup gain with the optimization of the already parallel code of the GPU-based Wind Field module, based in WEST model (Extrapolated from Stability and Terrain). Due to the parallelization done in the wind field module, it was observed that some CUDA processors became idle, thus contributing to a reduction in speedup. It was proposed in this work a way of allocating these idle CUDA processors in order to increase the speedup. An acceleration of about 4 times can be seen in the comparative case study between the regular CUDA code and the optimized CUDA code. These results are quite motivating and point out that even after a parallelization of code, a parallel code optimization should be taken into account. (author)
An optimization of a GPU-based parallel wind field module
Energy Technology Data Exchange (ETDEWEB)
Pinheiro, André L.S.; Shirru, Roberto [Coordenacao de Pos-Graduacao e Pesquisa de Engenharia (PEN/COPPE/UFRJ), Rio de Janeiro, RJ (Brazil). Programa de Engenharia Nuclear; Pereira, Cláudio M.N.A., E-mail: apinheiro99@gmail.com, E-mail: schirru@lmp.ufrj.br, E-mail: cmnap@ien.gov.br [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil)
2017-07-01
Atmospheric radionuclide dispersion systems (ARDS) are important tools to predict the impact of radioactive releases from Nuclear Power Plants and guide people evacuation from affected areas. Four modules comprise ARDS: Source Term, Wind Field, Plume Dispersion and Doses Calculations. The slowest is the Wind Field Module that was previously parallelized using the CUDA C language. The statement purpose of this work is to show the speedup gain with the optimization of the already parallel code of the GPU-based Wind Field module, based in WEST model (Extrapolated from Stability and Terrain). Due to the parallelization done in the wind field module, it was observed that some CUDA processors became idle, thus contributing to a reduction in speedup. It was proposed in this work a way of allocating these idle CUDA processors in order to increase the speedup. An acceleration of about 4 times can be seen in the comparative case study between the regular CUDA code and the optimized CUDA code. These results are quite motivating and point out that even after a parallelization of code, a parallel code optimization should be taken into account. (author)
International Nuclear Information System (INIS)
Jimenez, Miguel A.; Martin-Valdepenas, Juan M.; Martin-Fuertes, Francisco; Fernandez, Jose A.
2007-01-01
A detailed chemistry model has been adapted and developed for surface chemistry, heat and mass transfer between H 2 /CO/air/steam/CO 2 mixtures and vertical parallel Pt-coated surfaces. This model is based onto a simplified Deutschmann reaction scheme for methane surface combustion and the analysis by Elenbaas for buoyancy-induced heat transfer between parallel plates. Mass transfer is treated by the heat and mass transfer analogy. The proposed model is able to simulate the H 2 /CO recombination phenomena characteristic of parallel-plate Passive Autocatalytic Recombiners (PARs), which have been proposed and implemented as a promising hydrogen-control strategy in the safety of nuclear power stations or other industries. The transient model is able to approach the warm-up phase of the PAR and its shut-down as well as the dynamic changes within the surrounding atmosphere. The model has been implemented within the MELCOR code and assessed against results of the Battelle Model Containment tests of the Zx series. Results show accurate predictions and a better performance than traditional methods in integral codes, i.e. empirical correlations, which are also much case-specific. Influence of CO present in the mixture on the PAR performance is also addressed in this paper
Development of a parallelization method for KENO V.a
International Nuclear Information System (INIS)
Basoglu, B.; Bentley, C.; Dunn, M.
1995-01-01
The KENO V.a codes is a widely used Monte carlo codes that is part of the SCALE modular codes system for performing standardized computer analysis of nuclear systems for licensing evaluation. In the past few years, attempts have been made to speed up KENO V.a using new generation computers. In this paper we report on the initial development of a parallel version of KENO V.a for the Kendall Square Research supercomputer (KSRI) at ORNL. Investigations thus far have shown that the parallel code provides accurate results with significantly reduced computation times relative to the conventional KENO V.a code
Applications of Transport/Reaction Codes to Problems in Cell Modeling; TOPICAL
International Nuclear Information System (INIS)
MEANS, SHAWN A.; RINTOUL, MARK DANIEL; SHADID, JOHN N.
2001-01-01
We demonstrate two specific examples that show how our exiting capabilities in solving large systems of partial differential equations associated with transport/reaction systems can be easily applied to outstanding problems in computational biology. First, we examine a three-dimensional model for calcium wave propagation in a Xenopus Laevis frog egg and verify that a proposed model for the distribution of calcium release sites agrees with experimental results as a function of both space and time. Next, we create a model of the neuron's terminus based on experimental observations and show that the sodium-calcium exchanger is not the route of sodium's modulation of neurotransmitter release. These state-of-the-art simulations were performed on massively parallel platforms and required almost no modification of existing Sandia codes
Parallel-vector algorithms for particle simulations on shared-memory multiprocessors
International Nuclear Information System (INIS)
Nishiura, Daisuke; Sakaguchi, Hide
2011-01-01
Over the last few decades, the computational demands of massive particle-based simulations for both scientific and industrial purposes have been continuously increasing. Hence, considerable efforts are being made to develop parallel computing techniques on various platforms. In such simulations, particles freely move within a given space, and so on a distributed-memory system, load balancing, i.e., assigning an equal number of particles to each processor, is not guaranteed. However, shared-memory systems achieve better load balancing for particle models, but suffer from the intrinsic drawback of memory access competition, particularly during (1) paring of contact candidates from among neighboring particles and (2) force summation for each particle. Here, novel algorithms are proposed to overcome these two problems. For the first problem, the key is a pre-conditioning process during which particle labels are sorted by a cell label in the domain to which the particles belong. Then, a list of contact candidates is constructed by pairing the sorted particle labels. For the latter problem, a table comprising the list indexes of the contact candidate pairs is created and used to sum the contact forces acting on each particle for all contacts according to Newton's third law. With just these methods, memory access competition is avoided without additional redundant procedures. The parallel efficiency and compatibility of these two algorithms were evaluated in discrete element method (DEM) simulations on four types of shared-memory parallel computers: a multicore multiprocessor computer, scalar supercomputer, vector supercomputer, and graphics processing unit. The computational efficiency of a DEM code was found to be drastically improved with our algorithms on all but the scalar supercomputer. Thus, the developed parallel algorithms are useful on shared-memory parallel computers with sufficient memory bandwidth.
Parallel algorithms for interactive manipulation of digital terrain models
Davis, E. W.; Mcallister, D. F.; Nagaraj, V.
1988-01-01
Interactive three-dimensional graphics applications, such as terrain data representation and manipulation, require extensive arithmetic processing. Massively parallel machines are attractive for this application since they offer high computational rates, and grid connected architectures provide a natural mapping for grid based terrain models. Presented here are algorithms for data movement on the massive parallel processor (MPP) in support of pan and zoom functions over large data grids. It is an extension of earlier work that demonstrated real-time performance of graphics functions on grids that were equal in size to the physical dimensions of the MPP. When the dimensions of a data grid exceed the processing array size, data is packed in the array memory. Windows of the total data grid are interactively selected for processing. Movement of packed data is needed to distribute items across the array for efficient parallel processing. Execution time for data movement was found to exceed that for arithmetic aspects of graphics functions. Performance figures are given for routines written in MPP Pascal.
Parallel processing architecture for H.264 deblocking filter on multi-core platforms
Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao
2012-03-01
Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve lowlatency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programing model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking
Massively parallel algorithms for trace-driven cache simulations
Nicol, David M.; Greenberg, Albert G.; Lubachevsky, Boris D.
1991-01-01
Trace driven cache simulation is central to computer design. A trace is a very long sequence of reference lines from main memory. At the t(exp th) instant, reference x sub t is hashed into a set of cache locations, the contents of which are then compared with x sub t. If at the t sup th instant x sub t is not present in the cache, then it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x sub t present for the (t+1) sup st instant. The problem of parallel simulation of a subtrace of N references directed to a C line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least Recently Used (LRU) policy, which regradless of the set size C runs in time O(log N) using N processors on the exclusive read, exclusive write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. Timings are presented of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference based line replacement policies are considered, which includes LRU as well as the Least Frequently Used and Random replacement policies. A simulation method is presented for any such policy that on any trace of length N directed to a C line set runs in the O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
Energy Technology Data Exchange (ETDEWEB)
Verma, Prakash; Morales, Jorge A., E-mail: jorge.morales@ttu.edu [Department of Chemistry and Biochemistry, Texas Tech University, P.O. Box 41061, Lubbock, Texas 79409-1061 (United States); Perera, Ajith [Department of Chemistry and Biochemistry, Texas Tech University, P.O. Box 41061, Lubbock, Texas 79409-1061 (United States); Department of Chemistry, Quantum Theory Project, University of Florida, Gainesville, Florida 32611 (United States)
2013-11-07
Coupled cluster (CC) methods provide highly accurate predictions of molecular properties, but their high computational cost has precluded their routine application to large systems. Fortunately, recent computational developments in the ACES III program by the Bartlett group [the OED/ERD atomic integral package, the super instruction processor, and the super instruction architecture language] permit overcoming that limitation by providing a framework for massively parallel CC implementations. In that scheme, we are further extending those parallel CC efforts to systematically predict the three main electron spin resonance (ESR) tensors (A-, g-, and D-tensors) to be reported in a series of papers. In this paper inaugurating that series, we report our new ACES III parallel capabilities that calculate isotropic hyperfine coupling constants in 38 neutral, cationic, and anionic radicals that include the {sup 11}B, {sup 17}O, {sup 9}Be, {sup 19}F, {sup 1}H, {sup 13}C, {sup 35}Cl, {sup 33}S,{sup 14}N, {sup 31}P, and {sup 67}Zn nuclei. Present parallel calculations are conducted at the Hartree-Fock (HF), second-order many-body perturbation theory [MBPT(2)], CC singles and doubles (CCSD), and CCSD with perturbative triples [CCSD(T)] levels using Roos augmented double- and triple-zeta atomic natural orbitals basis sets. HF results consistently overestimate isotropic hyperfine coupling constants. However, inclusion of electron correlation effects in the simplest way via MBPT(2) provides significant improvements in the predictions, but not without occasional failures. In contrast, CCSD results are consistently in very good agreement with experimental results. Inclusion of perturbative triples to CCSD via CCSD(T) leads to small improvements in the predictions, which might not compensate for the extra computational effort at a non-iterative N{sup 7}-scaling in CCSD(T). The importance of these accurate computations of isotropic hyperfine coupling constants to elucidate
Parallel Monte Carlo reactor neutronics
International Nuclear Information System (INIS)
Blomquist, R.N.; Brown, F.B.
1994-01-01
The issues affecting implementation of parallel algorithms for large-scale engineering Monte Carlo neutron transport simulations are discussed. For nuclear reactor calculations, these include load balancing, recoding effort, reproducibility, domain decomposition techniques, I/O minimization, and strategies for different parallel architectures. Two codes were parallelized and tested for performance. The architectures employed include SIMD, MIMD-distributed memory, and workstation network with uneven interactive load. Speedups linear with the number of nodes were achieved
Malloy, Matt; Thiel, Brad; Bunday, Benjamin D.; Wurm, Stefan; Mukhtar, Maseeh; Quoi, Kathy; Kemen, Thomas; Zeidler, Dirk; Eberle, Anna Lena; Garbowski, Tomasz; Dellemann, Gregor; Peters, Jan Hendrik
2015-03-01
SEMATECH aims to identify and enable disruptive technologies to meet the ever-increasing demands of semiconductor high volume manufacturing (HVM). As such, a program was initiated in 2012 focused on high-speed e-beam defect inspection as a complement, and eventual successor, to bright field optical patterned defect inspection [1]. The primary goal is to enable a new technology to overcome the key gaps that are limiting modern day inspection in the fab; primarily, throughput and sensitivity to detect ultra-small critical defects. The program specifically targets revolutionary solutions based on massively parallel e-beam technologies, as opposed to incremental improvements to existing e-beam and optical inspection platforms. Wafer inspection is the primary target, but attention is also being paid to next generation mask inspection. During the first phase of the multi-year program multiple technologies were reviewed, a down-selection was made to the top candidates, and evaluations began on proof of concept systems. A champion technology has been selected and as of late 2014 the program has begun to move into the core technology maturation phase in order to enable eventual commercialization of an HVM system. Performance data from early proof of concept systems will be shown along with roadmaps to achieving HVM performance. SEMATECH's vision for moving from early-stage development to commercialization will be shown, including plans for development with industry leading technology providers.
LiNbO3: A photovoltaic substrate for massive parallel manipulation and patterning of nano-objects
International Nuclear Information System (INIS)
Carrascosa, M.; García-Cabañes, A.; Jubera, M.; Ramiro, J. B.; Agulló-López, F.
2015-01-01
The application of evanescent photovoltaic (PV) fields, generated by visible illumination of Fe:LiNbO 3 substrates, for parallel massive trapping and manipulation of micro- and nano-objects is critically reviewed. The technique has been often referred to as photovoltaic or photorefractive tweezers. The main advantage of the new method is that the involved electrophoretic and/or dielectrophoretic forces do not require any electrodes and large scale manipulation of nano-objects can be easily achieved using the patterning capabilities of light. The paper describes the experimental techniques for particle trapping and the main reported experimental results obtained with a variety of micro- and nano-particles (dielectric and conductive) and different illumination configurations (single beam, holographic geometry, and spatial light modulator projection). The report also pays attention to the physical basis of the method, namely, the coupling of the evanescent photorefractive fields to the dielectric response of the nano-particles. The role of a number of physical parameters such as the contrast and spatial periodicities of the illumination pattern or the particle deposition method is discussed. Moreover, the main properties of the obtained particle patterns in relation to potential applications are summarized, and first demonstrations reviewed. Finally, the PV method is discussed in comparison to other patterning strategies, such as those based on the pyroelectric response and the electric fields associated to domain poling of ferroelectric materials
Integrated massively parallel sequencing of 15 autosomal STRs and Amelogenin using