WorldWideScience

Sample records for unit gpu hardware

  1. Travel Software using GPU Hardware

    CERN Document Server

    Szalwinski, Chris M; Dimov, Veliko Atanasov; CERN. Geneva. ATS Department

    2015-01-01

    Travel is the main multi-particle tracking code being used at CERN for the beam dynamics calculations through hadron and ion linear accelerators. It uses two routines for the calculation of space charge forces, namely, rings of charges and point-to-point. This report presents the studies to improve the performance of Travel using GPU hardware. The studies showed that the performance of Travel with the point-to-point simulations of space-charge effects can be speeded up at least 72 times using current GPU hardware. Simple recompilation of the source code using an Intel compiler can improve performance at least 4 times without GPU support. The limited memory of the GPU is the bottleneck. Two algorithms were investigated on this point: repeated computation and tiling. The repeating computation algorithm is simpler and is the currently recommended solution. The tiling algorithm was more complicated and degraded performance. Both build and test instructions for the parallelized version of the software are inclu...

  2. Basket Option Pricing Using GP-GPU Hardware Acceleration

    KAUST Repository

    Douglas, Craig C.

    2010-08-01

    We introduce a basket option pricing problem arisen in financial mathematics. We discretized the problem based on the alternating direction implicit (ADI) method and parallel cyclic reduction is applied to solve the set of tridiagonal matrices generated by the ADI method. To reduce the computational time of the problem, a general purpose graphics processing units (GP-GPU) environment is considered. Numerical results confirm the convergence and efficiency of the proposed method. © 2010 IEEE.

  3. RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles

    DEFF Research Database (Denmark)

    Bailey, Nicholas; Ingebrigtsen, Trond; Hansen, Jesper Schmidt

    2017-01-01

    RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPU’s). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up...

  4. Optimizing memory-bound SYMV kernel on GPU hardware accelerators

    KAUST Repository

    Abdelfattah, Ahmad

    2013-01-01

    Hardware accelerators are becoming ubiquitous high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA), profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improve productivity, while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on nVidia Fermi GPUs. Due to its inherent memory-bound nature, this kernel is very critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step to calculate the eigenpairs. Using a novel design to address the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x fold speedups over the similar CUBLAS 4.0 kernel, and 7-8% and 30% fold improvement over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library in single and double precision arithmetics, respectively. © 2013 Springer-Verlag.

  5. RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles

    Directory of Open Access Journals (Sweden)

    Nicholas P. Bailey, Trond S. Ingebrigtsen, Jesper Schmidt Hansen, Arno A. Veldhorst, Lasse Bøhling, Claire A. Lemarchand, Andreas E. Olsen, Andreas K. Bacher, Lorenzo Costigliola, Ulf R. Pedersen, Heine Larsen, Jeppe C. Dyre, Thomas B. Schrøder

    2017-12-01

    Full Text Available RUMD is a general purpose, high-performance molecular dynamics (MD simulation package running on graphical processing units (GPU's. RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up to hundred thousand particles. It has a performance that is comparable to other GPU-MD codes at large system sizes and substantially better at smaller sizes.RUMD is open-source and consists of a library written in C++ and the CUDA extension to C, an easy-to-use Python interface, and a set of tools for set-up and post-simulation data analysis. The paper describes RUMD's main features, optimizations and performance benchmarks.

  6. Optimizing memory-bound SYMV kernel on GPU hardware accelerators

    KAUST Repository

    Abdelfattah, Ahmad; Dongarra, Jack; Keyes, David E.; Ltaief, Hatem

    2013-01-01

    and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x fold speedups over the similar CUBLAS 4.0 kernel, and 7-8% and 30% fold improvement over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library in single and double

  7. Parallelized Local Volatility Estimation Using GP-GPU Hardware Acceleration

    KAUST Repository

    Douglas, Craig C.

    2010-01-01

    We introduce an inverse problem for the local volatility model in option pricing. We solve the problem using the Levenberg-Marquardt algorithm and use the notion of the Fréchet derivative when calculating the Jacobian matrix. We analyze the existence of the Fréchet derivative and its numerical computation. To reduce the computational time of the inverse problem, a GP-GPU environment is considered for parallel computation. Numerical results confirm the validity and efficiency of the proposed method. ©2010 IEEE.

  8. Fast parallel tandem mass spectral library searching using GPU hardware acceleration.

    Science.gov (United States)

    Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B

    2011-06-03

    Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.

  9. GPU Computing For Particle Tracking

    International Nuclear Information System (INIS)

    Nishimura, Hiroshi; Song, Kai; Muriki, Krishna; Sun, Changchun; James, Susan; Qin, Yong

    2011-01-01

    This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performances, issues, and challenges from introducing GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) bring massive parallel computing capabilities to numerical calculation. However, the unique architecture of GPU requires a comprehensive understanding of the hardware and programming model to be able to well optimize existing applications. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time consuming part of the accelerator modeling and simulation, can benefit from GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU which consists of 14 multi-processois (MP) with 32 cores on each MP, therefore a total of 448 cores, to host thousands ot threads dynamically. Thread is a logical execution unit of the program on GPU. In the GPU programming model, threads are grouped into a collection of blocks Within each block, multiple threads share the same code, and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ (2) is developed to demonstrate the possibility of using GPU to speed up the dynamic aperture calculation by having each thread track a particle.

  10. Hardware based redundant multi-threading inside a GPU for improved reliability

    Science.gov (United States)

    Sridharan, Vilas; Gurumurthi, Sudhanva

    2015-05-05

    A system and method for verifying computation output using computer hardware are provided. Instances of computation are generated and processed on hardware-based processors. As instances of computation are processed, each instance of computation receives a load accessible to other instances of computation. Instances of output are generated by processing the instances of computation. The instances of output are verified against each other in a hardware based processor to ensure accuracy of the output.

  11. Hardware dependencies of GPU-accelerated beamformer performances for microwave breast cancer detection

    Directory of Open Access Journals (Sweden)

    Salomon Christoph J.

    2016-09-01

    Full Text Available UWB microwave imaging has proven to be a promising technique for early-stage breast cancer detection. The extensive image reconstruction time can be accelerated by parallelizing the execution of the underlying beamforming algorithms. However, the efficiency of the parallelization will most likely depend on the grade of parallelism of the imaging algorithm and of the utilized hardware. This paper investigates the dependencies of two different beamforming algorithms on multiple hardware specification of several graphics boards. The parallel implementation is realized by using NVIDIA’s CUDA. Three conclusions are drawn about the behavior of the parallel implementation and how to efficiently use the accessible hardware.

  12. Efficient Sphere Detector Algorithm for Massive MIMO using GPU Hardware Accelerator

    KAUST Repository

    Arfaoui, Mohamed-Amine

    2016-06-01

    To further enhance the capacity of next generation wireless communication systems, massive MIMO has recently appeared as a necessary enabling technology to achieve high performance signal processing for large-scale multiple antennas. However, massive MIMO systems inevitably generate signal processing overheads, which translate into ever-increasing rate of complexity, and therefore, such system may not maintain the inherent real-time requirement of wireless systems. We redesign the non-linear sphere decoder method to increase the performance of the system, cast most memory-bound computations into compute-bound operations to reduce the overall complexity, and maintain the real-time processing thanks to the GPU computational power. We show a comprehensive complexity and performance analysis on an unprecedented MIMO system scale, which can ease the design phase toward simulating future massive MIMO wireless systems.

  13. Efficient Sphere Detector Algorithm for Massive MIMO using GPU Hardware Accelerator

    KAUST Repository

    Arfaoui, Mohamed-Amine; Ltaief, Hatem; Rezki, Zouheir; Alouini, Mohamed-Slim; Keyes, David E.

    2016-01-01

    To further enhance the capacity of next generation wireless communication systems, massive MIMO has recently appeared as a necessary enabling technology to achieve high performance signal processing for large-scale multiple antennas. However, massive MIMO systems inevitably generate signal processing overheads, which translate into ever-increasing rate of complexity, and therefore, such system may not maintain the inherent real-time requirement of wireless systems. We redesign the non-linear sphere decoder method to increase the performance of the system, cast most memory-bound computations into compute-bound operations to reduce the overall complexity, and maintain the real-time processing thanks to the GPU computational power. We show a comprehensive complexity and performance analysis on an unprecedented MIMO system scale, which can ease the design phase toward simulating future massive MIMO wireless systems.

  14. Online real-time reconstruction of adaptive TSENSE with commodity CPU / GPU hardware

    DEFF Research Database (Denmark)

    Roujol, Sebastien; de Senneville, Baudouin; Vahala, E.

    2009-01-01

    A real-time reconstruction for adaptive TSENSE is presented that is optimized for MR-guidance of interventional procedures. The proposed method allows high frame-rate imaging with low image latencies, even when large coil arrays are employed and can be implemented on affordable commodity hardware....

  15. Graphics processing unit (GPU) real-time infrared scene generation

    Science.gov (United States)

    Christie, Chad L.; Gouthas, Efthimios (Themie); Williams, Owen M.

    2007-04-01

    VIRSuite, the GPU-based suite of software tools developed at DSTO for real-time infrared scene generation, is described. The tools include the painting of scene objects with radiometrically-associated colours, translucent object generation, polar plot validation and versatile scene generation. Special features include radiometric scaling within the GPU and the presence of zoom anti-aliasing at the core of VIRSuite. Extension of the zoom anti-aliasing construct to cover target embedding and the treatment of translucent objects is described.

  16. Massive Parallelism of Monte-Carlo Simulation on Low-End Hardware using Graphic Processing Units

    Energy Technology Data Exchange (ETDEWEB)

    Mburu, Joe Mwangi; Hah, Chang Joo Hah [KEPCO International Nuclear Graduate School, Ulsan (Korea, Republic of)

    2014-05-15

    Within the past decade, research has been done on utilizing GPU massive parallelization in core simulation with impressive results but unfortunately, not much commercial application has been done in the nuclear field especially in reactor core simulation. The purpose of this paper is to give an introductory concept on the topic and illustrate the potential of exploiting the massive parallel nature of GPU computing on a simple monte-carlo simulation with very minimal hardware specifications. To do a comparative analysis, a simple two dimension monte-carlo simulation is implemented for both the CPU and GPU in order to evaluate performance gain based on the computing devices. The heterogeneous platform utilized in this analysis is done on a slow notebook with only 1GHz processor. The end results are quite surprising whereby high speedups obtained are almost a factor of 10. In this work, we have utilized heterogeneous computing in a GPU-based approach in applying potential high arithmetic intensive calculation. By applying a complex monte-carlo simulation on GPU platform, we have speed up the computational process by almost a factor of 10 based on one million neutrons. This shows how easy, cheap and efficient it is in using GPU in accelerating scientific computing and the results should encourage in exploring further this avenue especially in nuclear reactor physics simulation where deterministic and stochastic calculations are quite favourable in parallelization.

  17. Massive Parallelism of Monte-Carlo Simulation on Low-End Hardware using Graphic Processing Units

    International Nuclear Information System (INIS)

    Mburu, Joe Mwangi; Hah, Chang Joo Hah

    2014-01-01

    Within the past decade, research has been done on utilizing GPU massive parallelization in core simulation with impressive results but unfortunately, not much commercial application has been done in the nuclear field especially in reactor core simulation. The purpose of this paper is to give an introductory concept on the topic and illustrate the potential of exploiting the massive parallel nature of GPU computing on a simple monte-carlo simulation with very minimal hardware specifications. To do a comparative analysis, a simple two dimension monte-carlo simulation is implemented for both the CPU and GPU in order to evaluate performance gain based on the computing devices. The heterogeneous platform utilized in this analysis is done on a slow notebook with only 1GHz processor. The end results are quite surprising whereby high speedups obtained are almost a factor of 10. In this work, we have utilized heterogeneous computing in a GPU-based approach in applying potential high arithmetic intensive calculation. By applying a complex monte-carlo simulation on GPU platform, we have speed up the computational process by almost a factor of 10 based on one million neutrons. This shows how easy, cheap and efficient it is in using GPU in accelerating scientific computing and the results should encourage in exploring further this avenue especially in nuclear reactor physics simulation where deterministic and stochastic calculations are quite favourable in parallelization

  18. permGPU: Using graphics processing units in RNA microarray association studies

    Directory of Open Access Journals (Sweden)

    George Stephen L

    2010-06-01

    Full Text Available Abstract Background Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, that are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. Results We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. Conclusions permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.

  19. Hardware interface unit for control of shuttle RMS vibrations

    Science.gov (United States)

    Lindsay, Thomas S.; Hansen, Joseph M.; Manouchehri, Davoud; Forouhar, Kamran

    1994-01-01

    Vibration of the Shuttle Remote Manipulator System (RMS) increases the time for task completion and reduces task safety for manipulator-assisted operations. If the dynamics of the manipulator and the payload can be physically isolated, performance should improve. Rockwell has developed a self contained hardware unit which interfaces between a manipulator arm and payload. The End Point Control Unit (EPCU) is built and is being tested at Rockwell and at the Langley/Marshall Coupled, Multibody Spacecraft Control Research Facility in NASA's Marshall Space Flight Center in Huntsville, Alabama.

  20. The Performance Improvement of the Lagrangian Particle Dispersion Model (LPDM) Using Graphics Processing Unit (GPU) Computing

    Science.gov (United States)

    2017-08-01

    used for its GPU computing capability during the experiment. It has Nvidia Tesla K40 GPU accelerators containing 32 GPU nodes consisting of 1024...cores. CUDA is a parallel computing platform and application programming interface (API) model that was created and designed by Nvidia to give direct...Agricultural and Forest Meteorology. 1995:76:277–291, ISSN 0168-1923. 3. GPU vs. CPU? What is GPU computing? Santa Clara (CA): Nvidia Corporation; 2017

  1. GPU-based high-performance computing for radiation therapy

    International Nuclear Information System (INIS)

    Jia, Xun; Jiang, Steve B; Ziegenhein, Peter

    2014-01-01

    Recent developments in radiotherapy therapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of study has been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. (topical review)

  2. Software Graphics Processing Unit (sGPU) for Deep Space Applications

    Science.gov (United States)

    McCabe, Mary; Salazar, George; Steele, Glen

    2015-01-01

    A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggest they cannot operate in the deep space radiation environment. Investigation into an Software Graphics Processing Unit (sGPU)comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.

  3. R-GPU : A reconfigurable GPU architecture

    NARCIS (Netherlands)

    van den Braak, G.J.; Corporaal, H.

    2016-01-01

    Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for massively parallel applications. The compute power arises from the GPU's Single Instruction/Multiple Threads

  4. SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research

    Energy Technology Data Exchange (ETDEWEB)

    Jia, X; Folkerts, M [The University of Texas Southwestern Medical Ctr, Dallas, TX (United States); Shi, F; Yan, H; Yan, Y; Jiang, S [UT Southwestern Medical Center, Dallas, TX (United States); Sivagnanam, S; Majumdar, A [University of California San Diego, La Jolla, CA (United States)

    2014-06-01

    Purpose: Graphics Processing Unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, it is quite often to see repeated efforts spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross comparing them, and facilitating code distributions in radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they can select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purpose. After the code running is complete, the system automatically summarizes and displays the computing results. We also released a SDK to allow the developers to build their own algorithm implementation and submit their binary codes to the system. The submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. The developers can also compare their codes with others and generate benchmarking reports. Results: It is found that the developed system is fully functioning. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing the clinical researchers and developers to access the GPUs and GPU codes. This development will facilitate the utilization of GPU in radiation therapy field.

  5. Survey of using GPU CUDA programming model in medical image analysis

    Directory of Open Access Journals (Sweden)

    T. Kalaiselvi

    2017-01-01

    Full Text Available With the technology development of medical industry, processing data is expanding rapidly and computation time also increases due to many factors like 3D, 4D treatment planning, the increasing sophistication of MRI pulse sequences and the growing complexity of algorithms. Graphics processing unit (GPU addresses these problems and gives the solutions for using their features such as, high computation throughput, high memory bandwidth, support for floating-point arithmetic and low cost. Compute unified device architecture (CUDA is a popular GPU programming model introduced by NVIDIA for parallel computing. This review paper briefly discusses the need of GPU CUDA computing in the medical image analysis. The GPU performances of existing algorithms are analyzed and the computational gain is discussed. A few open issues, hardware configurations and optimization principles of existing methods are discussed. This survey concludes the few optimization techniques with the medical imaging algorithms on GPU. Finally, limitation and future scope of GPU programming are discussed.

  6. 78 FR 22347 - GPU Nuclear Inc., Three Mile Island Nuclear Power Station, Unit 2, Exemption From Certain...

    Science.gov (United States)

    2013-04-15

    ... NUCLEAR REGULATORY COMMISSION [Docket No. 50-320; NRC-2013-0065] GPU Nuclear Inc., Three Mile Island Nuclear Power Station, Unit 2, Exemption From Certain Security Requirements AGENCY: Nuclear... and State Materials and Environmental Management Programs, U.S. Nuclear Regulatory Commission...

  7. Numerical simulation of lava flow using a GPU SPH model

    Directory of Open Access Journals (Sweden)

    Eugenio Rustico

    2011-12-01

    Full Text Available A smoothed particle hydrodynamics (SPH method for lava-flow modeling was implemented on a graphical processing unit (GPU using the compute unified device architecture (CUDA developed by NVIDIA. This resulted in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with free-surface, non-Newtonian fluids, and with phase change. The entire SPH code has three main components, neighbor list construction, force computation, and integration of the equation of motion, and it is computed on the GPU, fully exploiting the computational power. The simulation speed achieved is one to two orders of magnitude faster than the equivalent central processing unit (CPU code. This GPU implementation of SPH allows high resolution SPH modeling in hours and days, rather than in weeks and months, on inexpensive and readily available hardware.

  8. Performance evaluation for volumetric segmentation of multiple sclerosis lesions using MATLAB and computing engine in the graphical processing unit (GPU)

    Science.gov (United States)

    Le, Anh H.; Park, Young W.; Ma, Kevin; Jacobs, Colin; Liu, Brent J.

    2010-03-01

    Multiple Sclerosis (MS) is a progressive neurological disease affecting myelin pathways in the brain. Multiple lesions in the white matter can cause paralysis and severe motor disabilities of the affected patient. To solve the issue of inconsistency and user-dependency in manual lesion measurement of MRI, we have proposed a 3-D automated lesion quantification algorithm to enable objective and efficient lesion volume tracking. The computer-aided detection (CAD) of MS, written in MATLAB, utilizes K-Nearest Neighbors (KNN) method to compute the probability of lesions on a per-voxel basis. Despite the highly optimized algorithm of imaging processing that is used in CAD development, MS CAD integration and evaluation in clinical workflow is technically challenging due to the requirement of high computation rates and memory bandwidth in the recursive nature of the algorithm. In this paper, we present the development and evaluation of using a computing engine in the graphical processing unit (GPU) with MATLAB for segmentation of MS lesions. The paper investigates the utilization of a high-end GPU for parallel computing of KNN in the MATLAB environment to improve algorithm performance. The integration is accomplished using NVIDIA's CUDA developmental toolkit for MATLAB. The results of this study will validate the practicality and effectiveness of the prototype MS CAD in a clinical setting. The GPU method may allow MS CAD to rapidly integrate in an electronic patient record or any disease-centric health care system.

  9. Transportable GPU (General Processor Units) chip set technology for standard computer architectures

    Science.gov (United States)

    Fosdick, R. E.; Denison, H. C.

    1982-11-01

    The USAFR-developed GPU Chip Set has been utilized by Tracor to implement both USAF and Navy Standard 16-Bit Airborne Computer Architectures. Both configurations are currently being delivered into DOD full-scale development programs. Leadless Hermetic Chip Carrier packaging has facilitated implementation of both architectures on single 41/2 x 5 substrates. The CMOS and CMOS/SOS implementations of the GPU Chip Set have allowed both CPU implementations to use less than 3 watts of power each. Recent efforts by Tracor for USAF have included the definition of a next-generation GPU Chip Set that will retain the application-proven architecture of the current chip set while offering the added cost advantages of transportability across ISO-CMOS and CMOS/SOS processes and across numerous semiconductor manufacturers using a newly-defined set of common design rules. The Enhanced GPU Chip Set will increase speed by an approximate factor of 3 while significantly reducing chip counts and costs of standard CPU implementations.

  10. GPU accelerated manifold correction method for spinning compact binaries

    Science.gov (United States)

    Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying

    2018-04-01

    The graphics processing unit (GPU) acceleration of the manifold correction algorithm based on the compute unified device architecture (CUDA) technology is designed to simulate the dynamic evolution of the Post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and the efficiency of parallel computation on GPU have been confirmed by various numerical experiments. The numerical comparisons show that the accuracy on GPU execution of manifold corrections method has a good agreement with the execution of codes on merely central processing unit (CPU-based) method. The acceleration ability when the codes are implemented on GPU can increase enormously through the use of shared memory and register optimization techniques without additional hardware costs, implying that the speedup is nearly 13 times as compared with the codes executed on CPU for phase space scan (including 314 × 314 orbits). In addition, GPU-accelerated manifold correction method is used to numerically study how dynamics are affected by the spin-induced quadrupole-monopole interaction for black hole binary system.

  11. GPU based Monte Carlo for PET image reconstruction: detector modeling

    International Nuclear Information System (INIS)

    Légrády; Cserkaszky, Á; Lantos, J.; Patay, G.; Bükki, T.

    2011-01-01

    Monte Carlo (MC) calculations and Graphical Processing Units (GPUs) are almost like the dedicated hardware designed for the specific task given the similarities between visible light transport and neutral particle trajectories. A GPU based MC gamma transport code has been developed for Positron Emission Tomography iterative image reconstruction calculating the projection from unknowns to data at each iteration step taking into account the full physics of the system. This paper describes the simplified scintillation detector modeling and its effect on convergence. (author)

  12. Implementing the lattice Boltzmann model on commodity graphics hardware

    International Nuclear Information System (INIS)

    Kaufman, Arie; Fan, Zhe; Petkov, Kaloian

    2009-01-01

    Modern graphics processing units (GPUs) can perform general-purpose computations in addition to the native specialized graphics operations. Due to the highly parallel nature of graphics processing, the GPU has evolved into a many-core coprocessor that supports high data parallelism. Its performance has been growing at a rate of squared Moore's law, and its peak floating point performance exceeds that of the CPU by an order of magnitude. Therefore, it is a viable platform for time-sensitive and computationally intensive applications. The lattice Boltzmann model (LBM) computations are carried out via linear operations at discrete lattice sites, which can be implemented efficiently using a GPU-based architecture. Our simulations produce results comparable to the CPU version while improving performance by an order of magnitude. We have demonstrated that the GPU is well suited for interactive simulations in many applications, including simulating fire, smoke, lightweight objects in wind, jellyfish swimming in water, and heat shimmering and mirage (using the hybrid thermal LBM). We further advocate the use of a GPU cluster for large scale LBM simulations and for high performance computing. The Stony Brook Visual Computing Cluster has been the platform for several applications, including simulations of real-time plume dispersion in complex urban environments and thermal fluid dynamics in a pressurized water reactor. Major GPU vendors have been targeting the high performance computing market with GPU hardware implementations. Software toolkits such as NVIDIA CUDA provide a convenient development platform that abstracts the GPU and allows access to its underlying stream computing architecture. However, software programming for a GPU cluster remains a challenging task. We have therefore developed the Zippy framework to simplify GPU cluster programming. Zippy is based on global arrays combined with the stream programming model and it hides the low-level details of the

  13. GPU-BSM: a GPU-based tool to map bisulfite-treated reads.

    Directory of Open Access Journals (Sweden)

    Andrea Manconi

    Full Text Available Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.

  14. Impact of memory bottleneck on the performance of graphics processing units

    Science.gov (United States)

    Son, Dong Oh; Choi, Hong Jun; Kim, Jong Myon; Kim, Cheol Hong

    2015-12-01

    Recent graphics processing units (GPUs) can process general-purpose applications as well as graphics applications with the help of various user-friendly application programming interfaces (APIs) supported by GPU vendors. Unfortunately, utilizing the hardware resource in the GPU efficiently is a challenging problem, since the GPU architecture is totally different to the traditional CPU architecture. To solve this problem, many studies have focused on the techniques for improving the system performance using GPUs. In this work, we analyze the GPU performance varying GPU parameters such as the number of cores and clock frequency. According to our simulations, the GPU performance can be improved by 125.8% and 16.2% on average as the number of cores and clock frequency increase, respectively. However, the performance is saturated when memory bottleneck problems incur due to huge data requests to the memory. The performance of GPUs can be improved as the memory bottleneck is reduced by changing GPU parameters dynamically.

  15. The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography.

    Science.gov (United States)

    Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie

    2010-09-13

    In molecular imaging (MI), especially the optical molecular imaging, bioluminescence tomography (BLT) emerges as an effective imaging modality for small animal imaging. The finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although the multi-thread CPU technology and the multi CPU technology have already been applied. In this paper, we for the first time introduce a new kind of acceleration technology to accelerate the AFE framework for BLT, using the graphics processing unit (GPU). Besides the processing speed, the GPU technology can get a balance between the cost and performance. The CUBLAS and CULA are two main important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code on NVIDIA GPU and there is no need to worry about the details about the hardware environment of a specific GPU. The numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS and CULA based GPU acceleration. From the results of the experiments, we can reach the conclusion that the proposed CUBLAS and CULA based GPU acceleration method can improve the processing speed of the AFE framework very much while getting a balance between cost and performance.

  16. GPU Computing Gems Emerald Edition

    CERN Document Server

    Hwu, Wen-mei W

    2011-01-01

    ".the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010 Graphics processing units (GPUs) can do much more than render graphics. Scientists and researchers increasingly look to GPUs to improve the efficiency and performance of computationally-intensive experiments across a range of disciplines. GPU Computing Gems: Emerald Edition brings their techniques to you, showcasing GPU-based solutions including: Black hole simulations with CUDA GPU-accelerated computation and interactive display of

  17. GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

    Science.gov (United States)

    Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

    2012-09-01

    Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC

  18. Implementation of the Lattice Boltzmann Method on Heterogeneous Hardware and Platforms using OpenCL

    Directory of Open Access Journals (Sweden)

    TEKIC, P. M.

    2012-02-01

    Full Text Available The Lattice Boltzmann method (LBM has become an alternative method for computational fluid dynamics with a wide range of applications. Besides its numerical stability and accuracy, one of the major advantages of LBM is its relatively easy parallelization and, hence, it is especially well fitted to many-core hardware as graphics processing units (GPU. The majority of work concerning LBM implementation on GPU's has used the CUDA programming model, supported exclusively by NVIDIA. Recently, the open standard for parallel programming of heterogeneous systems (OpenCL has been introduced. OpenCL standard matures and is supported on processors from most vendors. In this paper, we make use of the OpenCL framework for the lattice Boltzmann method simulation, using hardware accelerators - AMD ATI Radeon GPU, AMD Dual-Core CPU and NVIDIA GeForce GPU's. Application has been developed using a combination of Java and OpenCL programming languages. Java bindings for OpenCL have been utilized. This approach offers the benefits of hardware and operating system independence, as well as speeding up of lattice Boltzmann algorithm. It has been showed that the developed lattice Boltzmann source code can be executed without modification on all of the used hardware accelerators. Performance results have been presented and compared for the hardware accelerators that have been utilized.

  19. GPU applications for data processing

    Energy Technology Data Exchange (ETDEWEB)

    Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); Aleksandrov, Andrey [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); INFN sezione di Napoli, I-80125 Napoli (Italy); Tioukov, Valeri [INFN sezione di Napoli, I-80125 Napoli (Italy)

    2015-12-31

    Modern experiments that use nuclear photoemulsion imply fast and efficient data acquisition from the emulsion can be performed. The new approaches in developing scanning systems require real-time processing of large amount of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how the GPU-accelerated emulsion processing helped us to rise the scanning speed by factor of nine.

  20. The gputools package enables GPU computing in R.

    Science.gov (United States)

    Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan

    2010-01-01

    By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an Nvidia GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu

  1. Spins Dynamics in a Dissipative Environment: Hierarchal Equations of Motion Approach Using a Graphics Processing Unit (GPU).

    Science.gov (United States)

    Tsuchimoto, Masashi; Tanimura, Yoshitaka

    2015-08-11

    A system with many energy states coupled to a harmonic oscillator bath is considered. To study quantum non-Markovian system-bath dynamics numerically rigorously and nonperturbatively, we developed a computer code for the reduced hierarchy equations of motion (HEOM) for a graphics processor unit (GPU) that can treat the system as large as 4096 energy states. The code employs a Padé spectrum decomposition (PSD) for a construction of HEOM and the exponential integrators. Dynamics of a quantum spin glass system are studied by calculating the free induction decay signal for the cases of 3 × 2 to 3 × 4 triangular lattices with antiferromagnetic interactions. We found that spins relax faster at lower temperature due to transitions through a quantum coherent state, as represented by the off-diagonal elements of the reduced density matrix, while it has been known that the spins relax slower due to suppression of thermal activation in a classical case. The decay of the spins are qualitatively similar regardless of the lattice sizes. The pathway of spin relaxation is analyzed under a sudden temperature drop condition. The Compute Unified Device Architecture (CUDA) based source code used in the present calculations is provided as Supporting Information .

  2. Modeling of Tsunami Equations and Atmospheric Swirling Flows with a Graphics Processing Unit (GPU) and Radial Basis Functions (RBF)

    Science.gov (United States)

    Schmidt, J.; Piret, C.; Zhang, N.; Kadlec, B. J.; Liu, Y.; Yuen, D. A.; Wright, G. B.; Sevre, E. O.

    2008-12-01

    The faster growth curves in the speed of GPUs relative to CPUs in recent years and its rapidly gained popularity has spawned a new area of development in computational technology. There is much potential in utilizing GPUs for solving evolutionary partial differential equations and producing the attendant visualization. We are concerned with modeling tsunami waves, where computational time is of extreme essence, for broadcasting warnings. In order to test the efficacy of the GPU on the set of shallow-water equations, we employed the NVIDIA board 8600M GT on a MacBook Pro. We have compared the relative speeds between the CPU and the GPU on a single processor for two types of spatial discretization based on second-order finite-differences and radial basis functions. RBFs are a more novel method based on a gridless and a multi- scale, adaptive framework. Using the NVIDIA 8600M GT, we received a speed up factor of 8 in favor of GPU for the finite-difference method and a factor of 7 for the RBF scheme. We have also studied the atmospheric dynamics problem of swirling flows over a spherical surface and found a speed-up of 5.3 using the GPU. The time steps employed for the RBF method are larger than those used in finite-differences, because of the much fewer number of nodal points needed by RBF. Thus, in modeling the same physical time, RBF acting in concert with GPU would be the fastest way to go.

  3. Graphics Processing Units (GPU) and the Goddard Earth Observing System atmospheric model (GEOS-5): Implementation and Potential Applications

    Science.gov (United States)

    Putnam, William M.

    2011-01-01

    Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breath-taking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions

  4. SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU

    Energy Technology Data Exchange (ETDEWEB)

    Moriya, S; Sato, M [Komazawa University, Setagaya, Tokyo (Japan); Tachibana, H [National Cancer Center Hospital East, Kashiwa, Chiba (Japan)

    2015-06-15

    Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated to accelerate the convolution calculation using an ideal kernel on the Graphic Processing Units (GPU). Methods: The calculation was performed on the AMD graphics hardware of Dual FirePro D700 and our algorithm was implemented using the Aparapi that convert Java bytecode to OpenCL. The process of dose calculation was separated with the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation running on the central processing unit (CPU) of Intel Xeon E5, the calculation loops were performed for all calculation points. On the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid) and the calculation speed on the GPU to that on the CPU and the accuracy of PDD were compared. Results: The calculation time for the GPU and the CPU were 3.3 sec and 4.4 hour, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU was perfectly matched to that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical that the kernel uses a coarse spacing technique if the calculation is faster while keeping the similar accuracy to a current treatment planning system.

  5. Fast DRR splat rendering using common consumer graphics hardware

    International Nuclear Information System (INIS)

    Spoerk, Jakob; Bergmann, Helmar; Wanschitz, Felix; Dong, Shuo; Birkfellner, Wolfgang

    2007-01-01

    Digitally rendered radiographs (DRR) are a vital part of various medical image processing applications such as 2D/3D registration for patient pose determination in image-guided radiotherapy procedures. This paper presents a technique to accelerate DRR creation by using conventional graphics hardware for the rendering process. DRR computation itself is done by an efficient volume rendering method named wobbled splatting. For programming the graphics hardware, NVIDIAs C for Graphics (Cg) is used. The description of an algorithm used for rendering DRRs on the graphics hardware is presented, together with a benchmark comparing this technique to a CPU-based wobbled splatting program. Results show a reduction of rendering time by about 70%-90% depending on the amount of data. For instance, rendering a volume of 2x10 6 voxels is feasible at an update rate of 38 Hz compared to 6 Hz on a common Intel-based PC using the graphics processing unit (GPU) of a conventional graphics adapter. In addition, wobbled splatting using graphics hardware for DRR computation provides higher resolution DRRs with comparable image quality due to special processing characteristics of the GPU. We conclude that DRR generation on common graphics hardware using the freely available Cg environment is a major step toward 2D/3D registration in clinical routine

  6. Multi-GPU implementation of a VMAT treatment plan optimization algorithm

    International Nuclear Information System (INIS)

    Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B.; Peng, Fei

    2015-01-01

    Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is

  7. Multi-GPU implementation of a VMAT treatment plan optimization algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun; Jia, Xun, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Jiang, Steve B., E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu [Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas 75390 (United States); Peng, Fei [Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 (United States)

    2015-06-15

    Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is

  8. Acceleration of PIC simulation with GPU

    International Nuclear Information System (INIS)

    Suzuki, Junya; Shimazu, Hironori; Fukazawa, Keiichiro; Den, Mitsue

    2011-01-01

    Particle-in-cell (PIC) is a simulation technique for plasma physics. The large number of particles in high-resolution plasma simulation increases the volume computation required, making it vital to increase computation speed. In this study, we attempt to accelerate computation speed on graphics processing units (GPUs) using KEMPO, a PIC simulation code package. We perform two tests for benchmarking, with small and large grid sizes. In these tests, we run KEMPO1 code using a CPU only, both a CPU and a GPU, and a GPU only. The results showed that performance using only a GPU was twice that of using a CPU alone. While, execution time for using both a CPU and GPU is comparable to the tests with a CPU alone, because of the significant bottleneck in communication between the CPU and GPU. (author)

  9. Graph coarsening and clustering on the GPU

    NARCIS (Netherlands)

    Fagginger Auer, B.O.; Bisseling, R.H.

    2013-01-01

    Agglomerative clustering is an effective greedy way to quickly generate graph clusterings of high modularity in a small amount of time. In an effort to use the power offered by multi-core CPU and GPU hardware to solve the clustering problem, we introduce a fine-grained sharedmemory parallel graph

  10. Distributed GPU Computing in GIScience

    Science.gov (United States)

    Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.

    2013-12-01

    Transactions on, 9(3), 378-394. 2. Li, J., Jiang, Y., Yang, C., Huang, Q., & Rice, M. (2013). Visualizing 3D/4D Environmental Data Using Many-core Graphics Processing Units (GPUs) and Multi-core Central Processing Units (CPUs). Computers & Geosciences, 59(9), 78-89. 3. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.

  11. CMFD and GPU acceleration on method of characteristics for hexagonal cores

    International Nuclear Information System (INIS)

    Han, Yu; Jiang, Xiaofeng; Wang, Dezhong

    2014-01-01

    Highlights: • A merged hex-mesh CMFD method solved via tri-diagonal matrix inversion. • Alternative hardware acceleration of using inexpensive GPU. • A hex-core benchmark with solution to confirm two acceleration methods. - Abstract: Coarse Mesh Finite Difference (CMFD) has been widely adopted as an effective way to accelerate the source iteration of transport calculation. However in a core with hexagonal assemblies there are non-hexagonal meshes around the edges of assemblies, causing a problem for CMFD if the CMFD equations are still to be solved via tri-diagonal matrix inversion by simply scanning the whole core meshes in different directions. To solve this problem, we propose an unequal mesh CMFD formulation that combines the non-hexagonal cells on the boundary of neighboring assemblies into non-regular hexagonal cells. We also investigated the alternative hardware acceleration of using graphics processing units (GPU) with graphics card in a personal computer. The tool CUDA is employed, which is a parallel computing platform and programming model invented by the company NVIDIA for harnessing the power of GPU. To investigate and implement these two acceleration methods, a 2-D hexagonal core transport code using the method of characteristics (MOC) is developed. A hexagonal mini-core benchmark problem is established to confirm the accuracy of the MOC code and to assess the effectiveness of CMFD and GPU parallel acceleration. For this benchmark problem, the CMFD acceleration increases the speed 16 times while the GPU acceleration speeds it up 25 times. When used simultaneously, they provide a speed gain of 292 times

  12. CMFD and GPU acceleration on method of characteristics for hexagonal cores

    Energy Technology Data Exchange (ETDEWEB)

    Han, Yu, E-mail: hanyu1203@gmail.com [School of Nuclear Science and Engineering, Shanghai Jiaotong University, Shanghai 200240 (China); Jiang, Xiaofeng [Shanghai NuStar Nuclear Power Technology Co., Ltd., No. 81 South Qinzhou Road, XuJiaHui District, Shanghai 200000 (China); Wang, Dezhong [School of Nuclear Science and Engineering, Shanghai Jiaotong University, Shanghai 200240 (China)

    2014-12-15

    Highlights: • A merged hex-mesh CMFD method solved via tri-diagonal matrix inversion. • Alternative hardware acceleration of using inexpensive GPU. • A hex-core benchmark with solution to confirm two acceleration methods. - Abstract: Coarse Mesh Finite Difference (CMFD) has been widely adopted as an effective way to accelerate the source iteration of transport calculation. However in a core with hexagonal assemblies there are non-hexagonal meshes around the edges of assemblies, causing a problem for CMFD if the CMFD equations are still to be solved via tri-diagonal matrix inversion by simply scanning the whole core meshes in different directions. To solve this problem, we propose an unequal mesh CMFD formulation that combines the non-hexagonal cells on the boundary of neighboring assemblies into non-regular hexagonal cells. We also investigated the alternative hardware acceleration of using graphics processing units (GPU) with graphics card in a personal computer. The tool CUDA is employed, which is a parallel computing platform and programming model invented by the company NVIDIA for harnessing the power of GPU. To investigate and implement these two acceleration methods, a 2-D hexagonal core transport code using the method of characteristics (MOC) is developed. A hexagonal mini-core benchmark problem is established to confirm the accuracy of the MOC code and to assess the effectiveness of CMFD and GPU parallel acceleration. For this benchmark problem, the CMFD acceleration increases the speed 16 times while the GPU acceleration speeds it up 25 times. When used simultaneously, they provide a speed gain of 292 times.

  13. A hardware-software system for the automation of verification and calibration of oil metering units secondary equipment

    Science.gov (United States)

    Boyarnikov, A. V.; Boyarnikova, L. V.; Kozhushko, A. A.; Sekachev, A. F.

    2017-08-01

    In the article the process of verification (calibration) of oil metering units secondary equipment is considered. The purpose of the work is to increase the reliability and reduce the complexity of this process by developing a software and hardware system that provides automated verification and calibration. The hardware part of this complex carries out the commutation of the measuring channels of the verified controller and the reference channels of the calibrator in accordance with the introduced algorithm. The developed software allows controlling the commutation of channels, setting values on the calibrator, reading the measured data from the controller, calculating errors and compiling protocols. This system can be used for checking the controllers of the secondary equipment of the oil metering units in the automatic verification mode (with the open communication protocol) or in the semi-automatic verification mode (without it). The peculiar feature of the approach used is the development of a universal signal switch operating under software control, which can be configured for various verification methods (calibration), which allows to cover the entire range of controllers of metering units secondary equipment. The use of automatic verification with the help of a hardware and software system allows to shorten the verification time by 5-10 times and to increase the reliability of measurements, excluding the influence of the human factor.

  14. GAMER: A GRAPHIC PROCESSING UNIT ACCELERATED ADAPTIVE-MESH-REFINEMENT CODE FOR ASTROPHYSICS

    International Nuclear Information System (INIS)

    Schive, H.-Y.; Tsai, Y.-C.; Chiueh Tzihong

    2010-01-01

    We present the newly developed code, GPU-accelerated Adaptive-MEsh-Refinement code (GAMER), which adopts a novel approach in improving the performance of adaptive-mesh-refinement (AMR) astrophysical simulations by a large factor with the use of the graphic processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing total variation diminishing scheme for the hydrodynamic solver and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented in GPU, by which hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between the CPU and GPU is carefully reduced by utilizing the capability of asynchronous memory copies in GPU, and the computing time of the ghost-zone values for each patch is diminished by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster system. We measure the performance of the code by performing purely baryonic cosmological simulations in different hardware implementations, in which detailed timing analyses provide comparison between the computations with and without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using one GPU with 4096 3 effective resolution and 16 GPUs with 8192 3 effective resolution, respectively.

  15. Accelerating epistasis analysis in human genetics with consumer graphics hardware

    Directory of Open Access Journals (Sweden)

    Cancare Fabio

    2009-07-01

    Full Text Available Abstract Background Human geneticists are now capable of measuring more than one million DNA sequence variations from across the human genome. The new challenge is to develop computationally feasible methods capable of analyzing these data for associations with common human disease, particularly in the context of epistasis. Epistasis describes the situation where multiple genes interact in a complex non-linear manner to determine an individual's disease risk and is thought to be ubiquitous for common diseases. Multifactor Dimensionality Reduction (MDR is an algorithm capable of detecting epistasis. An exhaustive analysis with MDR is often computationally expensive, particularly for high order interactions. This challenge has previously been met with parallel computation and expensive hardware. The option we examine here exploits commodity hardware designed for computer graphics. In modern computers Graphics Processing Units (GPUs have more memory bandwidth and computational capability than Central Processing Units (CPUs and are well suited to this problem. Advances in the video game industry have led to an economy of scale creating a situation where these powerful components are readily available at very low cost. Here we implement and evaluate the performance of the MDR algorithm on GPUs. Of primary interest are the time required for an epistasis analysis and the price to performance ratio of available solutions. Findings We found that using MDR on GPUs consistently increased performance per machine over both a feature rich Java software package and a C++ cluster implementation. The performance of a GPU workstation running a GPU implementation reduces computation time by a factor of 160 compared to an 8-core workstation running the Java implementation on CPUs. This GPU workstation performs similarly to 150 cores running an optimized C++ implementation on a Beowulf cluster. Furthermore this GPU system provides extremely cost effective

  16. Accelerating epistasis analysis in human genetics with consumer graphics hardware.

    Science.gov (United States)

    Sinnott-Armstrong, Nicholas A; Greene, Casey S; Cancare, Fabio; Moore, Jason H

    2009-07-24

    Human geneticists are now capable of measuring more than one million DNA sequence variations from across the human genome. The new challenge is to develop computationally feasible methods capable of analyzing these data for associations with common human disease, particularly in the context of epistasis. Epistasis describes the situation where multiple genes interact in a complex non-linear manner to determine an individual's disease risk and is thought to be ubiquitous for common diseases. Multifactor Dimensionality Reduction (MDR) is an algorithm capable of detecting epistasis. An exhaustive analysis with MDR is often computationally expensive, particularly for high order interactions. This challenge has previously been met with parallel computation and expensive hardware. The option we examine here exploits commodity hardware designed for computer graphics. In modern computers Graphics Processing Units (GPUs) have more memory bandwidth and computational capability than Central Processing Units (CPUs) and are well suited to this problem. Advances in the video game industry have led to an economy of scale creating a situation where these powerful components are readily available at very low cost. Here we implement and evaluate the performance of the MDR algorithm on GPUs. Of primary interest are the time required for an epistasis analysis and the price to performance ratio of available solutions. We found that using MDR on GPUs consistently increased performance per machine over both a feature rich Java software package and a C++ cluster implementation. The performance of a GPU workstation running a GPU implementation reduces computation time by a factor of 160 compared to an 8-core workstation running the Java implementation on CPUs. This GPU workstation performs similarly to 150 cores running an optimized C++ implementation on a Beowulf cluster. Furthermore this GPU system provides extremely cost effective performance while leaving the CPU available for other

  17. GPU computing and applications

    CERN Document Server

    See, Simon

    2015-01-01

    This book presents a collection of state of the art research on GPU Computing and Application. The major part of this book is selected from the work presented at the 2013 Symposium on GPU Computing and Applications held in Nanyang Technological University, Singapore (Oct 9, 2013). Three major domains of GPU application are covered in the book including (1) Engineering design and simulation; (2) Biomedical Sciences; and (3) Interactive & Digital Media. The book also addresses the fundamental issues in GPU computing with a focus on big data processing. Researchers and developers in GPU Computing and Applications will benefit from this book. Training professionals and educators can also benefit from this book to learn the possible application of GPU technology in various areas.

  18. Simulating spin models on GPU

    Science.gov (United States)

    Weigel, Martin

    2011-09-01

    Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.

  19. High-speed, multi-input, multi-output control using GPU processing in the HBT-EP tokamak

    Energy Technology Data Exchange (ETDEWEB)

    Rath, N., E-mail: Nikolaus@rath.org [Columbia University, Rm 200 Mudd, 500 W 120th St, New York, NY - 10027 (United States); Bialek, J.; Byrne, P.J.; DeBono, B.; Levesque, J.P.; Li, B.; Mauel, M.E.; Maurer, D.A.; Navratil, G.A.; Shiraki, D. [Columbia University, Rm 200 Mudd, 500 W 120th St, New York, NY - 10027 (United States)

    2012-12-15

    Highlights: Black-Right-Pointing-Pointer We present a GPU based system for magnetic control of perturbed equilibria. Black-Right-Pointing-Pointer Cycle times are below 5 {mu}s and I/O latencies below 10 {mu}s for 96 inputs and 64 outputs. Black-Right-Pointing-Pointer A new architecture removes host RAM and CPU from the control cycle. Black-Right-Pointing-Pointer GPU and DA/AD modules operate independently and communicate via PCIe peer-to-peer connections. Black-Right-Pointing-Pointer The Linux host system does not require real-time extensions. - Abstract: We report on the design of a new plasma control system for the HBT-EP tokamak that utilizes a graphical processing unit (GPU) to magnetically control the 3D perturbed equilibrium state [1] of the plasma. The control system achieves cycle times of 5 {mu}s and I/O latencies below 10 {mu}s for up to 96 inputs and 64 outputs. The number of state variables is in the same order. To handle the resulting computational complexity under the given time constraints, the control algorithms are designed for massively parallel processing. The necessary hardware resources are provided by an NVIDIA Tesla M2050 GPU, offering a total of 448 computing cores running at 1.3 GHz each. A new control architecture allows control input from magnetic diagnostics to be pushed directly into GPU memory by a D-TACQ ACQ196 digitizer, and control output to be pulled directly from GPU memory by two D-TACQ AO32 analog output modules. By using peer-to-peer PCI express connections, this technique completely eliminates the use of host RAM and central processing unit (CPU) from the control cycle, permitting single-digit microsecond latencies on a standard Linux host system without any real-time extensions.

  20. Seismic Shot Processing on GPU

    OpenAIRE

    Johansen, Owe

    2009-01-01

    Today s petroleum industry demand an ever increasing amount of compu- tational resources. Seismic processing applications in use by these types of companies have generally been using large clusters of compute nodes, whose only computing resource has been the CPU. However, using Graphics Pro- cessing Units (GPU) for general purpose programming is these days becoming increasingly more popular in the high performance computing area. In 2007, NVIDIA corporation launched their framework for develo...

  1. GPU - Accelerated Monte Carlo electron transport methods: development and application for radiation dose calculations using 6 GPU cards

    International Nuclear Information System (INIS)

    Su, L.; Du, X.; Liu, T.; Xu, X. G.

    2013-01-01

    An electron-photon coupled Monte Carlo code ARCHER - Accelerated Radiation-transport Computations in Heterogeneous EnviRonments - is being developed at Rensselaer Polytechnic Institute as a software test-bed for emerging heterogeneous high performance computers that utilize accelerators such as GPUs (Graphics Processing Units). This paper presents the preliminary code development and the testing involving radiation dose related problems. In particular, the paper discusses the electron transport simulations using the class-II condensed history method. The considered electron energy ranges from a few hundreds of keV to 30 MeV. As for photon part, photoelectric effect, Compton scattering and pair production were simulated. Voxelized geometry was supported. A serial CPU (Central Processing Unit)code was first written in C++. The code was then transplanted to the GPU using the CUDA C 5.0 standards. The hardware involved a desktop PC with an Intel Xeon X5660 CPU and six NVIDIA Tesla M2090 GPUs. The code was tested for a case of 20 MeV electron beam incident perpendicularly on a water-aluminum-water phantom. The depth and later dose profiles were found to agree with results obtained from well tested MC codes. Using six GPU cards, 6*10 6 electron histories were simulated within 2 seconds. In comparison, the same case running the EGSnrc and MCNPX codes required 1645 seconds and 9213 seconds, respectively. On-going work continues to test the code for different medical applications such as radiotherapy and brachytherapy. (authors)

  2. Space Station Freedom electrical power system hardware commonality with the United States Polar Platform

    Science.gov (United States)

    Rieker, Lorra L.; Haraburda, Francis M.

    1989-01-01

    Information is presented on how the concept of commonality is being implemented with respect to electric power system hardware for the Space Station Freedom and the U.S. Polar Platform. Included is a historical account of the candidate common items which have the potential to serve the same power system functions on both Freedom and the Polar Platform. The Space Station program and objectives are described, focusing on the test and development responsibilities. The program definition and preliminary design phase and the design and development phase are discussed. The goal of this work is to reduce the program cost.

  3. Application of GPU to computational multiphase fluid dynamics

    International Nuclear Information System (INIS)

    Nagatake, T; Kunugi, T

    2010-01-01

    The MARS (Multi-interfaces Advection and Reconstruction Solver) [1] is one of the surface volume tracking methods for multi-phase flows. Nowadays, the performance of GPU (Graphics Processing Unit) is much higher than the CPU (Central Processing Unit). In this study, the GPU was applied to the MARS in order to accelerate the computation of multi-phase flows (GPU-MARS), and the performance of the GPU-MARS was discussed. From the performance of the interface tracking method for the analyses of one-directional advection problem, it is found that the computing time of GPU(single GTX280) was around 4 times faster than that of the CPU (Xeon 5040, 4 threads parallelized). From the performance of Poisson Solver by using the algorithm developed in this study, it is found that the performance of the GPU showed around 30 times faster than that of the CPU. Finally, it is confirmed that the GPU showed the large acceleration of the fluid flow computation (GPU-MARS) compared to the CPU. However, it is also found that the double-precision computation of the GPU must perform with very high precision.

  4. Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU

    Science.gov (United States)

    Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan

    2013-01-01

    This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507

  5. Hardware-oriented reliability centered maintenance for the diesel generators of Wolsong unit 1

    International Nuclear Information System (INIS)

    Bae, Sang Min; Park, Jin Hee; Kim, Tae Woon; Lee, Yoon Kee; Song, Jin Bae

    1997-01-01

    The DGs (Diesel Generators) in NPP (Nuclear Power Plant) has been used for the emergency electric power source to shut down the nuclear reactor safely in case of station blackout. The RCM (Reliability Centered Maintenance) has been applied to DGs for increasing the safety of NPP. The structured defects of DG were not remedied by the improvement of maintenance method. As the first stage of RCM, to find the structured defects, its failure modes were searched and analyzed through the ten year maintenance information. The structured defects such as the air compressor, the lubricating oil pressure, and the insufficient load were the root causes of main failures. The air reservoir reinstallation, the lubricating oil tube modification, the load bank installation, and the qualitative instrumentation were the solutions for the hardware oriented RCM of DGs. There remains the software oriented RCM such as the rejection of useless maintenance, the preventive maintenance, the database of maintenance information, and the predictive maintenance

  6. NMF-mGPU: non-negative matrix factorization on multi-GPU systems.

    Science.gov (United States)

    Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto

    2015-02-13

    In the last few years, the Non-negative Matrix Factorization ( NMF ) technique has gained a great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessors cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units ( GPUs ). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA ( Compute Unified Device Architecture ), the NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI ( Message Passing Interface ). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the

  7. Using the CPU and GPU for real-time video enhancement on a mobile computer

    CSIR Research Space (South Africa)

    Bachoo, AK

    2010-09-01

    Full Text Available . In this paper, the current advances in mobile CPU and GPU hardware are used to implement video enhancement algorithms in a new way on a mobile computer. Both the CPU and GPU are used effectively to achieve realtime performance for complex image enhancement...

  8. Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

    Science.gov (United States)

    2015-06-01

    implementation of the direct interaction called particle-to-particle kernel for a shared-memory single GPU device using the Compute Unified Device Architecture ...GPU-defined P2P kernel we developed using the Compute Unified Device Architecture (CUDA).9 A brief outline of the rest of this work follows. The...Employed The computing environment used for this work is a 64-node heterogeneous cluster consisting of 48 IBM dx360M4 nodes, each with one Intel Phi

  9. Validation of GPU based TomoTherapy dose calculation engine.

    Science.gov (United States)

    Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond

    2012-04-01

    The graphic processing unit (GPU) based TomoTherapy convolution/superposition(C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. In order for the commercial release of the GPU dose engine, its accuracy has to be validated. Thirty eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified with the absolute point dose measurement with ion chamber and film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantom and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) engine also showed similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.

  10. GPU Accelerated Vector Median Filter

    Science.gov (United States)

    Aras, Rifat; Shen, Yuzhong

    2011-01-01

    Noise reduction is an important step for most image processing tasks. For three channel color images, a widely used technique is vector median filter in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n(sup 2) vectors has to be compared with other n(sup 2) - 1 vectors in distances. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work. NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering. which has to the best of our knowledge never been done before. The performance of GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes, Initial findings of the study showed 100x improvement of performance of vector median filter implementation on GPUs over CPU implementations and further speed-up is expected after more extensive optimizations of the GPU algorithm .

  11. GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform

    Directory of Open Access Journals (Sweden)

    Ronglin Jiang

    2014-01-01

    Full Text Available This paper introduces a (finite difference time domain FDTD code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI and Open Multiprocessing (OpenMP. Since both Central Processing Unit (CPU and Graphics Processing Unit (GPU resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with the ones of MoM. Results show that there is a well agreement between them.

  12. High energy electromagnetic particle transportation on the GPU

    Energy Technology Data Exchange (ETDEWEB)

    Canal, P. [Fermilab; Elvira, D. [Fermilab; Jun, S. Y. [Fermilab; Kowalkowski, J. [Fermilab; Paterno, M. [Fermilab; Apostolakis, J. [CERN

    2014-01-01

    We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.

  13. DEM GPU studies of industrial scale particle simulations for granular flow civil engineering applications

    Science.gov (United States)

    Pizette, Patrick; Govender, Nicolin; Wilke, Daniel N.; Abriak, Nor-Edine

    2017-06-01

    The use of the Discrete Element Method (DEM) for industrial civil engineering industrial applications is currently limited due to the computational demands when large numbers of particles are considered. The graphics processing unit (GPU) with its highly parallelized hardware architecture shows potential to enable solution of civil engineering problems using discrete granular approaches. We demonstrate in this study the pratical utility of a validated GPU-enabled DEM modeling environment to simulate industrial scale granular problems. As illustration, the flow discharge of storage silos using 8 and 17 million particles is considered. DEM simulations have been performed to investigate the influence of particle size (equivalent size for the 20/40-mesh gravel) and induced shear stress for two hopper shapes. The preliminary results indicate that the shape of the hopper significantly influences the discharge rates for the same material. Specifically, this work shows that GPU-enabled DEM modeling environments can model industrial scale problems on a single portable computer within a day for 30 seconds of process time.

  14. GPU-Based FFT Computation for Multi-Gigabit WirelessHD Baseband Processing

    Directory of Open Access Journals (Sweden)

    Nicholas Hinitt

    2010-01-01

    Full Text Available The next generation Graphics Processing Units (GPUs are being considered for non-graphics applications. Millimeter wave (60 Ghz wireless networks that are capable of multi-gigabit per second (Gbps transfer rates require a significant baseband throughput. In this work, we consider the baseband of WirelessHD, a 60 GHz communications system, which can provide a data rate of up to 3.8 Gbps over a short range wireless link. Thus, we explore the feasibility of achieving gigabit baseband throughput using the GPUs. One of the most computationally intensive functions commonly used in baseband communications, the Fast Fourier Transform (FFT algorithm, is implemented on an NVIDIA GPU using their general-purpose computing platform called the Compute Unified Device Architecture (CUDA. The paper, first, investigates the implementation of an FFT algorithm using the GPU hardware and exploiting the computational capability available. It then outlines the limitations discovered and the methods used to overcome these challenges. Finally a new algorithm to compute FFT is proposed, which reduces interprocessor communication. It is further optimized by improving memory access, enabling the processing rate to exceed 4 Gbps, achieving a processing time of a 512-point FFT in less than 200 ns using a two-GPU solution.

  15. Development of a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI)

    International Nuclear Information System (INIS)

    Huang Bormin; Mielikainen, Jarno; Oh, Hyunjong; Allen Huang, Hung-Lung

    2011-01-01

    Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundreds. To take the full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is very suitable for the GPU implementation to take advantage of the hardware's efficiency and parallelism where radiances of many channels can be calculated in parallel in GPUs. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI) launched in 2006 onboard the first European meteorological polar-orbiting satellites, METOP-A. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module for computing the regression predictors takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs with total 960 compute cores, delivering near 4 TFlops theoretical peak performance. By massively parallelizing the second and third modules, we reached 364x

  16. GPU: the biggest key processor for AI and parallel processing

    Science.gov (United States)

    Baji, Toru

    2017-07-01

    Two types of processors exist in the market. One is the conventional CPU and the other is Graphic Processor Unit (GPU). Typical CPU is composed of 1 to 8 cores while GPU has thousands of cores. CPU is good for sequential processing, while GPU is good to accelerate software with heavy parallel executions. GPU was initially dedicated for 3D graphics. However from 2006, when GPU started to apply general-purpose cores, it was noticed that this architecture can be used as a general purpose massive-parallel processor. NVIDIA developed a software framework Compute Unified Device Architecture (CUDA) that make it possible to easily program the GPU for these application. With CUDA, GPU started to be used in workstations and supercomputers widely. Recently two key technologies are highlighted in the industry. The Artificial Intelligence (AI) and Autonomous Driving Cars. AI requires a massive parallel operation to train many-layers of neural networks. With CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For the autonomous driving cars, TOPS class of performance is required to implement perception, localization, path planning processing and again SoC with integrated GPU will play a key role there. In this paper, the evolution of the GPU which is one of the biggest commercial devices requiring state-of-the-art fabrication technology will be introduced. Also overview of the GPU demanding key application like the ones described above will be introduced.

  17. GPU in Physics Computation: Case Geant4 Navigation

    CERN Document Server

    Seiskari, Otto; Niemi, Tapio

    2012-01-01

    General purpose computing on graphic processing units (GPU) is a potential method of speeding up scientific computation with low cost and high energy efficiency. We experimented with the particle physics simulation toolkit Geant4 used at CERN to benchmark its geometry navigation functionality on a GPU. The goal was to find out whether Geant4 physics simulations could benefit from GPU acceleration and how difficult it is to modify Geant4 code to run in a GPU. We ported selected parts of Geant4 code to C99 & CUDA and implemented a simple gamma physics simulation utilizing this code to measure efficiency. The performance of the program was tested by running it on two different platforms: NVIDIA GeForce 470 GTX GPU and a 12-core AMD CPU system. Our conclusion was that GPUs can be a competitive alternate for multi-core computers but porting existing software in an efficient way is challenging.

  18. GPU Acceleration of DSP for Communication Receivers.

    Science.gov (United States)

    Gunther, Jake; Gunther, Hyrum; Moon, Todd

    2017-09-01

    Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data rate communication receivers including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on NVIDIA GeForce GTX 1080 graphics card.

  19. Quick plasma equilibrium reconstruction based on GPU

    International Nuclear Information System (INIS)

    Xiao Bingjia; Huang, Y.; Luo, Z.P.; Yuan, Q.P.; Lao, L.

    2014-01-01

    A parallel code named P-EFIT which could complete an equilibrium reconstruction iteration in 250 μs is described. It is built with the CUDA TM architecture by using Graphical Processing Unit (GPU). It is described for the optimization of middle-scale matrix multiplication on GPU and an algorithm which could solve block tri-diagonal linear system efficiently in parallel. Benchmark test is conducted. Static test proves the accuracy of the P-EFIT and simulation-test proves the feasibility of using P-EFIT for real-time reconstruction on 65x65 computation grids. (author)

  20. GPU Pro 4 advanced rendering techniques

    CERN Document Server

    Engel, Wolfgang

    2013-01-01

    GPU Pro4: Advanced Rendering Techniques presents ready-to-use ideas and procedures that can help solve many of your day-to-day graphics programming challenges. Focusing on interactive media and games, the book covers up-to-date methods producing real-time graphics. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Sebastien St-Laurent have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book begins with discussions on the abi

  1. GPU Pro 5 advanced rendering techniques

    CERN Document Server

    Engel, Wolfgang

    2014-01-01

    In GPU Pro5: Advanced Rendering Techniques, section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Marius Bjorge have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book covers rendering, lighting, effects in image space, mobile devices, 3D engine design, and compute. It explores rasterization of liquids, ray tracing of art assets that would otherwise be used in a rasterized engine, physically based area lights, volumetric light

  2. High-Performance Matrix-Vector Multiplication on the GPU

    DEFF Research Database (Denmark)

    Sørensen, Hans Henrik Brandenborg

    2012-01-01

    In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing...

  3. NAVIER-STOKES EM GPU

    OpenAIRE

    ALEX LAIER BORDIGNON

    2006-01-01

    Nesse trabalho, mostramos como simular um fluido em duas dimensões em um domínio com fronteiras arbitrárias. Nosso trabalho é baseado no esquema stable fluids desenvolvido por Joe Stam. A implementação é feita na GPU (Graphics Processing Unit), permitindo velocidade de interação com o fluido. Fazemos uso da linguagem Cg (C for Graphics), desenvolvida pela companhia NVidia. Nossas principais contribuições são o tratamento das múltiplas fronteiras, o...

  4. GPU-accelerated computation of electron transfer.

    Science.gov (United States)

    Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco

    2012-11-05

    Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. Copyright © 2012 Wiley Periodicals, Inc.

  5. Parallel GPU implementation of PWR reactor burnup

    International Nuclear Information System (INIS)

    Heimlich, A.; Silva, F.C.; Martinez, A.S.

    2016-01-01

    Highlights: • Three GPU algorithms used to evaluate the burn-up in a PWR reactor. • Exhibit speed improvement exceeding 200 times over the sequential. • The C++ container is expansible to accept new nuclides chains. - Abstract: This paper surveys three methods, implemented for multi-core CPU and graphic processor unit (GPU), to evaluate the fuel burn-up in a pressurized light water nuclear reactor (PWR) using the solutions of a large system of coupled ordinary differential equations. The reactor physics simulation of a PWR reactor spends a long execution time with burnup calculations, so performance improvement using GPU can imply in better core design and thus extended fuel life cycle. The results of this study exhibit speed improvement exceeding 200 times over the sequential solver, within 1% accuracy.

  6. Development of a prototype chest digital tomosynthesis (CDT) R/F system with fast image reconstruction using graphics processing unit (GPU) programming

    Energy Technology Data Exchange (ETDEWEB)

    Choi, Sunghoon, E-mail: choi.sh@yonsei.ac.kr [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Lee, Seungwan [Department of Radiological Science, College of Medical Science, Konyang University, 158 Gwanjeodong-ro, Daejeon, 308-812 (Korea, Republic of); Lee, Haenghwa [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Lee, Donghoon; Choi, Seungyeon [Department of Radiation Convergence Engineering, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Shin, Jungwook [LISTEM Corporation, 94 Donghwagongdan-ro, Munmak-eup, Wonju (Korea, Republic of); Seo, Chang-Woo [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Kim, Hee-Joung, E-mail: hjk1@yonsei.ac.kr [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Department of Radiation Convergence Engineering, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of)

    2017-03-11

    Digital tomosynthesis offers the advantage of low radiation doses compared to conventional computed tomography (CT) by utilizing small numbers of projections (~80) acquired over a limited angular range. It produces 3D volumetric data, although there are artifacts due to incomplete sampling. Based upon these characteristics, we developed a prototype digital tomosynthesis R/F system for applications in chest imaging. Our prototype chest digital tomosynthesis (CDT) R/F system contains an X-ray tube with high power R/F pulse generator, flat-panel detector, R/F table, electromechanical radiographic subsystems including a precise motor controller, and a reconstruction server. For image reconstruction, users select between analytic and iterative reconstruction methods. Our reconstructed images of Catphan700 and LUNGMAN phantoms clearly and rapidly described the internal structures of phantoms using graphics processing unit (GPU) programming. Contrast-to-noise ratio (CNR) values of the CTP682 module of Catphan700 were higher in images using a simultaneous algebraic reconstruction technique (SART) than in those using filtered back-projection (FBP) for all materials by factors of 2.60, 3.78, 5.50, 2.30, 3.70, and 2.52 for air, lung foam, low density polyethylene (LDPE), Delrin{sup ®} (acetal homopolymer resin), bone 50% (hydroxyapatite), and Teflon, respectively. Total elapsed times for producing 3D volume were 2.92 s and 86.29 s on average for FBP and SART (20 iterations), respectively. The times required for reconstruction were clinically feasible. Moreover, the total radiation dose from our system (5.68 mGy) was lower than that of conventional chest CT scan. Consequently, our prototype tomosynthesis R/F system represents an important advance in digital tomosynthesis applications.

  7. Modelling multi-phase liquid-sediment scour and resuspension induced by rapid flows using Smoothed Particle Hydrodynamics (SPH) accelerated with a Graphics Processing Unit (GPU)

    Science.gov (United States)

    Fourtakas, G.; Rogers, B. D.

    2016-06-01

    A two-phase numerical model using Smoothed Particle Hydrodynamics (SPH) is applied to two-phase liquid-sediments flows. The absence of a mesh in SPH is ideal for interfacial and highly non-linear flows with changing fragmentation of the interface, mixing and resuspension. The rheology of sediment induced under rapid flows undergoes several states which are only partially described by previous research in SPH. This paper attempts to bridge the gap between the geotechnics, non-Newtonian and Newtonian flows by proposing a model that combines the yielding, shear and suspension layer which are needed to predict accurately the global erosion phenomena, from a hydrodynamics prospective. The numerical SPH scheme is based on the explicit treatment of both phases using Newtonian and the non-Newtonian Bingham-type Herschel-Bulkley-Papanastasiou constitutive model. This is supplemented by the Drucker-Prager yield criterion to predict the onset of yielding of the sediment surface and a concentration suspension model. The multi-phase model has been compared with experimental and 2-D reference numerical models for scour following a dry-bed dam break yielding satisfactory results and improvements over well-known SPH multi-phase models. With 3-D simulations requiring a large number of particles, the code is accelerated with a graphics processing unit (GPU) in the open-source DualSPHysics code. The implementation and optimisation of the code achieved a speed up of x58 over an optimised single thread serial code. A 3-D dam break over a non-cohesive erodible bed simulation with over 4 million particles yields close agreement with experimental scour and water surface profiles.

  8. GPU based acceleration of first principles calculation

    International Nuclear Information System (INIS)

    Tomono, H; Tsumuraya, K; Aoki, M; Iitaka, T

    2010-01-01

    We present a Graphics Processing Unit (GPU) accelerated simulations of first principles electronic structure calculations. The FFT, which is the most time-consuming part, is about 10 times accelerated. As the result, the total computation time of a first principles calculation is reduced to 15 percent of that of the CPU.

  9. GPU Accelerated Surgical Simulators for Complex Morhpology

    DEFF Research Database (Denmark)

    Mosegaard, Jesper; Sørensen, Thomas Sangild

    2005-01-01

    a springmass system in order to simulate a complex organ such as the heart. Computations are accelerated by taking advantage of modern graphics processing units (GPUs). Two GPU implementations are presented. They vary in their generality of spring connections and in the speedup factor they achieve...

  10. Parallelization and checkpointing of GPU applications through program transformation

    Energy Technology Data Exchange (ETDEWEB)

    Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)

    2012-01-01

    GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and

  11. Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.

    Science.gov (United States)

    Zheng, Mo; Li, Xiaoxia; Guo, Li

    2013-04-01

    Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps' C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.

  12. GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening

    Science.gov (United States)

    Maureira-Fredes, Cristián; Amaro-Seoane, Pau

    2018-01-01

    A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct-summation of N gravitational forces is a complex problem with no analytical solution and can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method. With different numerical techniques and special-purpose hardware, it can be used to speed up the calculations. But these methods tend to be computationally slow and cumbersome to work with. We present a new graphics processing unit (GPU), direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY has high modularity, allowing users to readily introduce new physics, it exploits available computational resources and will be maintained by regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 as compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.

  13. Fast Gridding on Commodity Graphics Hardware

    DEFF Research Database (Denmark)

    Sørensen, Thomas Sangild; Schaeffter, Tobias; Noe, Karsten Østergaard

    2007-01-01

    is the far most time consuming of the three steps (Table 1). Modern graphics cards (GPUs) can be utilised as a fast parallel processor provided that algorithms are reformulated in a parallel solution. The purpose of this work is to test the hypothesis, that a non-cartesian reconstruction can be efficiently...... implemented on graphics hardware giving a significant speedup compared to CPU based alternatives. We present a novel GPU implementation of the convolution step that overcomes the problems of memory bandwidth that has limited the speed of previous GPU gridding algorithms [2]....

  14. Real-Time Incompressible Fluid Simulation on the GPU

    Directory of Open Access Journals (Sweden)

    Xiao Nie

    2015-01-01

    Full Text Available We present a parallel framework for simulating incompressible fluids with predictive-corrective incompressible smoothed particle hydrodynamics (PCISPH on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of state-of-the-art GPUs. In PCISPH-based simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this time-consuming step is introduced. Moreover, we discuss several optimization techniques including using fast on-chip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of real-time fluid simulation is significantly improved since our method enforces incompressibility constraint which is typically ignored due to efficiency reason in previous GPU-based SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluid in real time and results in a speed-up factor of up to 23 on a high-end NVIDIA GPU in comparison to single-threaded CPU-based implementation.

  15. AFOSR BRI: Co-Design of Hardware/Software for Predicting MAV Aerodynamics

    Science.gov (United States)

    2016-09-27

    fold was extracted when applying architecture -aware GPU optimizations, resulting in a 371-fold speed-up. By also leveraging algorithmic innovation...mind the strengths of the underlying hardware architecture . Some examples include a block-sparse linear solver. • Characterization of performance...GPU, NVIDIA GPU, and Intel Xeon Phi . • Creation of a prototypical runtime system called CoreTSAR, short for Core Task-Size Adapting Runtime, that

  16. The ATLAS FTK Auxiliary Card: A Highly Functional VME Rear Transition Module for a Hardware Track Finding Processing Unit

    CERN Document Server

    Alison, John; The ATLAS collaboration; Bogdan, Mircea; Bryant, Patrick; Cheng, Yangyang; Krizka, Karol; Shochet, Mel; Tompkins, Lauren; Webster, Jordan S

    2014-01-01

    The ATLAS Fast TracKer is a hardware-based charged particle track finder for the High Level Trigger system of the ATLAS Experiment at the LHC. Using a multi-component system, it finds charged particle trajectories of 1 GeV/c and greater using data from the full ATLAS silicon tracking detectors at a rate of 100 kHz. Pattern recognition and preliminary track fitting are performed by VME Processing Units consisting of an Associative Memory Board containing custom associative memory chips for pattern recognition, and the Auxiliary Card (AUX), a powerful rear transition module which formats the data for pattern recognition and performs linearized fits on track candidates. We report on the design and testing of the AUX, which utilizes six FPGAs to process up to 32 Gbps of hit data, as well as fit the helical trajectory of one track candidate per nanosecond through a highly parallel track fitting architecture. Both the board and firmware design will be discussed, as well as the performance observed in tests at CERN ...

  17. TH-A-19A-09: Towards Sub-Second Proton Dose Calculation On GPU

    Energy Technology Data Exchange (ETDEWEB)

    Silva, J da [University of Cambridge, Cambridge, Cambridgeshire (United Kingdom)

    2014-06-15

    Purpose: To achieve sub-second dose calculation for clinically relevant proton therapy treatment plans. Rapid dose calculation is a key component of adaptive radiotherapy, necessary to take advantage of the better dose conformity offered by hadron therapy. Methods: To speed up proton dose calculation, the pencil beam algorithm (PBA; clinical standard) was parallelised and implemented to run on a graphics processing unit (GPU). The implementation constitutes the first PBA to run all steps on GPU, and each part of the algorithm was carefully adapted for efficiency. Monte Carlo (MC) simulations obtained using Fluka of individual beams of energies representative of the clinical range impinging on simple geometries were used to tune the PBA. For benchmarking, a typical skull base case with a spot scanning plan consisting of a total of 8872 spots divided between two beam directions of 49 energy layers each was provided by CNAO (Pavia, Italy). The calculations were carried out on an Nvidia Geforce GTX680 desktop GPU with 1536 cores running at 1006 MHz. Results: The PBA reproduced within ±3% of maximum dose results obtained from MC simulations for a range of pencil beams impinging on a water tank. Additional analysis of more complex slab geometries is currently under way to fine-tune the algorithm. Full calculation of the clinical test case took 0.9 seconds in total, with the majority of the time spent in the kernel superposition step. Conclusion: The PBA lends itself well to implementation on many-core systems such as GPUs. Using the presented implementation and current hardware, sub-second dose calculation for a clinical proton therapy plan was achieved, opening the door for adaptive treatment. The successful parallelisation of all steps of the calculation indicates that further speedups can be expected with new hardware, brightening the prospects for real-time dose calculation. This work was funded by ENTERVISION, European Commission FP7 grant 264552.

  18. TH-A-19A-09: Towards Sub-Second Proton Dose Calculation On GPU

    International Nuclear Information System (INIS)

    Silva, J da

    2014-01-01

    Purpose: To achieve sub-second dose calculation for clinically relevant proton therapy treatment plans. Rapid dose calculation is a key component of adaptive radiotherapy, necessary to take advantage of the better dose conformity offered by hadron therapy. Methods: To speed up proton dose calculation, the pencil beam algorithm (PBA; clinical standard) was parallelised and implemented to run on a graphics processing unit (GPU). The implementation constitutes the first PBA to run all steps on GPU, and each part of the algorithm was carefully adapted for efficiency. Monte Carlo (MC) simulations obtained using Fluka of individual beams of energies representative of the clinical range impinging on simple geometries were used to tune the PBA. For benchmarking, a typical skull base case with a spot scanning plan consisting of a total of 8872 spots divided between two beam directions of 49 energy layers each was provided by CNAO (Pavia, Italy). The calculations were carried out on an Nvidia Geforce GTX680 desktop GPU with 1536 cores running at 1006 MHz. Results: The PBA reproduced within ±3% of maximum dose results obtained from MC simulations for a range of pencil beams impinging on a water tank. Additional analysis of more complex slab geometries is currently under way to fine-tune the algorithm. Full calculation of the clinical test case took 0.9 seconds in total, with the majority of the time spent in the kernel superposition step. Conclusion: The PBA lends itself well to implementation on many-core systems such as GPUs. Using the presented implementation and current hardware, sub-second dose calculation for a clinical proton therapy plan was achieved, opening the door for adaptive treatment. The successful parallelisation of all steps of the calculation indicates that further speedups can be expected with new hardware, brightening the prospects for real-time dose calculation. This work was funded by ENTERVISION, European Commission FP7 grant 264552

  19. GPU-accelerated micromagnetic simulations using cloud computing

    International Nuclear Information System (INIS)

    Jermain, C.L.; Rowlands, G.E.; Buhrman, R.A.; Ralph, D.C.

    2016-01-01

    Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.

  20. GPU-accelerated micromagnetic simulations using cloud computing

    Energy Technology Data Exchange (ETDEWEB)

    Jermain, C.L., E-mail: clj72@cornell.edu [Cornell University, Ithaca, NY 14853 (United States); Rowlands, G.E.; Buhrman, R.A. [Cornell University, Ithaca, NY 14853 (United States); Ralph, D.C. [Cornell University, Ithaca, NY 14853 (United States); Kavli Institute at Cornell, Ithaca, NY 14853 (United States)

    2016-03-01

    Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.

  1. Fully 3D GPU PET reconstruction

    Energy Technology Data Exchange (ETDEWEB)

    Herraiz, J.L., E-mail: joaquin@nuclear.fis.ucm.es [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Espana, S. [Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA (United States); Cal-Gonzalez, J. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Vaquero, J.J. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Desco, M. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Unidad de Medicina y Cirugia Experimental, Hospital General Universitario Gregorio Maranon, Madrid (Spain); Udias, J.M. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain)

    2011-08-21

    Fully 3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Unit (GPU) has been proposed for many years as potential accelerators in complex scientific problems, but it has not been used until the recent advances in the programmability of GPUs that the best available reconstruction codes have started to be implemented to be run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.

  2. Fully 3D GPU PET reconstruction

    International Nuclear Information System (INIS)

    Herraiz, J.L.; Espana, S.; Cal-Gonzalez, J.; Vaquero, J.J.; Desco, M.; Udias, J.M.

    2011-01-01

    Fully 3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Unit (GPU) has been proposed for many years as potential accelerators in complex scientific problems, but it has not been used until the recent advances in the programmability of GPUs that the best available reconstruction codes have started to be implemented to be run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.

  3. Parallel generation of architecture on the GPU

    KAUST Repository

    Steinberger, Markus

    2014-05-01

    In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

  4. GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence

    International Nuclear Information System (INIS)

    Liu Yuan; Du Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen Linqing

    2012-01-01

    We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of the memory and hardware resource. Our code is tested on the CUDA ‘Fermi’ architecture in a GTX 480 graphics card and its performance is compared with a single core of Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors which in the past required more than 100 CPUs. (paper)

  5. GPU-based Parallel Application Design for Emerging Mobile Devices

    Science.gov (United States)

    Gupta, Kshitij

    A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as

  6. A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors

    Science.gov (United States)

    Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun

    2011-01-01

    Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116

  7. A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors

    Directory of Open Access Journals (Sweden)

    Dennis Akos

    2011-09-01

    Full Text Available Due to their weak received signal power, Global Positioning System (GPS signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs. However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU coupled with a new generation Graphics Processing Unit (GPU having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities.

  8. An efficient spectral crystal plasticity solver for GPU architectures

    Science.gov (United States)

    Malahe, Michael

    2018-03-01

    We present a spectral crystal plasticity (CP) solver for graphics processing unit (GPU) architectures that achieves a tenfold increase in efficiency over prior GPU solvers. The approach makes use of a database containing a spectral decomposition of CP simulations performed using a conventional iterative solver over a parameter space of crystal orientations and applied velocity gradients. The key improvements in efficiency come from reducing global memory transactions, exposing more instruction-level parallelism, reducing integer instructions and performing fast range reductions on trigonometric arguments. The scheme also makes more efficient use of memory than prior work, allowing for larger problems to be solved on a single GPU. We illustrate these improvements with a simulation of 390 million crystal grains on a consumer-grade GPU, which executes at a rate of 2.72 s per strain step.

  9. GPU-Accelerated Text Mining

    International Nuclear Information System (INIS)

    Cui, X.; Mueller, F.; Zhang, Y.; Potok, Thomas E.

    2009-01-01

    Accelerating hardware devices represent a novel promise for improving the performance for many problem domains but it is not clear for which domains what accelerators are suitable. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices

  10. Bridging FPGA and GPU technologies for AO real-time control

    Science.gov (United States)

    Perret, Denis; Lainé, Maxime; Bernard, Julien; Gratadour, Damien; Sevin, Arnaud

    2016-07-01

    Our team has developed a common environment for high performance simulations and real-time control of AO systems based on the use of Graphics Processors Units in the context of the COMPASS project. Such a solution, based on the ability of the real time core in the simulation to provide adequate computing performance, limits the cost of developing AO RTC systems and makes them more scalable. A code developed and validated in the context of the simulation may be injected directly into the system and tested on sky. Furthermore, the use of relatively low cost components also offers significant advantages for the system hardware platform. However, the use of GPUs in an AO loop comes with drawbacks: the traditional way of offloading computation from CPU to GPUs - involving multiple copies and unacceptable overhead in kernel launching - is not well suited in a real time context. This last application requires the implementation of a solution enabling direct memory access (DMA) to the GPU memory from a third party device, bypassing the operating system. This allows this device to communicate directly with the real-time core of the simulation feeding it with the WFS camera pixel stream. We show that DMA between a custom FPGA-based frame-grabber and a computation unit (GPU, FPGA, or Coprocessor such as Xeon-phi) across PCIe allows us to get latencies compatible with what will be needed on ELTs. As a fine-grained synchronization mechanism is not yet made available by GPU vendors, we propose the use of memory polling to avoid interrupts handling and involvement of a CPU. Network and Vision protocols are handled by the FPGA-based Network Interface Card (NIC). We present the results we obtained on a complete AO loop using camera and deformable mirror simulators.

  11. Comparison Of Hybrid Sorting Algorithms Implemented On Different Parallel Hardware Platforms

    Directory of Open Access Journals (Sweden)

    Dominik Zurek

    2013-01-01

    Full Text Available Sorting is a common problem in computer science. There are lot of well-known sorting algorithms created for sequential execution on a single processor. Recently, hardware platforms enable to create wide parallel algorithms. We have standard processors consist of multiple cores and hardware accelerators like GPU. The graphic cards with their parallel architecture give new possibility to speed up many algorithms. In this paper we describe results of implementation of a few different sorting algorithms on GPU cards and multicore processors. Then hybrid algorithm will be presented which consists of parts executed on both platforms, standard CPU and GPU.

  12. GPU-Accelerated Real-Time Surveillance De-Weathering

    OpenAIRE

    Pettersson, Niklas

    2013-01-01

    A fully automatic de-weathering system to increase the visibility/stability in surveillance applications during bad weather has been developed. Rain, snow and haze during daylight are handled in real-time performance with acceleration from CUDA implemented algorithms. Video from fixed cameras is processed on a PC with no need of special hardware except an NVidia GPU. The system does not use any background model and does not require any precalibration. Increase in contrast is obtained in all h...

  13. High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

    International Nuclear Information System (INIS)

    Nieto, J.; Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M.; Vega, J.; Castro, R.

    2016-01-01

    Highlights: • To test an image acquisition and processing system for Camera Link devices based in a FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to a NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPUs processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies that are being used in real time image processing are Field Programmable Gate Array (FPGA) and Graphical Processor Unit (GPU) due to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated in data acquisition systems, where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, to develop image acquisition and processing systems based on FPGAs and GPUs compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology), and a NVIDIA Tesla GPU series card has been developed and tested. The architecture proposed has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving the data directly between the different hardware elements (FPGA DAQ-GPU-CPU) avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. A special effort has been put to provide a development methodology that, maintaining the highest possible abstraction from the low level implementation details, allows obtaining solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.

  14. High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

    Energy Technology Data Exchange (ETDEWEB)

    Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M. [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Vega, J.; Castro, R. [Asociación EURATOM/CIEMAT para Fusión, Madrid (Spain)

    2016-11-15

    Highlights: • To test an image acquisition and processing system for Camera Link devices based in a FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to a NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPUs processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies that are being used in real time image processing are Field Programmable Gate Array (FPGA) and Graphical Processor Unit (GPU) due to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated in data acquisition systems, where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, to develop image acquisition and processing systems based on FPGAs and GPUs compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology), and a NVIDIA Tesla GPU series card has been developed and tested. The architecture proposed has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving the data directly between the different hardware elements (FPGA DAQ-GPU-CPU) avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. A special effort has been put to provide a development methodology that, maintaining the highest possible abstraction from the low level implementation details, allows obtaining solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.

  15. Exploiting graphics processing units for computational biology and bioinformatics.

    Science.gov (United States)

    Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H

    2010-09-01

    Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of generalpurpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.

  16. Parallel GPU implementation of iterative PCA algorithms.

    Science.gov (United States)

    Andrecut, M

    2009-11-01

    Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets, the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).

  17. Parallel, distributed and GPU computing technologies in single-particle electron microscopy.

    Science.gov (United States)

    Schmeisser, Martin; Heisen, Burkhard C; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

    2009-07-01

    Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today's technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined.

  18. GPU Lossless Hyperspectral Data Compression System for Space Applications

    Science.gov (United States)

    Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

    2012-01-01

    On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA(Trademark). The GPU implementation on a NVIDIA(Trademark) GeForce(Trademark) GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel(Trademark) Xeon(Trademark) processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.

  19. Advantages of GPU technology in DFT calculations of intercalated graphene

    Science.gov (United States)

    Pešić, J.; Gajić, R.

    2014-09-01

    Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. A level of the acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FTDT) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications had emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an

  20. Advantages of GPU technology in DFT calculations of intercalated graphene

    International Nuclear Information System (INIS)

    Pešić, J; Gajić, R

    2014-01-01

    Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. A level of the acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FTDT) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications had emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an

  1. Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU.

    Science.gov (United States)

    Arefan, D; Talebpour, A; Ahmadinejhad, N; Kamali Asl, A

    2015-06-01

    Digital Breast Tomosynthesis (DBT) is a technology that creates three dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is in the order of seconds, we can use Tomosynthesis systems to perform Tomosynthesis-guided Interventional procedures. This research has been designed to study ultra-fast image reconstruction technique for Tomosynthesis Mammography systems using Graphics Processing Unit (GPU). At first, projections of Tomosynthesis mammography have been simulated. In order to produce Tomosynthesis projections, it has been designed a 3D breast phantom from empirical data. It is based on MRI data in its natural form. Then, projections have been created from 3D breast phantom. The image reconstruction algorithm based on FBP was programmed with C++ language in two methods using central processing unit (CPU) card and the Graphics Processing Unit (GPU). It calculated the time of image reconstruction in two kinds of programming (using CPU and GPU).

  2. Parallelized Local Volatility Estimation Using GP-GPU Hardware Acceleration

    KAUST Repository

    Douglas, Craig C.; Lee, Hyoseop; Sheen, Dongwoo

    2010-01-01

    We introduce an inverse problem for the local volatility model in option pricing. We solve the problem using the Levenberg-Marquardt algorithm and use the notion of the Fréchet derivative when calculating the Jacobian matrix. We analyze

  3. Basket Option Pricing Using GP-GPU Hardware Acceleration

    KAUST Repository

    Douglas, Craig C.; Lee, Hyoseop

    2010-01-01

    We introduce a basket option pricing problem arisen in financial mathematics. We discretized the problem based on the alternating direction implicit (ADI) method and parallel cyclic reduction is applied to solve the set of tridiagonal matrices

  4. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.

    Science.gov (United States)

    Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G

    2011-07-01

    In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.

  5. Edge-preserving image denoising via group coordinate descent on the GPU.

    Science.gov (United States)

    McGaffin, Madison Gray; Fessler, Jeffrey A

    2015-04-01

    Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel 1D pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates in-place and only store the noisy data, denoised image, and problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation. Both algorithms use the majorize-minimize framework to solve the 1D pixel update subproblem. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration and run-time.

  6. GPU-based Branchless Distance-Driven Projection and Backprojection.

    Science.gov (United States)

    Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong

    2017-12-01

    Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm.

  7. GPU implementations of online track finding algorithms at PANDA

    Energy Technology Data Exchange (ETDEWEB)

    Herten, Andreas; Stockmanns, Tobias; Ritman, James [Institut fuer Kernphysik, Forschungszentrum Juelich GmbH (Germany); Adinetz, Andrew; Pleiter, Dirk [Juelich Supercomputing Centre, Forschungszentrum Juelich GmbH (Germany); Kraus, Jiri [NVIDIA GmbH (Germany); Collaboration: PANDA-Collaboration

    2014-07-01

    The PANDA experiment is a hadron physics experiment that will investigate antiproton annihilation in the charm quark mass region. The experiment is now being constructed as one of the main parts of the FAIR facility. At an event rate of 2 . 10{sup 7}/s a data rate of 200 GB/s is expected. A reduction of three orders of magnitude is required in order to save the data for further offline analysis. Since signal and background processes at PANDA have similar signatures, no hardware-level trigger is foreseen for the experiment. Instead, a fast online event filter is substituting this element. We investigate the possibility of using graphics processing units (GPUs) for the online tracking part of this task. Researched algorithms are a Hough Transform, a track finder involving Riemann surfaces, and the novel, PANDA-specific Triplet Finder. This talk shows selected advances in the implementations as well as performance evaluations of the GPU tracking algorithms to be used at the PANDA experiment.

  8. Prefiltering Model for Homology Detection Algorithms on GPU.

    Science.gov (United States)

    Retamosa, Germán; de Pedro, Luis; González, Ivan; Tamames, Javier

    2016-01-01

    Homology detection has evolved over the time from heavy algorithms based on dynamic programming approaches to lightweight alternatives based on different heuristic models. However, the main problem with these algorithms is that they use complex statistical models, which makes it difficult to achieve a relevant speedup and find exact matches with the original results. Thus, their acceleration is essential. The aim of this article was to prefilter a sequence database. To make this work, we have implemented a groundbreaking heuristic model based on NVIDIA's graphics processing units (GPUs) and multicore processors. Depending on the sensitivity settings, this makes it possible to quickly reduce the sequence database by factors between 50% and 95%, while rejecting no significant sequences. Furthermore, this prefiltering application can be used together with multiple homology detection algorithms as a part of a next-generation sequencing system. Extensive performance and accuracy tests have been carried out in the Spanish National Centre for Biotechnology (NCB). The results show that GPU hardware can accelerate the execution times of former homology detection applications, such as National Centre for Biotechnology Information (NCBI), Basic Local Alignment Search Tool for Proteins (BLASTP), up to a factor of 4.

  9. Gfargo: Fargo for Gpu

    Science.gov (United States)

    Masset, Frédéric

    2015-09-01

    GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionnalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.

  10. Development of parallel GPU based algorithms for problems in nuclear area; Desenvolvimento de algoritmos paralelos baseados em GPU para solucao de problemas na area nuclear

    Energy Technology Data Exchange (ETDEWEB)

    Almeida, Adino Americo Heimlich

    2009-07-01

    Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in two typical problems of Nuclear area. The neutron transport simulation using Monte Carlo method and solve heat equation in a bi-dimensional domain by finite differences method. To achieve this, we develop parallel algorithms for GPU and CPU in the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad core processors, without precision loss. (author)

  11. Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method

    Science.gov (United States)

    Januszewski, M.; Kostur, M.

    2014-09-01

    We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License, version 3 No. of lines in distributed program, including test data, etc.: 225864 No. of bytes in distributed program, including test data, etc.: 46861049 Distribution format: tar.gz Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice

  12. Hardware malware

    CERN Document Server

    Krieg, Christian

    2013-01-01

    In our digital world, integrated circuits are present in nearly every moment of our daily life. Even when using the coffee machine in the morning, or driving our car to work, we interact with integrated circuits. The increasing spread of information technology in virtually all areas of life in the industrialized world offers a broad range of attack vectors. So far, mainly software-based attacks have been considered and investigated, while hardware-based attacks have attracted comparatively little interest. The design and production process of integrated circuits is mostly decentralized due to

  13. Bayer image parallel decoding based on GPU

    Science.gov (United States)

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.

  14. ALICE HLT high speed tracking on GPU

    CERN Document Server

    Gorbunov, Sergey; Aamodt, Kenneth; Alt, Torsten; Appelshauser, Harald; Arend, Andreas; Bach, Matthias; Becker, Bruce; Bottger, Stefan; Breitner, Timo; Busching, Henner; Chattopadhyay, Sukalyan; Cleymans, Jean; Cicalo, Corrado; Das, Indranil; Djuvsland, Oystein; Engel, Heiko; Erdal, Hege Austrheim; Fearick, Roger; Haaland, Oystein Senneset; Hille, Per Thomas; Kalcher, Sebastian; Kanaki, Kalliopi; Kebschull, Udo Wolfgang; Kisel, Ivan; Kretz, Matthias; Lara, Camillo; Lindal, Sven; Lindenstruth, Volker; Masoodi, Arshad Ahmad; Ovrebekk, Gaute; Panse, Ralf; Peschek, Jorg; Ploskon, Mateusz; Pocheptsov, Timur; Ram, Dinesh; Rascanu, Theodor; Richter, Matthias; Rohrich, Dieter; Ronchetti, Federico; Skaali, Bernhard; Smorholm, Olav; Stokkevag, Camilla; Steinbeck, Timm Morten; Szostak, Artur; Thader, Jochen; Tveter, Trine; Ullaland, Kjetil; Vilakazi, Zeblon; Weis, Robert; Yin, Zhong-Bao; Zelnicek, Pierre

    2011-01-01

    The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding to an inp ut data stream of 30 GB/s. In order to fulfill the time requirements, a fast on-line tracker has been developed. The algorithm combines a Cellular Automaton method being used for a fast pattern recognition and the Kalman Filter method for fitting of found trajectories and for the final track selection. The tracker was adapted to run on Graphics Processing Units (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The implementation of the algorithm had to be adjusted at many points to allow for an efficient usage of the graphics cards. In particular, achieving a good overall workload for many processor cores, efficient transfer to and from the GPU, as well as optimized utilization of the different memories the GPU offers turned out to be cri...

  15. Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU

    Directory of Open Access Journals (Sweden)

    Zhou Zhe

    2017-04-01

    Full Text Available According to previous reports, information could be leaked from GPU memory; however, the security implications of such a threat were mostly over-looked, because only limited information could be indirectly extracted through side-channel attacks. In this paper, we propose a novel algorithm for recovering raw data directly from the GPU memory residues of many popular applications such as Google Chrome and Adobe PDF reader. Our algorithm enables harvesting highly sensitive information including credit card numbers and email contents from GPU memory residues. Evaluation results also indicate that nearly all GPU-accelerated applications are vulnerable to such attacks, and adversaries can launch attacks without requiring any special privileges both on traditional multi-user operating systems, and emerging cloud computing scenarios.

  16. Multi-GPU configuration of 4D intensity modulated radiation therapy inverse planning using global optimization

    Science.gov (United States)

    Hagan, Aaron; Sawant, Amit; Folkerts, Michael; Modiri, Arezoo

    2018-01-01

    We report on the design, implementation and characterization of a multi-graphic processing unit (GPU) computational platform for higher-order optimization in radiotherapy treatment planning. In collaboration with a commercial vendor (Varian Medical Systems, Palo Alto, CA), a research prototype GPU-enabled Eclipse (V13.6) workstation was configured. The hardware consisted of dual 8-core Xeon processors, 256 GB RAM and four NVIDIA Tesla K80 general purpose GPUs. We demonstrate the utility of this platform for large radiotherapy optimization problems through the development and characterization of a parallelized particle swarm optimization (PSO) four dimensional (4D) intensity modulated radiation therapy (IMRT) technique. The PSO engine was coupled to the Eclipse treatment planning system via a vendor-provided scripting interface. Specific challenges addressed in this implementation were (i) data management and (ii) non-uniform memory access (NUMA). For the former, we alternated between parameters over which the computation process was parallelized. For the latter, we reduced the amount of data required to be transferred over the NUMA bridge. The datasets examined in this study were approximately 300 GB in size, including 4D computed tomography images, anatomical structure contours and dose deposition matrices. For evaluation, we created a 4D-IMRT treatment plan for one lung cancer patient and analyzed computation speed while varying several parameters (number of respiratory phases, GPUs, PSO particles, and data matrix sizes). The optimized 4D-IMRT plan enhanced sparing of organs at risk by an average reduction of 26% in maximum dose, compared to the clinical optimized IMRT plan, where the internal target volume was used. We validated our computation time analyses in two additional cases. The computation speed in our implementation did not monotonically increase with the number of GPUs. The optimal number of GPUs (five, in our study) is directly related to the

  17. Incompressible SPH (ISPH) with fast Poisson solver on a GPU

    Science.gov (United States)

    Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.

    2018-05-01

    This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU) capable of simulating several millions of particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed with this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as 4 separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dambreaks (2-D and 3-D) which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dambreak simulation, GPU speed ups demonstrate up to 10-18 times and 1.1-4.5 times compared to single-threaded and 16-threaded CPU run times respectively.

  18. Forward and adjoint spectral-element simulations of seismic wave propagation using hardware accelerators

    Science.gov (United States)

    Peter, Daniel; Videau, Brice; Pouget, Kevin; Komatitsch, Dimitri

    2015-04-01

    Improving the resolution of tomographic images is crucial to answer important questions on the nature of Earth's subsurface structure and internal processes. Seismic tomography is the most prominent approach where seismic signals from ground-motion records are used to infer physical properties of internal structures such as compressional- and shear-wave speeds, anisotropy and attenuation. Recent advances in regional- and global-scale seismic inversions move towards full-waveform inversions which require accurate simulations of seismic wave propagation in complex 3D media, providing access to the full 3D seismic wavefields. However, these numerical simulations are computationally very expensive and need high-performance computing (HPC) facilities for further improving the current state of knowledge. During recent years, many-core architectures such as graphics processing units (GPUs) have been added to available large HPC systems. Such GPU-accelerated computing together with advances in multi-core central processing units (CPUs) can greatly accelerate scientific applications. There are mainly two possible choices of language support for GPU cards, the CUDA programming environment and OpenCL language standard. CUDA software development targets NVIDIA graphic cards while OpenCL was adopted mainly by AMD graphic cards. In order to employ such hardware accelerators for seismic wave propagation simulations, we incorporated a code generation tool BOAST into an existing spectral-element code package SPECFEM3D_GLOBE. This allows us to use meta-programming of computational kernels and generate optimized source code for both CUDA and OpenCL languages, running simulations on either CUDA or OpenCL hardware accelerators. We show here applications of forward and adjoint seismic wave propagation on CUDA/OpenCL GPUs, validating results and comparing performances for different simulations and hardware usages.

  19. A Modular Approach to Arithmetic and Logic Unit Design on a Reconfigurable Hardware Platform for Educational Purpose

    Science.gov (United States)

    Oztekin, Halit; Temurtas, Feyzullah; Gulbag, Ali

    The Arithmetic and Logic Unit (ALU) design is one of the important topics in Computer Architecture and Organization course in Computer and Electrical Engineering departments. There are ALU designs that have non-modular nature to be used as an educational tool. As the programmable logic technology has developed rapidly, it is feasible that ALU design based on Field Programmable Gate Array (FPGA) is implemented in this course. In this paper, we have adopted the modular approach to ALU design based on FPGA. All the modules in the ALU design are realized using schematic structure on Altera's Cyclone II Development board. Under this model, the ALU content is divided into four distinct modules. These are arithmetic unit except for multiplication and division operations, logic unit, multiplication unit and division unit. User can easily design any size of ALU unit since this approach has the modular nature. Then, this approach was applied to microcomputer architecture design named BZK.SAU.FPGA10.0 instead of the current ALU unit.

  20. Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)

    Science.gov (United States)

    Ramirez, Andres; Rahnemoonfar, Maryam

    2017-04-01

    A hyperspectral image provides multidimensional figure rich in data consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such image with linear and non-linear algorithms will result in high computational time. In order to overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyzing a hyperspectral image through the usage of parallel hardware and a parallel programming model, which will be simpler to handle compared to other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy results and timing results between the Hadoop and GPU system and tested it against the following test cases: the CPU and GPU test case, a CPU test case and a test case where no dimensional reduction was applied.

  1. GPU accelerated study of heat transfer and fluid flow by lattice Boltzmann method on CUDA

    Science.gov (United States)

    Ren, Qinlong

    Lattice Boltzmann method (LBM) has been developed as a powerful numerical approach to simulate the complex fluid flow and heat transfer phenomena during the past two decades. As a mesoscale method based on the kinetic theory, LBM has several advantages compared with traditional numerical methods such as physical representation of microscopic interactions, dealing with complex geometries and highly parallel nature. Lattice Boltzmann method has been applied to solve various fluid behaviors and heat transfer process like conjugate heat transfer, magnetic and electric field, diffusion and mixing process, chemical reactions, multiphase flow, phase change process, non-isothermal flow in porous medium, microfluidics, fluid-structure interactions in biological system and so on. In addition, as a non-body-conformal grid method, the immersed boundary method (IBM) could be applied to handle the complex or moving geometries in the domain. The immersed boundary method could be coupled with lattice Boltzmann method to study the heat transfer and fluid flow problems. Heat transfer and fluid flow are solved on Euler nodes by LBM while the complex solid geometries are captured by Lagrangian nodes using immersed boundary method. Parallel computing has been a popular topic for many decades to accelerate the computational speed in engineering and scientific fields. Today, almost all the laptop and desktop have central processing units (CPUs) with multiple cores which could be used for parallel computing. However, the cost of CPUs with hundreds of cores is still high which limits its capability of high performance computing on personal computer. Graphic processing units (GPU) is originally used for the computer video cards have been emerged as the most powerful high-performance workstation in recent years. Unlike the CPUs, the cost of GPU with thousands of cores is cheap. For example, the GPU (GeForce GTX TITAN) which is used in the current work has 2688 cores and the price is only 1

  2. Personal Supercomputing for Monte Carlo Simulation Using a GPU

    Energy Technology Data Exchange (ETDEWEB)

    Oh, Jae-Yong; Koo, Yang-Hyun; Lee, Byung-Ho [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)

    2008-05-15

    Since the usability, accessibility, and maintenance of a personal computer (PC) are very good, a PC is a useful computer simulation tool for researchers. It has enough calculation power to simulate a small scale system with the improved performance of a PC's CPU. However, if a system is large or long time scale, we need a cluster computer or supercomputer. Recently great changes have occurred in the PC calculation environment. A graphic process unit (GPU) on a graphic card, only used to calculate display data, has a superior calculation capability to a PC's CPU. This GPU calculation performance is a match for the supercomputer in 2000. Although it has such a great calculation potential, it is not easy to program a simulation code for GPU due to difficult programming techniques for converting a calculation matrix to a 3D rendering image using graphic APIs. In 2006, NVIDIA provided the Software Development Kit (SDK) for the programming environment for NVIDIA's graphic cards, which is called the Compute Unified Device Architecture (CUDA). It makes the programming on the GPU easy without knowledge of the graphic APIs. This paper describes the basic architectures of NVIDIA's GPU and CUDA, and carries out a performance benchmark for the Monte Carlo simulation.

  3. Personal Supercomputing for Monte Carlo Simulation Using a GPU

    International Nuclear Information System (INIS)

    Oh, Jae-Yong; Koo, Yang-Hyun; Lee, Byung-Ho

    2008-01-01

    Since the usability, accessibility, and maintenance of a personal computer (PC) are very good, a PC is a useful computer simulation tool for researchers. It has enough calculation power to simulate a small scale system with the improved performance of a PC's CPU. However, if a system is large or long time scale, we need a cluster computer or supercomputer. Recently great changes have occurred in the PC calculation environment. A graphic process unit (GPU) on a graphic card, only used to calculate display data, has a superior calculation capability to a PC's CPU. This GPU calculation performance is a match for the supercomputer in 2000. Although it has such a great calculation potential, it is not easy to program a simulation code for GPU due to difficult programming techniques for converting a calculation matrix to a 3D rendering image using graphic APIs. In 2006, NVIDIA provided the Software Development Kit (SDK) for the programming environment for NVIDIA's graphic cards, which is called the Compute Unified Device Architecture (CUDA). It makes the programming on the GPU easy without knowledge of the graphic APIs. This paper describes the basic architectures of NVIDIA's GPU and CUDA, and carries out a performance benchmark for the Monte Carlo simulation

  4. Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems

    Directory of Open Access Journals (Sweden)

    Yong Chen

    2017-01-01

    Full Text Available Static state security analysis (SSSA is one of the most important computations to check whether a power system is in normal and secure operating state. It is a challenge to satisfy real-time requirements with CPU-based concurrent methods due to the intensive computations. A sensitivity analysis-based method with Graphics processing unit (GPU is proposed for power systems, which can reduce calculation time by 40% compared to the execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, a multifrontal method for sparse LU factorization is explored on GPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations can achieve a significant reduction in computation time.

  5. Development of parallel GPU based algorithms for problems in nuclear area

    International Nuclear Information System (INIS)

    Almeida, Adino Americo Heimlich

    2009-01-01

    Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in two typical problems of Nuclear area. The neutron transport simulation using Monte Carlo method and solve heat equation in a bi-dimensional domain by finite differences method. To achieve this, we develop parallel algorithms for GPU and CPU in the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad core processors, without precision loss. (author)

  6. A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

    Science.gov (United States)

    Nagaoka, Tomoaki; Watanabe, Soichi

    2010-01-01

    Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.

  7. Simulation of isothermal multi-phase fuel-coolant interaction using MPS method with GPU acceleration

    Energy Technology Data Exchange (ETDEWEB)

    Gou, W.; Zhang, S.; Zheng, Y. [Zhejiang Univ., Hangzhou (China). Center for Engineering and Scientific Computation

    2016-07-15

    The energetic fuel-coolant interaction (FCI) has been one of the primary safety concerns in nuclear power plants. Graphical processing unit (GPU) implementation of the moving particle semi-implicit (MPS) method is presented and used to simulate the fuel coolant interaction problem. The governing equations are discretized with the particle interaction model of MPS. Detailed implementation on single-GPU is introduced. The three-dimensional broken dam is simulated to verify the developed GPU acceleration MPS method. The proposed GPU acceleration algorithm and developed code are then used to simulate the FCI problem. As a summary of results, the developed GPU-MPS method showed a good agreement with the experimental observation and theoretical prediction.

  8. GPU-based high performance Monte Carlo simulation in neutron transport

    Energy Technology Data Exchange (ETDEWEB)

    Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil). Lab. de Inteligencia Artificial Aplicada], e-mail: cmnap@ien.gov.br

    2009-07-01

    Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)

  9. GPU-based high performance Monte Carlo simulation in neutron transport

    International Nuclear Information System (INIS)

    Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A.

    2009-01-01

    Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)

  10. Accelerating Pseudo-Random Number Generator for MCNP on GPU

    Science.gov (United States)

    Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu

    2010-09-01

    Pseudo-random number generators (PRNG) are intensively used in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in Monte Carlo N-Particle Transport Code (MCNP) requires long period, high quality, flexible jump and fast enough. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPU) using CUDA programming model. Results shows that 3.80 to 8.10 times speedup are achieved compared with 4 to 6 cores CPUs and more than 679.18 million double precision random numbers can be generated per second on GPU.

  11. How General-Purpose can a GPU be?

    Directory of Open Access Journals (Sweden)

    Philip Machanick

    2015-12-01

    Full Text Available The use of graphics processing units (GPUs in general-purpose computation (GPGPU is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw from a range of single instruction multiple datastream (SIMD architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD instruction sets has been of application on a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. This paper proposes a reconceptualization of the GPU as a multicore design with minimal exotic modes of parallelism so as to make GPGPU truly general.

  12. GPU's for event reconstruction in the FairRoot framework

    International Nuclear Information System (INIS)

    Al-Turany, M; Uhlig, F; Karabowicz, R

    2010-01-01

    FairRoot is the simulation and analysis framework used by CBM and PANDA experiments at FAIR/GSI. The use of graphics processor units (GPUs) for event reconstruction in FairRoot will be presented. The fact that CUDA (Nvidia's Compute Unified Device Architecture) development tools work alongside the conventional C/C++ compiler, makes it possible to mix GPU code with general-purpose code for the host CPU, based on this some of the reconstruction tasks can be send to the graphic cards. Moreover, tasks that run on the GPU's can also run in emulation mode on the host CPU, which has the advantage that the same code is used on both CPU and GPU.

  13. Medical image processing on the GPU - past, present and future.

    Science.gov (United States)

    Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M

    2013-12-01

    Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges. Copyright © 2013 Elsevier B.V. All rights reserved.

  14. Overview of implementation of DARPA GPU program in SAIC

    Science.gov (United States)

    Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis

    2008-04-01

    This paper reviews the implementation of DARPA MTO STAP-BOY program for both Phase I and II conducted at Science Applications International Corporation (SAIC). The STAP-BOY program conducts fast covariance factorization and tuning techniques for space-time adaptive process (STAP) Algorithm Implementation on Graphics Processor unit (GPU) Architectures for Embedded Systems. The first part of our presentation on the DARPA STAP-BOY program will focus on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm will be implemented on the GPU, using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. This algorithm will include fast range adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.

  15. gPGA: GPU Accelerated Population Genetics Analyses.

    Directory of Open Access Journals (Sweden)

    Chunbao Zhou

    Full Text Available The isolation with migration (IM model is important for studies in population genetics and phylogeography. IM program applies the IM model to genetic data drawn from a pair of closely related populations or species based on Markov chain Monte Carlo (MCMC simulations of gene genealogies. But computational burden of IM program has placed limits on its application.With strong computational power, Graphics Processing Unit (GPU has been widely used in many fields. In this article, we present an effective implementation of IM program on one GPU based on Compute Unified Device Architecture (CUDA, which we call gPGA.Compared with IM program, gPGA can achieve up to 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA.

  16. GPU-Boosted Camera-Only Indoor Localization

    DEFF Research Database (Denmark)

    Özkil, Ali Gürcan; Fan, Zhun; Kristensen, Jens Klæstrup

    relies on local image features detection, description and matching; by parallelizing these computationally intensive tasks on the graphical processing unit (GPU), it is possible to do online localization using a Topometric Appearance Map. The method is developed as an integral part of a mobile service...

  17. System matrix computation vs storage on GPU: A comparative study in cone beam CT.

    Science.gov (United States)

    Matenine, Dmitri; Côté, Geoffroi; Mascolo-Fortin, Julia; Goussard, Yves; Després, Philippe

    2018-02-01

    Iterative reconstruction algorithms in computed tomography (CT) require a fast method for computing the intersection distances between the trajectories of photons and the object, also called ray tracing or system matrix computation. This work focused on the thin-ray model is aimed at comparing different system matrix handling strategies using graphical processing units (GPUs). In this work, the system matrix is modeled by thin rays intersecting a regular grid of box-shaped voxels, known to be an accurate representation of the forward projection operator in CT. However, an uncompressed system matrix exceeds the random access memory (RAM) capacities of typical computers by one order of magnitude or more. Considering the RAM limitations of GPU hardware, several system matrix handling methods were compared: full storage of a compressed system matrix, on-the-fly computation of its coefficients, and partial storage of the system matrix with partial on-the-fly computation. These methods were tested on geometries mimicking a cone beam CT (CBCT) acquisition of a human head. Execution times of three routines of interest were compared: forward projection, backprojection, and ordered-subsets convex (OSC) iteration. A fully stored system matrix yielded the shortest backprojection and OSC iteration times, with a 1.52× acceleration for OSC when compared to the on-the-fly approach. Nevertheless, the maximum problem size was bound by the available GPU RAM and geometrical symmetries. On-the-fly coefficient computation did not require symmetries and was shown to be the fastest for forward projection. It also offered reasonable execution times of about 176.4 ms per view per OSC iteration for a detector of 512 × 448 pixels and a volume of 384 3 voxels, using commodity GPU hardware. Partial system matrix storage has shown a performance similar to the on-the-fly approach, while still relying on symmetries. Partial system matrix storage was shown to yield the lowest relative

  18. Application of GPU to Multi-interfaces Advection and Reconstruction Solver (MARS)

    International Nuclear Information System (INIS)

    Nagatake, Taku; Takase, Kazuyuki; Kunugi, Tomoaki

    2010-01-01

    In the nuclear engineering fields, a high performance computer system is necessary to perform the large scale computations. Recently, a Graphics Processing Unit (GPU) has been developed as a rendering computational system in order to reduce a Central Processing Unit (CPU) load. In the graphics processing, the high performance computing is needed to render the high-quality 3D objects in some video games. Thus the GPU consists of many processing units and a wide memory bandwidth. In this study, the Multi-interfaces Advection and Reconstruction Solver (MARS) which is one of the interface volume tracking methods for multi-phase flows has been performed. The multi-phase flow computation is very important for the nuclear reactors and other engineering fields. The MARS consists of two computing parts: the interface tracking part and the fluid motion computing part. As for the interface tracking part, the performance of GPU (GTX280) was 6 times faster than that of the CPU (Dual-Xeon 5040), and in the fluid motion computing part the Poisson Solver by the GPU (GTX285) was 22 times faster than that by the CPU(Core i7). As for the Dam Breaking Problem, the result of GPU-MARS showed slightly different from the experimental result. Because the GPU-MARS was developed using the single-precision GPU, it can be considered that the round-off error might be accumulated. (author)

  19. CUDA/GPU Technology : Parallel Programming For High Performance Scientific Computing

    OpenAIRE

    YUHENDRA; KUZE, Hiroaki; JOSAPHAT, Tetuko Sri Sumantyo

    2009-01-01

    [ABSTRACT]Graphics processing units (GP Us) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. In the high performance computation capabilities, graphic processing units (GPU) lead to much more powerful performance than conventional CPUs by means of parallel processing. In 2007, the birth of Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in the general purpose GPU a...

  20. TH-A-19A-12: A GPU-Accelerated and Monte Carlo-Based Intensity Modulated Proton Therapy Optimization System

    Energy Technology Data Exchange (ETDEWEB)

    Ma, J; Wan Chan Tseung, H; Beltran, C [Mayo Clinic, Rochester, MN (United States)

    2014-06-15

    Purpose: To develop a clinically applicable intensity modulated proton therapy (IMPT) optimization system that utilizes more accurate Monte Carlo (MC) dose calculation, rather than analytical dose calculation. Methods: A very fast in-house graphics processing unit (GPU) based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified gradient based optimization method was used to achieve the desired dose volume histograms (DVH). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve the spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that Result from maintaining the intrinsic CT resolution and large number of proton spots. The dose effects were studied particularly in cases with heterogeneous materials in comparison with the commercial treatment planning system (TPS). Results: For a relatively large and complex three-field bi-lateral head and neck case (i.e. >100K spots with a target volume of ∼1000 cc and multiple surrounding critical structures), the optimization together with the initial MC dose influence map calculation can be done in a clinically viable time frame (i.e. less than 15 minutes) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The DVHs of the MC TPS plan compare favorably with those of a commercial treatment planning system. Conclusion: A GPU accelerated and MC-based IMPT optimization system was developed. The dose calculation and plan optimization can be performed in less than 15 minutes on a hardware system costing less than 45,000 dollars. The fast calculation and optimization makes the system easily expandable to robust and multi-criteria optimization. This work was funded in part by a grant from Varian Medical Systems, Inc.

  1. TH-A-19A-12: A GPU-Accelerated and Monte Carlo-Based Intensity Modulated Proton Therapy Optimization System

    International Nuclear Information System (INIS)

    Ma, J; Wan Chan Tseung, H; Beltran, C

    2014-01-01

    Purpose: To develop a clinically applicable intensity modulated proton therapy (IMPT) optimization system that utilizes more accurate Monte Carlo (MC) dose calculation, rather than analytical dose calculation. Methods: A very fast in-house graphics processing unit (GPU) based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified gradient based optimization method was used to achieve the desired dose volume histograms (DVH). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve the spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that Result from maintaining the intrinsic CT resolution and large number of proton spots. The dose effects were studied particularly in cases with heterogeneous materials in comparison with the commercial treatment planning system (TPS). Results: For a relatively large and complex three-field bi-lateral head and neck case (i.e. >100K spots with a target volume of ∼1000 cc and multiple surrounding critical structures), the optimization together with the initial MC dose influence map calculation can be done in a clinically viable time frame (i.e. less than 15 minutes) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The DVHs of the MC TPS plan compare favorably with those of a commercial treatment planning system. Conclusion: A GPU accelerated and MC-based IMPT optimization system was developed. The dose calculation and plan optimization can be performed in less than 15 minutes on a hardware system costing less than 45,000 dollars. The fast calculation and optimization makes the system easily expandable to robust and multi-criteria optimization. This work was funded in part by a grant from Varian Medical Systems, Inc

  2. ARCHERRT - a GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: software development and application to helical tomotherapy.

    Science.gov (United States)

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X George

    2014-07-01

    Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm

  3. ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    Science.gov (United States)

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George

    2014-01-01

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified

  4. GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

    Science.gov (United States)

    Xie, Lang; Luo, Yi-han; Bao, Qi-liang

    2013-08-01

    GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.

  5. The Acquisition and Control Design for Vacuum Unit of an Electron Beam Machine Using Remote Manual, Software and Hardware

    International Nuclear Information System (INIS)

    Sudiyanto; Prajitno

    2002-01-01

    The Acquisition and Control design for vacuum unit of an Electron Beam Machine using Remote manual, Software and Hardwire have been done. For Remote Manual system open/close of pneumatic valves can be done by using 220 Vac/12 Vdc relay equipped with long cable and switches on the control panel. An indicator lamp mentioning ready/not ready status of the vacuum unit would be the main indicator in making decisions to open/close the pneumatic valves. On the software method the acquisition and controlled would be done by using the Distributed Control System which have already been developed recently. The references voltage on he vacuum level of 10 -2 Torr and 10 -6 Torr would be proceeded by using ADC techniques of PCL-718 and recorded on the software system as a references data base in making an open/closed pneumatic decision. On the Hardwire method, on/off controlling of the pneumatic valves could be done by using voltage comparison by using logic circuitry where the vacuum references level of 10 -2 Torr and 10 -6 Torr have already been taken, monitored by penning gauge. The hardwire method is the fastest in response time than the others. (author)

  6. Evolution of GPU nuclear's training program

    International Nuclear Information System (INIS)

    Long, R.L.; Coe, R.P.

    1987-01-01

    GPU Nuclear Corporation (GPUN) manages the operators of Three Mile Island Unit 1 and Oyster Creek Nuclear Generating Stations and the recovery activities at the Three Mile Island Unit 2 plant. From the time it was formed in January 1980 GPUN emphasized the use of behavioral learning objectives as the basis for all its training programs. This paper describes the evolution to a formalized performance based Training System Development (TSD) Process. The Training and Education Department staff increased from 10 in 1979 to the current 120 dedicated professionals, with a corresponding increase in facilities and acquisition of sophisticated Basic Principles Training Simulators and a Three Mile Island Unit 1 control Room Replica Simulator. The impact of these developments and achievement of full INPO accreditation are discussed and related to plant performance improvements

  7. Area-delay trade-offs of texture decompressors for a graphics processing unit

    Science.gov (United States)

    Novoa Súñer, Emilio; Ituero, Pablo; López-Vallejo, Marisa

    2011-05-01

    Graphics Processing Units have become a booster for the microelectronics industry. However, due to intellectual property issues, there is a serious lack of information on implementation details of the hardware architecture that is behind GPUs. For instance, the way texture is handled and decompressed in a GPU to reduce bandwidth usage has never been dealt with in depth from a hardware point of view. This work addresses a comparative study on the hardware implementation of different texture decompression algorithms for both conventional (PCs and video game consoles) and mobile platforms. Circuit synthesis is performed targeting both a reconfigurable hardware platform and a 90nm standard cell library. Area-delay trade-offs have been extensively analyzed, which allows us to compare the complexity of decompressors and thus determine suitability of algorithms for systems with limited hardware resources.

  8. GPU-based cone beam computed tomography.

    Science.gov (United States)

    Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian

    2010-06-01

    The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256(3) takes up to 25 min on a standard system. Recent developments in the area of Graphic Processing Units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on a NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256(3) is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512(3) volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.

  9. GPU-accelerated adjoint algorithmic differentiation

    Science.gov (United States)

    Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

    2016-03-01

    Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the ;tape;. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.

  10. GPU-based large-scale visualization

    KAUST Repository

    Hadwiger, Markus

    2013-11-19

    Recent advances in image and volume acquisition as well as computational advances in simulation have led to an explosion of the amount of data that must be visualized and analyzed. Modern techniques combine the parallel processing power of GPUs with out-of-core methods and data streaming to enable the interactive visualization of giga- and terabytes of image and volume data. A major enabler for interactivity is making both the computational and the visualization effort proportional to the amount of data that is actually visible on screen, decoupling it from the full data size. This leads to powerful display-aware multi-resolution techniques that enable the visualization of data of almost arbitrary size. The course consists of two major parts: An introductory part that progresses from fundamentals to modern techniques, and a more advanced part that discusses details of ray-guided volume rendering, novel data structures for display-aware visualization and processing, and the remote visualization of large online data collections. You will learn how to develop efficient GPU data structures and large-scale visualizations, implement out-of-core strategies and concepts such as virtual texturing that have only been employed recently, as well as how to use modern multi-resolution representations. These approaches reduce the GPU memory requirements of extremely large data to a working set size that fits into current GPUs. You will learn how to perform ray-casting of volume data of almost arbitrary size and how to render and process gigapixel images using scalable, display-aware techniques. We will describe custom virtual texturing architectures as well as recent hardware developments in this area. We will also describe client/server systems for distributed visualization, on-demand data processing and streaming, and remote visualization. We will describe implementations using OpenGL as well as CUDA, exploiting parallelism on GPUs combined with additional asynchronous

  11. Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space

    Directory of Open Access Journals (Sweden)

    Richard Wilton

    2015-03-01

    Full Text Available When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU hardware.We followed this approach in implementing a read aligner called Arioc that uses GPU-based parallel sort and reduction techniques to identify high-priority locations where potential alignments may be found. We then carried out a read-by-read comparison of Arioc’s reported alignments with the alignments found by several leading read aligners. With simulated reads, Arioc has comparable or better accuracy than the other read aligners we tested. With human sequencing reads, Arioc demonstrates significantly greater throughput than the other aligners we evaluated across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.

  12. Parallel Computer System for 3D Visualization Stereo on GPU

    Science.gov (United States)

    Al-Oraiqat, Anas M.; Zori, Sergii A.

    2018-03-01

    This paper proposes the organization of a parallel computer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.

  13. Parallel implementation of DNA sequences matching algorithms using PWM on GPU architecture.

    Science.gov (United States)

    Sharma, Rahul; Gupta, Nitin; Narang, Vipin; Mittal, Ankush

    2011-01-01

    Positional Weight Matrices (PWMs) are widely used in representation and detection of Transcription Factor Of Binding Sites (TFBSs) on DNA. We implement online PWM search algorithm over parallel architecture. A large PWM data can be processed on Graphic Processing Unit (GPU) systems in parallel which can help in matching sequences at a faster rate. Our method employs extensive usage of highly multithreaded architecture and shared memory of multi-cored GPU. An efficient use of shared memory is required to optimise parallel reduction in CUDA. Our optimised method has a speedup of 230-280x over linear implementation on GPU named GeForce GTX 280.

  14. Accelerating the Non-equispaced Fast Fourier Transform on Commodity Graphics Hardware

    DEFF Research Database (Denmark)

    Sørensen, Thomas Sangild; Schaeffter, Tobias; Noe, Karsten Østergaard

    2008-01-01

    We present a fast parallel algorithm to compute the Non-equispaced fast Fourier transform on commodity graphics hardware (the GPU). We focus particularly on a novel implementation of the convolution step in the transform, which was previously its most time consuming part. We describe the performa......We present a fast parallel algorithm to compute the Non-equispaced fast Fourier transform on commodity graphics hardware (the GPU). We focus particularly on a novel implementation of the convolution step in the transform, which was previously its most time consuming part. We describe...

  15. Accelerating the XGBoost algorithm using GPU computing

    Directory of Open Access Journals (Sweden)

    Rory Mitchell

    2017-07-01

    Full Text Available We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4 core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores. We show that it is possible to process the Higgs dataset (10 million instances, 28 features entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.

  16. High-throughput sequence alignment using Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Trapnell Cole

    2007-12-01

    Full Text Available Abstract Background The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.

  17. Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card

    Science.gov (United States)

    Jiang, Jinpeng; Zhu, Peimin

    2018-05-01

    Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for the high-performance computing (HPC). To reduce the long computation time, we design and implement the GPU-based 2D elastic FWI (EFWI) in time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of relatively small global memory on GPU, the boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion increases the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain the accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling, and about 12 times speedup in gradient calculation, compared with the eight-core CPU implementations optimized by OpenMP. The test results from the GPU implementations are verified to have enough accuracy by comparing the results obtained from the CPU implementations.

  18. GPU-accelerated Gibbs ensemble Monte Carlo simulations of Lennard-Jonesium

    Science.gov (United States)

    Mick, Jason; Hailat, Eyad; Russo, Vincent; Rushaidat, Kamel; Schwiebert, Loren; Potoff, Jeffrey

    2013-12-01

    This work describes an implementation of canonical and Gibbs ensemble Monte Carlo simulations on graphics processing units (GPUs). The pair-wise energy calculations, which consume the majority of the computational effort, are parallelized using the energetic decomposition algorithm. While energetic decomposition is relatively inefficient for traditional CPU-bound codes, the algorithm is ideally suited to the architecture of the GPU. The performance of the CPU and GPU codes are assessed for a variety of CPU and GPU combinations for systems containing between 512 and 131,072 particles. For a system of 131,072 particles, the GPU-enabled canonical and Gibbs ensemble codes were 10.3 and 29.1 times faster (GTX 480 GPU vs. i5-2500K CPU), respectively, than an optimized serial CPU-bound code. Due to overhead from memory transfers from system RAM to the GPU, the CPU code was slightly faster than the GPU code for simulations containing less than 600 particles. The critical temperature Tc∗=1.312(2) and density ρc∗=0.316(3) were determined for the tail corrected Lennard-Jones potential from simulations of 10,000 particle systems, and found to be in exact agreement with prior mixed field finite-size scaling calculations [J.J. Potoff, A.Z. Panagiotopoulos, J. Chem. Phys. 109 (1998) 10914].

  19. Ultrafast convolution/superposition using tabulated and exponential kernels on GPU

    Energy Technology Data Exchange (ETDEWEB)

    Chen Quan; Chen Mingli; Lu Weiguo [TomoTherapy Inc., 1240 Deming Way, Madison, Wisconsin 53717 (United States)

    2011-03-15

    Purpose: Collapsed-cone convolution/superposition (CCCS) dose calculation is the workhorse for IMRT dose calculation. The authors present a novel algorithm for computing CCCS dose on the modern graphic processing unit (GPU). Methods: The GPU algorithm includes a novel TERMA calculation that has no write-conflicts and has linear computation complexity. The CCCS algorithm uses either tabulated or exponential cumulative-cumulative kernels (CCKs) as reported in literature. The authors have demonstrated that the use of exponential kernels can reduce the computation complexity by order of a dimension and achieve excellent accuracy. Special attentions are paid to the unique architecture of GPU, especially the memory accessing pattern, which increases performance by more than tenfold. Results: As a result, the tabulated kernel implementation in GPU is two to three times faster than other GPU implementations reported in literature. The implementation of CCCS showed significant speedup on GPU over single core CPU. On tabulated CCK, speedups as high as 70 are observed; on exponential CCK, speedups as high as 90 are observed. Conclusions: Overall, the GPU algorithm using exponential CCK is 1000-3000 times faster over a highly optimized single-threaded CPU implementation using tabulated CCK, while the dose differences are within 0.5% and 0.5 mm. This ultrafast CCCS algorithm will allow many time-sensitive applications to use accurate dose calculation.

  20. Computing treewidth on the GPU

    NARCIS (Netherlands)

    Van Der Zanden, Tom C.; Bodlaender, Hans L.

    2018-01-01

    We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an O∗(2n)-time algorithm that explores the elimination orderings of the graph using a Held-Karp like dynamic

  1. Computing treewidth on the GPU

    NARCIS (Netherlands)

    van der Zanden, T.C.; Bodlaender, Hans L.

    2017-01-01

    We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an $O^*(2^{n})$-time algorithm that explores the elimination orderings of the graph using a Held-Karp like

  2. Optimizing a mobile robot control system using GPU acceleration

    Science.gov (United States)

    Tuck, Nat; McGuinness, Michael; Martin, Fred

    2012-01-01

    This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.

  3. A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system.

    Science.gov (United States)

    Ma, Jiasen; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G

    2014-12-01

    Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. For relatively large and complex three-field head and neck cases, i.e., >100,000 spots with a target volume of ∼ 1000 cm(3) and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to a commercial TPS plans based on DVH comparisons. A MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45,000 dollars. The fast calculation and

  4. A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system

    Energy Technology Data Exchange (ETDEWEB)

    Ma, Jiasen, E-mail: ma.jiasen@mayo.edu; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G. [Department of Radiation Oncology, Division of Medical Physics, Mayo Clinic, 200 First Street Southwest, Rochester, Minnesota 55905 (United States)

    2014-12-15

    Purpose: Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. Methods: An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. Results: For relatively large and complex three-field head and neck cases, i.e., >100 000 spots with a target volume of ∼1000 cm{sup 3} and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to a commercial TPS plans based on DVH comparisons. Conclusions: A MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45

  5. SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems

    Energy Technology Data Exchange (ETDEWEB)

    Xiao, K; Chen, D. Z; Hu, X. S [University of Notre Dame, Notre Dame, IN (United States); Zhou, B [Altera Corp., San Jose, CA (United States)

    2014-06-01

    Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF

  6. FULL GPU Implementation of Lattice-Boltzmann Methods with Immersed Boundary Conditions for Fast Fluid Simulations

    Directory of Open Access Journals (Sweden)

    G Boroni

    2017-03-01

    Full Text Available Lattice Boltzmann Method (LBM has shown great potential in fluid simulations, but performance issues and difficulties to manage complex boundary conditions have hindered a wider application. The upcoming of Graphic Processing Units (GPU Computing offered a possible solution for the performance issue, and methods like the Immersed Boundary (IB algorithm proved to be a flexible solution to boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation in GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data showing good agreement and improving the computational time. Substantial reductions of calculation rates were achieved, lowering down the required time to execute the same model in a CPU to about two magnitude orders.

  7. gWEGA: GPU-accelerated WEGA for molecular superposition and shape comparison.

    Science.gov (United States)

    Yan, Xin; Li, Jiabo; Gu, Qiong; Xu, Jun

    2014-06-05

    Virtual screening of a large chemical library for drug lead identification requires searching/superimposing a large number of three-dimensional (3D) chemical structures. This article reports a graphic processing unit (GPU)-accelerated weighted Gaussian algorithm (gWEGA) that expedites shape or shape-feature similarity score-based virtual screening. With 86 GPU nodes (each node has one GPU card), gWEGA can screen 110 million conformations derived from an entire ZINC drug-like database with diverse antidiabetic agents as query structures within 2 s (i.e., screening more than 55 million conformations per second). The rapid screening speed was accomplished through the massive parallelization on multiple GPU nodes and rapid prescreening of 3D structures (based on their shape descriptors and pharmacophore feature compositions). Copyright © 2014 Wiley Periodicals, Inc.

  8. Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.

    Science.gov (United States)

    Wu, Xin; Koslowski, Axel; Thiel, Walter

    2012-07-10

    In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.

  9. GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.

    Science.gov (United States)

    Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd

    2018-01-01

    In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.

  10. GPU based numerical simulation of core shooting process

    Directory of Open Access Journals (Sweden)

    Yi-zhong Zhang

    2017-11-01

    Full Text Available Core shooting process is the most widely used technique to make sand cores and it plays an important role in the quality of sand cores. Although numerical simulation can hopefully optimize the core shooting process, research on numerical simulation of the core shooting process is very limited. Based on a two-fluid model (TFM and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and achieved good agreements with in-situ experiments. To match the needs of engineering applications, a graphics processing unit (GPU has also been used to improve the calculation efficiency. The parallel algorithm based on the Compute Unified Device Architecture (CUDA platform can significantly decrease computing time by multi-threaded GPU. In this work, the program accelerated by CUDA parallelization method was developed and the accuracy of the calculations was ensured by comparing with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm were discussed. The simulation result of a sand core test-piece indicated the improvement of the calculation efficiency by GPU. The developed program has also been validated by in-situ experiments with a transparent core-box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation result was still quite consistent with experimental data. The GPU parallelization method can successfully solve the problem of low computational efficiency of the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.

  11. LDPC Decoding on GPU for Mobile Device

    Directory of Open Access Journals (Sweden)

    Yiqin Lu

    2016-01-01

    Full Text Available A flexible software LDPC decoder that exploits data parallelism for simultaneous multicode words decoding on the mobile device is proposed in this paper, supported by multithreading on OpenCL based graphics processing units. By dividing the check matrix into several parts to make full use of both the local memory and private memory on GPU and properly modify the code capacity each time, our implementation on a mobile phone shows throughputs above 100 Mbps and delay is less than 1.6 millisecond in decoding, which make high-speed communication like video calling possible. To realize efficient software LDPC decoding on the mobile device, the LDPC decoding feature on communication baseband chip should be replaced to save the cost and make it easier to upgrade decoder to be compatible with a variety of channel access schemes.

  12. GPU-based prompt gamma ray imaging from boron neutron capture therapy

    International Nuclear Information System (INIS)

    Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum

    2015-01-01

    Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations

  13. GPU-based stochastic-gradient optimization for non-rigid medical image registration in time-critical applications

    NARCIS (Netherlands)

    Staring, M.; Al-Ars, Z.; Berendsen, Floris; Angelini, Elsa D.; Landman, Bennett A.

    2018-01-01

    Currently, non-rigid image registration algorithms are too computationally intensive to use in time-critical applications. Existing implementations that focus on speed typically address this by either parallelization on GPU-hardware, or by introducing methodically novel techniques into

  14. Proton Testing of nVidia GTX 1050 GPU

    Science.gov (United States)

    Wyrwas, E. J.

    2017-01-01

    Single-Event Effects (SEE) testing was conducted on the nVidia GTX 1050 Graphics Processor Unit (GPU); herein referred to as device under test (DUT). Testing was conducted at Massachusetts General Hospitals (MGH) Francis H. Burr Proton Therapy Center on April 9th, 2017 using 200-MeV protons. This testing trip was purposed to provide a baseline assessment of the radiation susceptibility of the DUT as no previous testing had been conducted on this component.

  15. A GPU Accelerated Spring Mass System for Surgical Simulation

    DEFF Research Database (Denmark)

    Mosegaard, Jesper; Sørensen, Thomas Sangild

    2005-01-01

    There is a growing demand for surgical simulators to dofast and precise calculations of tissue deformation to simulateincreasingly complex morphology in real-time. Unfortunately, evenfast spring-mass based systems have slow convergence rates for largemodels. This paper presents a method to accele...... to accelerate computation of aspring-mass system in order to simulate a complex organ such as theheart. This acceleration is achieved by taking advantage of moderngraphics processing units (GPU)....

  16. Integrative multicellular biological modeling: a case study of 3D epidermal development using GPU algorithms

    Directory of Open Access Journals (Sweden)

    Christley Scott

    2010-08-01

    Full Text Available Abstract Background Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a

  17. GPU Parallel Bundle Block Adjustment

    Directory of Open Access Journals (Sweden)

    ZHENG Maoteng

    2017-09-01

    Full Text Available To deal with massive data in photogrammetry, we introduce the GPU parallel computing technology. The preconditioned conjugate gradient and inexact Newton method are also applied to decrease the iteration times while solving the normal equation. A brand new workflow of bundle adjustment is developed to utilize GPU parallel computing technology. Our method can avoid the storage and inversion of the big normal matrix, and compute the normal matrix in real time. The proposed method can not only largely decrease the memory requirement of normal matrix, but also largely improve the efficiency of bundle adjustment. It also achieves the same accuracy as the conventional method. Preliminary experiment results show that the bundle adjustment of a dataset with about 4500 images and 9 million image points can be done in only 1.5 minutes while achieving sub-pixel accuracy.

  18. GPU PRO 3 Advanced rendering techniques

    CERN Document Server

    Engel, Wolfgang

    2012-01-01

    GPU Pro3, the third volume in the GPU Pro book series, offers practical tips and techniques for creating real-time graphics that are useful to beginners and seasoned game and graphics programmers alike. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Wessam Bahnassi, and Sebastien St-Laurent have once again brought together a high-quality collection of cutting-edge techniques for advanced GPU programming. With contributions by more than 50 experts, GPU Pro3: Advanced Rendering Techniques covers battle-tested tips and tricks for creating interesting geometry, realistic sha

  19. Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU

    Directory of Open Access Journals (Sweden)

    Arefan D

    2015-06-01

    Full Text Available Digital Breast Tomosynthesis (DBT is a technology that creates three dimensional (3D images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is in the order of seconds, we can use Tomosynthesis systems to perform Tomosynthesis-guided Interventional procedures. This research has been designed to study ultra-fast image reconstruction technique for Tomosynthesis Mammography systems using Graphics Processing Unit (GPU. At first, projections of Tomosynthesis mammography have been simulated. In order to produce Tomosynthesis projections, it has been designed a 3D breast phantom from empirical data. It is based on MRI data in its natural form. Then, projections have been created from 3D breast phantom. The image reconstruction algorithm based on FBP was programmed with C++ language in two methods using central processing unit (CPU card and the Graphics Processing Unit (GPU. It calculated the time of image reconstruction in two kinds of programming (using CPU and GPU.

  20. Exploiting current-generation graphics hardware for synthetic-scene generation

    Science.gov (United States)

    Tanner, Michael A.; Keen, Wayne A.

    2010-04-01

    Increasing seeker frame rate and pixel count, as well as the demand for higher levels of scene fidelity, have driven scene generation software for hardware-in-the-loop (HWIL) and software-in-the-loop (SWIL) testing to higher levels of parallelization. Because modern PC graphics cards provide multiple computational cores (240 shader cores for a current NVIDIA Corporation GeForce and Quadro cards), implementation of phenomenology codes on graphics processing units (GPUs) offers significant potential for simultaneous enhancement of simulation frame rate and fidelity. To take advantage of this potential requires algorithm implementation that is structured to minimize data transfers between the central processing unit (CPU) and the GPU. In this paper, preliminary methodologies developed at the Kinetic Hardware In-The-Loop Simulator (KHILS) will be presented. Included in this paper will be various language tradeoffs between conventional shader programming, Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), including performance trades and possible pathways for future tool development.

  1. First results with the immediate reconstructive strategy for internal hardware exposure in non-united fractures of the distal third of the leg: case series and literature review

    Directory of Open Access Journals (Sweden)

    Vaienti Luca

    2012-08-01

    Full Text Available Abstract Background Fractures of the distal third of the leg are increasingly common and are often handled by open reduction and internal fixation. Exposure and infection of internal hardware could occur, especially after high energy traumas, requiring hardware removal and delayed soft tissue reconstruction. Nevertheless immediate soft tissue reconstruction without internal hardware removal is still possible in selected patients. In this study the effectiveness and the complications of immediate soft tissue reconstruction without internal hardware removal is analyzed. Methods 13 patients, affected by internal hardware exposure in the distal leg, treated with immediate soft tissue reconstruction with pedicled flaps and hardware retention, are retrospectively analyzed, with special regard to flap survival and wound infection. Results Wound infection was observed in 10 cases before surgery and in 5 cases surgical debridement was necessary before reconstruction which was performed in a separate operative session. After reconstruction, wound dehiscence and infection occurred in 5 cases, and in 3 cases removal of internal hardware was necessary in order to achieve the complete healing of dehiscence. In one case the previous flap failed but prompt reconstruction with a sural fasciocutaneous flap was performed without hardware removal and without complications. Pre-operative infection and late reconstructive surgery are predictive for higher rates of post-operative complications (respectively p 0.018 and p 0.028. Conclusion Our approach achieved full recovery in 53.8% of the treated cases after one-step surgery, therefore reducing hospitalization and allowing early mobilization. Controlled trials are needed to confirm the effectiveness of this strategy, although the present case series shows encouraging results.

  2. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    Science.gov (United States)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL

  3. Introduction to Hardware Security

    Directory of Open Access Journals (Sweden)

    Yier Jin

    2015-10-01

    Full Text Available Hardware security has become a hot topic recently with more and more researchers from related research domains joining this area. However, the understanding of hardware security is often mixed with cybersecurity and cryptography, especially cryptographic hardware. For the same reason, the research scope of hardware security has never been clearly defined. To help researchers who have recently joined in this area better understand the challenges and tasks within the hardware security domain and to help both academia and industry investigate countermeasures and solutions to solve hardware security problems, we will introduce the key concepts of hardware security as well as its relations to related research topics in this survey paper. Emerging hardware security topics will also be clearly depicted through which the future trend will be elaborated, making this survey paper a good reference for the continuing research efforts in this area.

  4. A real-time spike sorting method based on the embedded GPU.

    Science.gov (United States)

    Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng

    2017-07-01

    Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. Graphic processing unit (GPU) with high parallel computing capability might provide an alternative solution for increasing real-time computational demands on spike sorting. This study reported a method of real-time spike sorting through computing unified device architecture (CUDA) which was implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on the principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized in the thread memory model of GPU. Our results showed that the GPU-based classifier on TK1 is 37.92 times faster than the MATLAB-based classifier on PC while their accuracies were the same with each other. The high-performance computing features of embedded GPU demonstrated in our studies suggested that the embedded GPU provide a promising platform for the real-time neural signal processing.

  5. Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing

    Science.gov (United States)

    Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.

    2014-12-01

    After the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information is becoming important to understand the earthquake phenomena. On the other hand, to date, the quantity of seismic data becomes enormous as a progress of high accuracy observation network; we need to treat many parameters (e.g., positional information, origin time, magnitude, etc.) to efficiently display the seismic information. Therefore, high-speed processing of data and image information is necessary to handle enormous amounts of seismic data. Recently, GPU (Graphic Processing Unit) is used as an acceleration tool for data processing and calculation in various study fields. This movement is called GPGPU (General Purpose computing on GPUs). In the last few years the performance of GPU keeps on improving rapidly. GPU computing gives us the high-performance computing environment at a lower cost than before. Moreover, use of GPU has an advantage of visualization of processed data, because GPU is originally architecture for graphics processing. In the GPU computing, the processed data is always stored in the video memory. Therefore, we can directly write drawing information to the VRAM on the video card by combining CUDA and the graphics API. In this study, we employ CUDA and OpenGL and/or DirectX to realize full-GPU implementation. This method makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer: It enables the high-speed processing of seismic data. The present study examines the GPU computing-based high-speed visualization and the feasibility for high-speed visualization system of hypocenter data.

  6. GPU Based Software Correlators - Perspectives for VLBI2010

    Science.gov (United States)

    Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun

    2010-01-01

    Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.

  7. GPU accelerated FDTD solver and its application in MRI.

    Science.gov (United States)

    Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S

    2010-01-01

    The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B(1) profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.

  8. High-speed optical coherence tomography signal processing on GPU

    International Nuclear Information System (INIS)

    Li Xiqi; Shi Guohua; Zhang Yudong

    2011-01-01

    The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. This method not only gets a better signal-to noise ratio (SNR) but also gets a faster signal processing time for the SD-OCT than the widely used zero-padding interpolation method. Furthermore, the re-sampled data is obtained by convoluting the acquired data and the coefficients in time domain. Thus, a lot of interpolations can be performed concurrently. So, this interpolation method is suitable for parallel computing. An ultra-high optical coherence tomography signal processing can be realized by using graphics processing unit (GPU) with computer unified device architecture (CUDA). This paper will introduce the signal processing steps of SD-OCT on GPU. An experiment is performed to acquire a frame SD-OCT data (400A-linesx2048 pixel per A-line) and real-time processed the data on GPU. The results show that it can be finished in 6.208 milliseconds, which is 37 times faster than that on Central Processing Unit (CPU).

  9. Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    International Nuclear Information System (INIS)

    Ammendola A, R; Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P

    2014-01-01

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid clusters CPU-GPU dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages upon peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both trasmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth

  10. Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    Science.gov (United States)

    Ammendola A, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

    2014-06-01

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid clusters CPU-GPU dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages upon peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both trasmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.

  11. Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    Energy Technology Data Exchange (ETDEWEB)

    Ammendola A, R [INFN Roma II, Via della Ricerca Scientifica 1 – 00133 Roma (Italy); Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P [INFN Roma I, P.le Aldo Moro 2 – 00185 Roma (Italy)

    2014-06-06

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid clusters CPU-GPU dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages upon peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both trasmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.

  12. High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP

    Directory of Open Access Journals (Sweden)

    Khaled Benkrid

    2012-01-01

    Full Text Available This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs, Graphics Processor Units (GPUs, and IBM’s Cell Broadband Engine (Cell BE, in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools, FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.

  13. Near real-time digital holographic microscope based on GPU parallel computing

    Science.gov (United States)

    Zhu, Gang; Zhao, Zhixiong; Wang, Huarui; Yang, Yan

    2018-01-01

    A transmission near real-time digital holographic microscope with in-line and off-axis light path is presented, in which the parallel computing technology based on compute unified device architecture (CUDA) and digital holographic microscopy are combined. Compared to other holographic microscopes, which have to implement reconstruction in multiple focal planes and are time-consuming the reconstruction speed of the near real-time digital holographic microscope can be greatly improved with the parallel computing technology based on CUDA, so it is especially suitable for measurements of particle field in micrometer and nanometer scale. Simulations and experiments show that the proposed transmission digital holographic microscope can accurately measure and display the velocity of particle field in micrometer scale, and the average velocity error is lower than 10%.With the graphic processing units(GPU), the computing time of the 100 reconstruction planes(512×512 grids) is lower than 120ms, while it is 4.9s using traditional reconstruction method by CPU. The reconstruction speed has been raised by 40 times. In other words, it can handle holograms at 8.3 frames per second and the near real-time measurement and display of particle velocity field are realized. The real-time three-dimensional reconstruction of particle velocity field is expected to achieve by further optimization of software and hardware. Keywords: digital holographic microscope,

  14. GPU Enhancement of the Trigger to Extend Physics Reach at the LHC

    CERN Document Server

    Halyo, V; Jindal, P; LeGresley, P; Lujan, P

    2013-01-01

    Significant new challenges are continuously confronting the High Energy Physics (HEP) experiments, in particular the two detectors at the Large Hadron Collider (LHC) at CERN, where nominal conditions deliver proton-proton collisions to the detectors at a rate of 40 MHz. This rate must be significantly reduced to comply with both the performance limitations of the mass storage hardware and the capabilities of the computing resources to process the collected data in a timely fashion for physics analysis. At the same time, the physics signals of interest must be retained with high efficiency. The quest for rare new physics phenomena at the LHC leads us to evaluate a Graphics Processing Unit (GPU) enhancement of the existing High-Level Trigger (HLT), made possible by the current flexibility of the trigger system, which not only provides faster and more efficient event selection, but also includes the possibility of new complex triggers that were not previously feasible. A new tracking algorithm is evaluated on a ...

  15. Smooth Particle Hydrodynamics GPU-Acceleration Tool for Asteroid Fragmentation Simulation

    Science.gov (United States)

    Buruchenko, Sergey K.; Schäfer, Christoph M.; Maindl, Thomas I.

    2017-10-01

    The impact threat of near-Earth objects (NEOs) is a concern to the global community, as evidenced by the Chelyabinsk event (caused by a 17-m meteorite) in Russia on February 15, 2013 and a near miss by asteroid 2012 DA14 ( 30 m diameter), on the same day. The expected energy, from either a low-altitude air burst or direct impact, would have severe consequences, especially in populated regions. To mitigate this threat one of the methods is employment of large kinetic-energy impactors (KEIs). The simulation of asteroid target fragmentation is a challenging task which demands efficient and accurate numerical methods with large computational power. Modern graphics processing units (GPUs) lead to a major increase 10 times and more in the performance of the computation of astrophysical and high velocity impacts. The paper presents a new implementation of the numerical method smooth particle hydrodynamics (SPH) using NVIDIA-GPU and the first astrophysical and high velocity application of the new code. The code allows for a tremendous increase in speed of astrophysical simulations with SPH and self-gravity at low costs for new hardware. We have implemented the SPH equations to model gas, liquids and elastic, and plastic solid bodies and added a fragmentation model for brittle materials. Self-gravity may be optionally included in the simulations.

  16. Parallel, distributed and GPU computing technologies in single-particle electron microscopy

    International Nuclear Information System (INIS)

    Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

    2009-01-01

    An introduction to the current paradigm shift towards concurrency in software. Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined

  17. Enhancing professionalism at GPU Nuclear

    International Nuclear Information System (INIS)

    Coe, R.P.

    1992-01-01

    Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing professionalism at its Oyster Creek and Three Mile Island nuclear generating stations. The program was also to include its corporate headquarters in Parsippany, New Jersey. The overall program was to take several directions, including on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for senior reactor operators, and expanded teamwork and leadership training for control room crew. The largest portion of this initiative was the development and delivery of professionalism training to the nearly 2,000 people at both nuclear generating sites

  18. Parallel fuzzy connected image segmentation on GPU.

    Science.gov (United States)

    Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

    2011-07-01

    Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's compute unified device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.

  19. Enhancing professionalism at GPU nuclear

    International Nuclear Information System (INIS)

    Coe, R.P.; Landy, F.J.

    1991-01-01

    Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing Professionalism at its Oyster Creek and Three Mile Island Nuclear Generating Stations. The program was also to include its Corporate Headquarters in Parsippany, New Jersey. The overall program was to take several directions which included on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for SROs and expanded teamwork and leadership training for control room crews. The largest portion of this initiative was the development and delivery of professionalism training to the nearly two thousand people at both sites. Three primary philosophies guided the development of the program. Employees as Experts: First, GPU Nuclear employees were considered to be the most valuable source of information for designing a Professionalism program because it is these individuals who are sensitive to the issues encountered in the workplace. Realism: The second philosophy guiding this effort was that the program must be grounded in real life challenges that employees face and must address. Active Learning: The third guiding philosophy was that, in order to have any real impact on the way employees think about professionalism, the program must utilize active rather than passive learning techniques

  20. Risk management at GPU Nuclear

    International Nuclear Information System (INIS)

    Long, R.L.

    1991-01-01

    This paper reports on GPU Nuclear. Among other goals, it established the independence of key safety functions as highlighted by the lessons learned from the accident. In particular, an independent Nuclear Assurance Division was established which include Quality Assurance, Training and Education, Emergency Preparedness, and Nuclear Safety Assessment. The latter consisted of corporate and site independent-safety-review groups. As the GPU Nuclear organization matured, a mid-1987 reorganization created an even more focused Planning and Nuclear Safety Division bringing together Nuclear Safety Assessment with Licensing and Regulatory Affairs and Risk Management. The Risk Management Group (RMG), which began its work in fall 1987, was formed to develop a framework for proactive identification, evaluation, and cost-effective reduction and management of risks of all types. The RMG set out to learn as much as possible about risks and their management in nuclear and other high-technology industries. This began with a thorough literature search. It progressed to interviews with individuals and organizations which have demonstrated innovative ideas, experience, and reputations for safe and reliable operation

  1. Molecular Monte Carlo Simulations Using Graphics Processing Units: To Waste Recycle or Not?

    Science.gov (United States)

    Kim, Jihan; Rodgers, Jocelyn M; Athènes, Manuel; Smit, Berend

    2011-10-11

    In the waste recycling Monte Carlo (WRMC) algorithm, (1) multiple trial states may be simultaneously generated and utilized during Monte Carlo moves to improve the statistical accuracy of the simulations, suggesting that such an algorithm may be well posed for implementation in parallel on graphics processing units (GPUs). In this paper, we implement two waste recycling Monte Carlo algorithms in CUDA (Compute Unified Device Architecture) using uniformly distributed random trial states and trial states based on displacement random-walk steps, and we test the methods on a methane-zeolite MFI framework system to evaluate their utility. We discuss the specific implementation details of the waste recycling GPU algorithm and compare the methods to other parallel algorithms optimized for the framework system. We analyze the relationship between the statistical accuracy of our simulations and the CUDA block size to determine the efficient allocation of the GPU hardware resources. We make comparisons between the GPU and the serial CPU Monte Carlo implementations to assess speedup over conventional microprocessors. Finally, we apply our optimized GPU algorithms to the important problem of determining free energy landscapes, in this case for molecular motion through the zeolite LTA.

  2. Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy

    International Nuclear Information System (INIS)

    Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

    2011-01-01

    We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30–16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9–67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning. (note)

  3. Hardware device binding and mutual authentication

    Science.gov (United States)

    Hamlet, Jason R; Pierson, Lyndon G

    2014-03-04

    Detection and deterrence of device tampering and subversion by substitution may be achieved by including a cryptographic unit within a computing device for binding multiple hardware devices and mutually authenticating the devices. The cryptographic unit includes a physically unclonable function ("PUF") circuit disposed in or on the hardware device, which generates a binding PUF value. The cryptographic unit uses the binding PUF value during an enrollment phase and subsequent authentication phases. During a subsequent authentication phase, the cryptographic unit uses the binding PUF values of the multiple hardware devices to generate a challenge to send to the other device, and to verify a challenge received from the other device to mutually authenticate the hardware devices.

  4. A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.

    Directory of Open Access Journals (Sweden)

    Chun-Liang Lee

    Full Text Available The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.

  5. Open Hardware Business Models

    Directory of Open Access Journals (Sweden)

    Edy Ferreira

    2008-04-01

    Full Text Available In the September issue of the Open Source Business Resource, Patrick McNamara, president of the Open Hardware Foundation, gave a comprehensive introduction to the concept of open hardware, including some insights about the potential benefits for both companies and users. In this article, we present the topic from a different perspective, providing a classification of market offers from companies that are making money with open hardware.

  6. Open Hardware Business Models

    OpenAIRE

    Edy Ferreira

    2008-01-01

    In the September issue of the Open Source Business Resource, Patrick McNamara, president of the Open Hardware Foundation, gave a comprehensive introduction to the concept of open hardware, including some insights about the potential benefits for both companies and users. In this article, we present the topic from a different perspective, providing a classification of market offers from companies that are making money with open hardware.

  7. Cobalt: A GPU-based correlator and beamformer for LOFAR

    Science.gov (United States)

    Broekema, P. Chris; Mol, J. Jan David; Nijboer, R.; van Amesfoort, A. S.; Brentjens, M. A.; Loose, G. Marcel; Klijn, W. F. A.; Romein, J. W.

    2018-04-01

    For low-frequency radio astronomy, software correlation and beamforming on general purpose hardware is a viable alternative to custom designed hardware. LOFAR, a new-generation radio telescope centered in the Netherlands with international stations in Germany, France, Ireland, Poland, Sweden and the UK, has successfully used software real-time processors based on IBM Blue Gene technology since 2004. Since then, developments in technology have allowed us to build a system based on commercial off-the-shelf components that combines the same capabilities with lower operational cost. In this paper, we describe the design and implementation of a GPU-based correlator and beamformer with the same capabilities as the Blue Gene based systems. We focus on the design approach taken, and show the challenges faced in selecting an appropriate system. The design, implementation and verification of the software system show the value of a modern test-driven development approach. Operational experience, based on three years of operations, demonstrates that a general purpose system is a good alternative to the previous supercomputer-based system or custom-designed hardware.

  8. PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

    International Nuclear Information System (INIS)

    Kylasa, S.B.; Aktulga, H.M.; Grama, A.Y.

    2014-01-01

    We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) is used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors

  9. PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Kylasa, S.B., E-mail: skylasa@purdue.edu [Department of Elec. and Comp. Eng., Purdue University, West Lafayette, IN 47907 (United States); Aktulga, H.M., E-mail: hmaktulga@lbl.gov [Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, MS 50F-1650, Berkeley, CA 94720 (United States); Grama, A.Y., E-mail: ayg@cs.purdue.edu [Department of Computer Science, Purdue University, West Lafayette, IN 47907 (United States)

    2014-09-01

    We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) is used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.

  10. Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading

    Energy Technology Data Exchange (ETDEWEB)

    Mishra, Alok [Stony Brook Univ., Stony Brook, NY (United States); Li, Lingda [Brookhaven National Lab. (BNL), Upton, NY (United States); Kong, Martin [Brookhaven National Lab. (BNL), Upton, NY (United States); Finkel, Hal [Argonne National Lab. (ANL), Argonne, IL (United States); Chapman, Barbara [Stony Brook Univ., Stony Brook, NY (United States); Brookhaven National Lab. (BNL), Upton, NY (United States)

    2017-01-01

    Here, the latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, there remain many challenges. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for unified memory space. In such systems, CPU and GPU can access each other's memory transparently, that is, the data movement is managed automatically by the underlying system software and hardware. Memory over subscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform, and how programmers should use it. We have modified several benchmarks codes, in the Rodinia benchmark suite, to study the behavior of OpenMP accelerator extensions and have used them to explore the impact of unified memory in an OpenMP context. We moreover modified the open source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is over subcribed for benchmarks with large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.

  11. The Dynamo package for tomography and subtomogram averaging: components for MATLAB, GPU computing and EC2 Amazon Web Services.

    Science.gov (United States)

    Castaño-Díez, Daniel

    2017-06-01

    Dynamo is a package for the processing of tomographic data. As a tool for subtomogram averaging, it includes different alignment and classification strategies. Furthermore, its data-management module allows experiments to be organized in groups of tomograms, while offering specialized three-dimensional tomographic browsers that facilitate visualization, location of regions of interest, modelling and particle extraction in complex geometries. Here, a technical description of the package is presented, focusing on its diverse strategies for optimizing computing performance. Dynamo is built upon mbtools (middle layer toolbox), a general-purpose MATLAB library for object-oriented scientific programming specifically developed to underpin Dynamo but usable as an independent tool. Its structure intertwines a flexible MATLAB codebase with precompiled C++ functions that carry the burden of numerically intensive operations. The package can be delivered as a precompiled standalone ready for execution without a MATLAB license. Multicore parallelization on a single node is directly inherited from the high-level parallelization engine provided for MATLAB, automatically imparting a balanced workload among the threads in computationally intense tasks such as alignment and classification, but also in logistic-oriented tasks such as tomogram binning and particle extraction. Dynamo supports the use of graphical processing units (GPUs), yielding considerable speedup factors both for native Dynamo procedures (such as the numerically intensive subtomogram alignment) and procedures defined by the user through its MATLAB-based GPU library for three-dimensional operations. Cloud-based virtual computing environments supplied with a pre-installed version of Dynamo can be publicly accessed through the Amazon Elastic Compute Cloud (EC2), enabling users to rent GPU computing time on a pay-as-you-go basis, thus avoiding upfront investments in hardware and longterm software maintenance.

  12. Democratic population decisions result in robust policy-gradient learning: a parametric study with GPU simulations.

    Directory of Open Access Journals (Sweden)

    Paul Richmond

    2011-05-01

    Full Text Available High performance computing on the Graphics Processing Unit (GPU is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism, achieve mediocre learning results at best. In absence of recurrent connections, where all neurons "vote" independently ("democratic" for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.

  13. GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

    International Nuclear Information System (INIS)

    Ammazzalorso, F; Jelen, U; Bednarz, T

    2014-01-01

    We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.

  14. GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

    Science.gov (United States)

    Ammazzalorso, F.; Bednarz, T.; Jelen, U.

    2014-03-01

    We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.

  15. Nanoscale multireference quantum chemistry: full configuration interaction on graphical processing units.

    Science.gov (United States)

    Fales, B Scott; Levine, Benjamin G

    2015-10-13

    Methods based on a full configuration interaction (FCI) expansion in an active space of orbitals are widely used for modeling chemical phenomena such as bond breaking, multiply excited states, and conical intersections in small-to-medium-sized molecules, but these phenomena occur in systems of all sizes. To scale such calculations up to the nanoscale, we have developed an implementation of FCI in which electron repulsion integral transformation and several of the more expensive steps in σ vector formation are performed on graphical processing unit (GPU) hardware. When applied to a 1.7 × 1.4 × 1.4 nm silicon nanoparticle (Si72H64) described with the polarized, all-electron 6-31G** basis set, our implementation can solve for the ground state of the 16-active-electron/16-active-orbital CASCI Hamiltonian (more than 100,000,000 configurations) in 39 min on a single NVidia K40 GPU.

  16. Accelerating the numerical simulation of magnetic field lines in tokamaks using the GPU

    International Nuclear Information System (INIS)

    Kalling, R.C.; Evans, T.E.; Orlov, D.M.; Schissel, D.P.; Maingi, R.; Menard, J.E.; Sabbagh, S.A.

    2011-01-01

    Highlights: → Tokamak magnetic field lines are simulated on a GPU. → Numerical integration of a set of nonlinear differential equations is required. → Using the GPU yields a significant reduction in processing time compared to the CPU. → Computational runs that took days now take hours. → These gains have been accomplished without significant hardware expense. - Abstract: TRIP3D is a field line simulation code that numerically integrates a set of nonlinear magnetic field line differential equations. The code is used to study properties of magnetic islands and stochastic or chaotic field line topologies that are important for designing non-axisymmetric magnetic perturbation coils for controlling plasma instabilities in future machines. The code is very computationally intensive and for large runs can take on the order of days to complete on a traditional single CPU. This work describes how the code was converted from Fortran to C and then restructured to take advantage of GPU computing using NVIDIA's CUDA. The reduction in computing time has been dramatic where runs that previously took days now take hours allowing a scale of problem to be examined that would previously not have been attempted. These gains have been accomplished without significant hardware expense. Performance, correctness, code flexibility, and implementation time have been analyzed to gauge the success and applicability of these methods when compared to the traditional multi-CPU approach.

  17. GASPRNG: GPU accelerated scalable parallel random number generator library

    Science.gov (United States)

    Gao, Shuang; Peterson, Gregory D.

    2013-04-01

    Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or

  18. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

    International Nuclear Information System (INIS)

    Gong Chunye; Liu Jie; Chi Lihua; Huang Haowei; Fang Jingyue; Gong Zhenghu

    2011-01-01

    Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great faculty in solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (S n ) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations are reported with the simulations of vacuum boundary condition. The discussion of the relative advantages and disadvantages of the GPU implementation, the simulation on multi GPUs, the programming effort and code portability are also reported. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

  19. A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU

    Science.gov (United States)

    Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha

    2018-03-01

    Since Graphic Processing Unit (GPU) has a strong ability of floating-point computation and memory bandwidth for data parallelism, it has been widely used in the areas of common computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of compute unified device architecture (CUDA), which reduces the complexity of compiling program, brings the great opportunities to CFD. There are three different modes for parallel solution of NS equations: parallel solver based on CPU, parallel solver based on GPU and heterogeneous parallel solver based on collaborating CPU and GPU. As we can see, GPUs are relatively rich in compute capacity but poor in memory capacity and the CPUs do the opposite. We need to make full use of the GPUs and CPUs, so a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experiment results, which demonstrate that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow, it decreases for turbulent flow, but it still can reach more than 20. What’s more, the speedup increases as the grid size becomes larger.

  20. Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.

    Science.gov (United States)

    Mei, Gang; Xu, Nengxiong; Xu, Liangliang

    2016-01-01

    This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. In AIDW, it needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; and then the desired prediction value of the interpolated point is obtained by weighted interpolating using the power parameter. In this work, we develop a fast kNN search approach based on the space-partitioning data structure, even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolating. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.

  1. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

    Science.gov (United States)

    Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu

    2011-07-01

    Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great faculty in solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates ( Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations are reported with the simulations of vacuum boundary condition. The discussion of the relative advantages and disadvantages of the GPU implementation, the simulation on multi GPUs, the programming effort and code portability are also reported. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

  2. TU-FG-BRB-07: GPU-Based Prompt Gamma Ray Imaging From Boron Neutron Capture Therapy

    Energy Technology Data Exchange (ETDEWEB)

    Kim, S; Suh, T; Yoon, D; Jung, J; Shin, H; Kim, M [The catholic university of Korea, Seoul (Korea, Republic of)

    2016-06-15

    Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusion: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray reconstruction using the GPU computation for BNCT simulations.

  3. Real-Time GPU Implementation of Transverse Oscillation Vector Velocity Flow Imaging

    DEFF Research Database (Denmark)

    Bradway, David; Pihl, Michael Johannes; Krebs, Andreas

    2014-01-01

    Rapid estimation of blood velocity and visualization of complex flow patterns are important for clinical use of diagnostic ultrasound. This paper presents real-time processing for two-dimensional (2-D) vector flow imaging which utilizes an off-the-shelf graphics processing unit (GPU). In this work...... vector flow acquisition takes 2.3 milliseconds seconds on an Advanced Micro Devices Radeon HD 7850 GPU card. The detected velocities are accurate to within the precision limit of the output format of the display routine. Because this tool was developed as a module external to the scanner’s built...

  4. Comparison of image sharpness metrics and real-time sharpening methods with GPU implementations

    CSIR Research Space (South Africa)

    De Villiers, Johan P

    2010-06-01

    Full Text Available , and not in trying to adjust the image to some fixed sharpness value. With the advent of the increased progammability of Graphics Pro- cessing Units (GPU) and their seemingly ever increasing number of processor cores (the dual-GPU NVidia GTX295 has 480 cores...) Quadro MDS 140M 16 400 64 700 ATI HD 2400XT 40 800 64 700 NVidia 9600GT 64 650 256 900 NVidia GTX280 240 602 512 1107 2 Metric descriptions Three metrics are used to evaluate images for sharpness. The first two are a measure of how much information...

  5. Interior Point Methods on GPU with application to Model Predictive Control

    DEFF Research Database (Denmark)

    Gade-Nielsen, Nicolai Fog

    The goal of this thesis is to investigate the application of interior point methods to solve dynamical optimization problems, using a graphical processing unit (GPU) with a focus on problems arising in Model Predictice Control (MPC). Multi-core processors have been available for over ten years now...... software package called GPUOPT, available under the non-restrictive MIT license. GPUOPT includes includes a primal-dual interior-point method, which supports both the CPU and the GPU. It is implemented as multiple components, where the matrix operations and solver for the Newton directions is separated...

  6. High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU

    Directory of Open Access Journals (Sweden)

    STOJANOVIC, N.

    2014-11-01

    Full Text Available In this paper, the high-performance processing of massive geospatial data on many-core GPU (Graphic Processing Unit is presented. We use CUDA (Compute Unified Device Architecture programming framework to implement parallel processing of common Geographic Information Systems (GIS algorithms, such as viewshed analysis and map-matching. Experimental evaluation indicates the improvement in performance with respect to CPU-based solutions and shows feasibility of using GPU and CUDA for parallel implementation of GIS algorithms over large-scale geospatial datasets.

  7. GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies.

    Science.gov (United States)

    Yung, Ling Sing; Yang, Can; Wan, Xiang; Yu, Weichuan

    2011-05-01

    Collecting millions of genetic variations is feasible with the advanced genotyping technology. With a huge amount of genetic variations data in hand, developing efficient algorithms to carry out the gene-gene interaction analysis in a timely manner has become one of the key problems in genome-wide association studies (GWAS). Boolean operation-based screening and testing (BOOST), a recent work in GWAS, completes gene-gene interaction analysis in 2.5 days on a desktop computer. Compared with central processing units (CPUs), graphic processing units (GPUs) are highly parallel hardware and provide massive computing resources. We are, therefore, motivated to use GPUs to further speed up the analysis of gene-gene interactions. We implement the BOOST method based on a GPU framework and name it GBOOST. GBOOST achieves a 40-fold speedup compared with BOOST. It completes the analysis of Wellcome Trust Case Control Consortium Type 2 Diabetes (WTCCC T2D) genome data within 1.34 h on a desktop computer equipped with Nvidia GeForce GTX 285 display card. GBOOST code is available at http://bioinformatics.ust.hk/BOOST.html#GBOOST.

  8. A GPU code for analytic continuation through a sampling method

    Directory of Open Access Journals (Sweden)

    Johan Nordström

    2016-01-01

    Full Text Available We here present a code for performing analytic continuation of fermionic Green’s functions and self-energies as well as bosonic susceptibilities on a graphics processing unit (GPU. The code is based on the sampling method introduced by Mishchenko et al. (2000, and is written for the widely used CUDA platform from NVidia. Detailed scaling tests are presented, for two different GPUs, in order to highlight the advantages of this code with respect to standard CPU computations. Finally, as an example of possible applications, we provide the analytic continuation of model Gaussian functions, as well as more realistic test cases from many-body physics.

  9. GPU-Based Techniques for Global Illumination Effects

    CERN Document Server

    Szirmay-Kalos, László; Sbert, Mateu

    2008-01-01

    This book presents techniques to render photo-realistic images by programming the Graphics Processing Unit (GPU). We discuss effects such as mirror reflections, refractions, caustics, diffuse or glossy indirect illumination, radiosity, single or multiple scattering in participating media, tone reproduction, glow, and depth of field. This book targets game developers, graphics programmers, and also students with some basic understanding of computer graphics algorithms, rendering APIs like Direct3D or OpenGL, and shader programming. In order to make this book self-contained, the most important c

  10. Ramses-GPU: Second order MUSCL-Handcock finite volume fluid solver

    Science.gov (United States)

    Kestener, Pierre

    2017-10-01

    RamsesGPU is a reimplementation of RAMSES (ascl:1011.007) which drops the adaptive mesh refinement (AMR) features to optimize 3D uniform grid algorithms for modern graphics processor units (GPU) to provide an efficient software package for astrophysics applications that do not need AMR features but do require a very large number of integration time steps. RamsesGPU provides an very efficient C++/CUDA/MPI software implementation of a second order MUSCL-Handcock finite volume fluid solver for compressible hydrodynamics as a magnetohydrodynamics solver based on the constraint transport technique. Other useful modules includes static gravity, dissipative terms (viscosity, resistivity), and forcing source term for turbulence studies, and special care was taken to enhance parallel input/output performance by using state-of-the-art libraries such as HDF5 and parallel-netcdf.

  11. GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo

    Energy Technology Data Exchange (ETDEWEB)

    Kim, H; Duchaineau, M; Max, N

    2011-09-21

    We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.

  12. Using GPU to calculate electron dose for hybrid pencil beam model

    International Nuclear Information System (INIS)

    Guo Chengjun; Li Xia; Hou Qing; Wu Zhangwen

    2011-01-01

    Hybrid pencil beam model (HPBM) offers an efficient approach to calculate the three-dimension dose distribution from a clinical electron beam. Still, clinical radiation treatment activity desires faster treatment plan process. Our work presented the fast implementation of HPBM-based electron dose calculation using graphics processing unit (GPU). The HPBM algorithm was implemented in compute unified device architecture running on the GPU, and C running on the CPU, respectively. Several tests with various sizes of the field, beamlet and voxel were used to evaluate our implementation. On an NVIDIA GeForce GTX470 GPU card, we achieved speedup factors of 2.18- 98.23 with acceptable accuracy, compared with the results from a Pentium E5500 2.80 GHz Dual-core CPU. (authors)

  13. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

    Energy Technology Data Exchange (ETDEWEB)

    Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL

    2015-01-01

    The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simu- lations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercom- puter as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.

  14. Compact multimode fiber beam-shaping system based on GPU accelerated digital holography.

    Science.gov (United States)

    Plöschner, Martin; Čižmár, Tomáš

    2015-01-15

    Real-time, on-demand, beam shaping at the end of the multimode fiber has recently been made possible by exploiting the computational power of rapidly evolving graphics processing unit (GPU) technology [Opt. Express 22, 2933 (2014)]. However, the current state-of-the-art system requires the presence of an acousto-optic deflector (AOD) to produce images at the end of the fiber without interference effects between neighboring output points. Here, we present a system free from the AOD complexity where we achieve the removal of the undesired interference effects computationally using GPU implemented Gerchberg-Saxton and Yang-Gu algorithms. The GPU implementation is two orders of magnitude faster than the CPU implementation which allows video-rate image control at the distal end of the fiber virtually free of interference effects.

  15. A Study on GPU-based Iterative ML-EM Reconstruction Algorithm for Emission Computed Tomographic Imaging Systems

    Energy Technology Data Exchange (ETDEWEB)

    Ha, Woo Seok; Kim, Soo Mee; Park, Min Jae; Lee, Dong Soo; Lee, Jae Sung [Seoul National University, Seoul (Korea, Republic of)

    2009-10-15

    The maximum likelihood-expectation maximization (ML-EM) is the statistical reconstruction algorithm derived from probabilistic model of the emission and detection processes. Although the ML-EM has many advantages in accuracy and utility, the use of the ML-EM is limited due to the computational burden of iterating processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on GPU (graphic processing unit) for ML-EM algorithm. Using Geforce 9800 GTX+ graphic card and CUDA (compute unified device architecture) the projection and backprojection in ML-EM algorithm were parallelized by NVIDIA's technology. The time delay on computations for projection, errors between measured and estimated data and backprojection in an iteration were measured. Total time included the latency in data transmission between RAM and GPU memory. The total computation time of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 sec, respectively. In this case, the computing speed was improved about 15 times on GPU. When the number of iterations increased into 1024, the CPU- and GPU-based computing took totally 18 min and 8 sec, respectively. The improvement was about 135 times and was caused by delay on CPU-based computing after certain iterations. On the other hand, the GPU-based computation provided very small variation on time delay per iteration due to use of shared memory. The GPU-based parallel computation for ML-EM improved significantly the computing speed and stability. The developed GPU-based ML-EM algorithm could be easily modified for some other imaging geometries

  16. A Study on GPU-based Iterative ML-EM Reconstruction Algorithm for Emission Computed Tomographic Imaging Systems

    International Nuclear Information System (INIS)

    Ha, Woo Seok; Kim, Soo Mee; Park, Min Jae; Lee, Dong Soo; Lee, Jae Sung

    2009-01-01

    The maximum likelihood-expectation maximization (ML-EM) is the statistical reconstruction algorithm derived from probabilistic model of the emission and detection processes. Although the ML-EM has many advantages in accuracy and utility, the use of the ML-EM is limited due to the computational burden of iterating processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on GPU (graphic processing unit) for ML-EM algorithm. Using Geforce 9800 GTX+ graphic card and CUDA (compute unified device architecture) the projection and backprojection in ML-EM algorithm were parallelized by NVIDIA's technology. The time delay on computations for projection, errors between measured and estimated data and backprojection in an iteration were measured. Total time included the latency in data transmission between RAM and GPU memory. The total computation time of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 sec, respectively. In this case, the computing speed was improved about 15 times on GPU. When the number of iterations increased into 1024, the CPU- and GPU-based computing took totally 18 min and 8 sec, respectively. The improvement was about 135 times and was caused by delay on CPU-based computing after certain iterations. On the other hand, the GPU-based computation provided very small variation on time delay per iteration due to use of shared memory. The GPU-based parallel computation for ML-EM improved significantly the computing speed and stability. The developed GPU-based ML-EM algorithm could be easily modified for some other imaging geometries

  17. GPU implementation of Bayesian neural network construction for data-intensive applications

    International Nuclear Information System (INIS)

    Perry, Michelle; Meyer-Baese, Anke; Prosper, Harrison B

    2014-01-01

    We describe a graphical processing unit (GPU) implementation of the Hybrid Markov Chain Monte Carlo (HMC) method for training Bayesian Neural Networks (BNN). Our implementation uses NVIDIA's parallel computing architecture, CUDA. We briefly review BNNs and the HMC method and we describe our implementations and give preliminary results.

  18. Efficient GPU-based texture interpolation using uniform B-splines

    NARCIS (Netherlands)

    Ruijters, D.; Haar Romenij, ter B.M.; Suetens, P.

    2008-01-01

    This article presents uniform B-spline interpolation, completely contained on the graphics processing unit (GPU). This implies that the CPU does not need to compute any lookup tables or B-spline basis functions. The cubic interpolation can be decomposed into several linear interpolations [Sigg and

  19. BIOLOGICALLY INSPIRED HARDWARE CELL ARCHITECTURE

    DEFF Research Database (Denmark)

    2010-01-01

    Disclosed is a system comprising: - a reconfigurable hardware platform; - a plurality of hardware units defined as cells adapted to be programmed to provide self-organization and self-maintenance of the system by means of implementing a program expressed in a programming language defined as DNA...... language, where each cell is adapted to communicate with one or more other cells in the system, and where the system further comprises a converter program adapted to convert keywords from the DNA language to a binary DNA code; where the self-organisation comprises that the DNA code is transmitted to one...... or more of the cells, and each of the one or more cells is adapted to determine its function in the system; where if a fault occurs in a first cell and the first cell ceases to perform its function, self-maintenance is performed by that the system transmits information to the cells that the first cell has...

  20. A GPU-Accelerated Parameter Interpolation Thermodynamic Integration Free Energy Method.

    Science.gov (United States)

    Giese, Timothy J; York, Darrin M

    2018-03-13

    There has been a resurgence of interest in free energy methods motivated by the performance enhancements offered by molecular dynamics (MD) software written for specialized hardware, such as graphics processing units (GPUs). In this work, we exploit the properties of a parameter-interpolated thermodynamic integration (PI-TI) method to connect states by their molecular mechanical (MM) parameter values. This pathway is shown to be better behaved for Mg 2+ → Ca 2+ transformations than traditional linear alchemical pathways (with and without soft-core potentials). The PI-TI method has the practical advantage that no modification of the MD code is required to propagate the dynamics, and unlike with linear alchemical mixing, only one electrostatic evaluation is needed (e.g., single call to particle-mesh Ewald) leading to better performance. In the case of AMBER, this enables all the performance benefits of GPU-acceleration to be realized, in addition to unlocking the full spectrum of features available within the MD software, such as Hamiltonian replica exchange (HREM). The TI derivative evaluation can be accomplished efficiently in a post-processing step by reanalyzing the statistically independent trajectory frames in parallel for high throughput. We also show how one can evaluate the particle mesh Ewald contribution to the TI derivative evaluation without needing to perform two reciprocal space calculations. We apply the PI-TI method with HREM on GPUs in AMBER to predict p K a values in double stranded RNA molecules and make comparison with experiments. Convergence to under 0.25 units for these systems required 100 ns or more of sampling per window and coupling of windows with HREM. We find that MM charges derived from ab initio QM/MM fragment calculations improve the agreement between calculation and experimental results.

  1. APEnet+: a 3D Torus network optimized for GPU-based HPC Systems

    International Nuclear Information System (INIS)

    Ammendola, R; Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P

    2012-01-01

    In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative — the QUonG project — whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging on commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k€/T F LOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R and D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.

  2. APEnet+: a 3D Torus network optimized for GPU-based HPC Systems

    Energy Technology Data Exchange (ETDEWEB)

    Ammendola, R [INFN Tor Vergata (Italy); Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P [INFN Roma (Italy)

    2012-12-13

    In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative - the QUonG project - whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging on commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k Euro-Sign /T F LOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R and D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.

  3. Graphics Processing Unit-Accelerated Nonrigid Registration of MR Images to CT Images During CT-Guided Percutaneous Liver Tumor Ablations.

    Science.gov (United States)

    Tokuda, Junichi; Plishker, William; Torabi, Meysam; Olubiyi, Olutayo I; Zaki, George; Tatli, Servet; Silverman, Stuart G; Shekher, Raj; Hata, Nobuhiko

    2015-06-01

    Accuracy and speed are essential for the intraprocedural nonrigid magnetic resonance (MR) to computed tomography (CT) image registration in the assessment of tumor margins during CT-guided liver tumor ablations. Although both accuracy and speed can be improved by limiting the registration to a region of interest (ROI), manual contouring of the ROI prolongs the registration process substantially. To achieve accurate and fast registration without the use of an ROI, we combined a nonrigid registration technique on the basis of volume subdivision with hardware acceleration using a graphics processing unit (GPU). We compared the registration accuracy and processing time of GPU-accelerated volume subdivision-based nonrigid registration technique to the conventional nonrigid B-spline registration technique. Fourteen image data sets of preprocedural MR and intraprocedural CT images for percutaneous CT-guided liver tumor ablations were obtained. Each set of images was registered using the GPU-accelerated volume subdivision technique and the B-spline technique. Manual contouring of ROI was used only for the B-spline technique. Registration accuracies (Dice similarity coefficient [DSC] and 95% Hausdorff distance [HD]) and total processing time including contouring of ROIs and computation were compared using a paired Student t test. Accuracies of the GPU-accelerated registrations and B-spline registrations, respectively, were 88.3 ± 3.7% versus 89.3 ± 4.9% (P = .41) for DSC and 13.1 ± 5.2 versus 11.4 ± 6.3 mm (P = .15) for HD. Total processing time of the GPU-accelerated registration and B-spline registration techniques was 88 ± 14 versus 557 ± 116 seconds (P processing time. The GPU-accelerated volume subdivision technique may enable the implementation of nonrigid registration into routine clinical practice. Copyright © 2015 AUR. Published by Elsevier Inc. All rights reserved.

  4. Open Hardware at CERN

    CERN Multimedia

    CERN Knowledge Transfer Group

    2015-01-01

    CERN is actively making its knowledge and technology available for the benefit of society and does so through a variety of different mechanisms. Open hardware has in recent years established itself as a very effective way for CERN to make electronics designs and in particular printed circuit board layouts, accessible to anyone, while also facilitating collaboration and design re-use. It is creating an impact on many levels, from companies producing and selling products based on hardware designed at CERN, to new projects being released under the CERN Open Hardware Licence. Today the open hardware community includes large research institutes, universities, individual enthusiasts and companies. Many of the companies are actively involved in the entire process from design to production, delivering services and consultancy and even making their own products available under open licences.

  5. Hardware description languages

    Science.gov (United States)

    Tucker, Jerry H.

    1994-01-01

    Hardware description languages are special purpose programming languages. They are primarily used to specify the behavior of digital systems and are rapidly replacing traditional digital system design techniques. This is because they allow the designer to concentrate on how the system should operate rather than on implementation details. Hardware description languages allow a digital system to be described with a wide range of abstraction, and they support top down design techniques. A key feature of any hardware description language environment is its ability to simulate the modeled system. The two most important hardware description languages are Verilog and VHDL. Verilog has been the dominant language for the design of application specific integrated circuits (ASIC's). However, VHDL is rapidly gaining in popularity.

  6. Hardware protection through obfuscation

    CERN Document Server

    Bhunia, Swarup; Tehranipoor, Mark

    2017-01-01

    This book introduces readers to various threats faced during design and fabrication by today’s integrated circuits (ICs) and systems. The authors discuss key issues, including illegal manufacturing of ICs or “IC Overproduction,” insertion of malicious circuits, referred as “Hardware Trojans”, which cause in-field chip/system malfunction, and reverse engineering and piracy of hardware intellectual property (IP). The authors provide a timely discussion of these threats, along with techniques for IC protection based on hardware obfuscation, which makes reverse-engineering an IC design infeasible for adversaries and untrusted parties with any reasonable amount of resources. This exhaustive study includes a review of the hardware obfuscation methods developed at each level of abstraction (RTL, gate, and layout) for conventional IC manufacturing, new forms of obfuscation for emerging integration strategies (split manufacturing, 2.5D ICs, and 3D ICs), and on-chip infrastructure needed for secure exchange o...

  7. ZEUS hardware control system

    Science.gov (United States)

    Loveless, R.; Erhard, P.; Ficenec, J.; Gather, K.; Heath, G.; Iacovacci, M.; Kehres, J.; Mobayyen, M.; Notz, D.; Orr, R.; Orr, R.; Sephton, A.; Stroili, R.; Tokushuku, K.; Vogel, W.; Whitmore, J.; Wiggers, L.

    1989-12-01

    The ZEUS collaboration is building a system to monitor, control and document the hardware of the ZEUS detector. This system is based on a network of VAX computers and microprocessors connected via ethernet. The database for the hardware values will be ADAMO tables; the ethernet connection will be DECNET, TCP/IP, or RPC. Most of the documentation will also be kept in ADAMO tables for easy access by users.

  8. ZEUS hardware control system

    International Nuclear Information System (INIS)

    Loveless, R.; Erhard, P.; Ficenec, J.; Gather, K.; Heath, G.; Iacovacci, M.; Kehres, J.; Mobayyen, M.; Notz, D.; Orr, R.; Sephton, A.; Stroili, R.; Tokushuku, K.; Vogel, W.; Whitmore, J.; Wiggers, L.

    1989-01-01

    The ZEUS collaboration is building a system to monitor, control and document the hardware of the ZEUS detector. This system is based on a network of VAX computers and microprocessors connected via ethernet. The database for the hardware values will be ADAMO tables; the ethernet connection will be DECNET, TCP/IP, or RPC. Most of the documentation will also be kept in ADAMO tables for easy access by users. (orig.)

  9. GPU accelerated population annealing algorithm

    Science.gov (United States)

    Barash, Lev Yu.; Weigel, Martin; Borovský, Michal; Janke, Wolfhard; Shchur, Lev N.

    2017-11-01

    steps and multi-histogram reweighting. Additional comments: Code repository at https://github.com/LevBarash/PAising. The system size and size of the population of replicas are limited depending on the memory of the GPU device used. For the default parameter values used in the sample programs, L = 64, θ = 100, β0 = 0, βf = 1, Δβ = 0 . 005, R = 20 000, a typical run time on an NVIDIA Tesla K80 GPU is 151 seconds for the single spin coded (SSC) and 17 seconds for the multi-spin coded (MSC) program (see Section 2 for a description of these parameters).

  10. Accelerating cardiac bidomain simulations using graphics processing units.

    Science.gov (United States)

    Neic, A; Liebmann, M; Hoetzl, E; Mitchell, L; Vigmond, E J; Haase, G; Plank, G

    2012-08-01

    Anatomically realistic and biophysically detailed multiscale computer models of the heart are playing an increasingly important role in advancing our understanding of integrated cardiac function in health and disease. Such detailed simulations, however, are computationally vastly demanding, which is a limiting factor for a wider adoption of in-silico modeling. While current trends in high-performance computing (HPC) hardware promise to alleviate this problem, exploiting the potential of such architectures remains challenging since strongly scalable algorithms are necessitated to reduce execution times. Alternatively, acceleration technologies such as graphics processing units (GPUs) are being considered. While the potential of GPUs has been demonstrated in various applications, benefits in the context of bidomain simulations where large sparse linear systems have to be solved in parallel with advanced numerical techniques are less clear. In this study, the feasibility of multi-GPU bidomain simulations is demonstrated by running strong scalability benchmarks using a state-of-the-art model of rabbit ventricles. The model is spatially discretized using the finite element methods (FEM) on fully unstructured grids. The GPU code is directly derived from a large pre-existing code, the Cardiac Arrhythmia Research Package (CARP), with very minor perturbation of the code base. Overall, bidomain simulations were sped up by a factor of 11.8 to 16.3 in benchmarks running on 6-20 GPUs compared to the same number of CPU cores. To match the fastest GPU simulation which engaged 20 GPUs, 476 CPU cores were required on a national supercomputing facility.

  11. ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    International Nuclear Information System (INIS)

    Su, Lin; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond

    2014-01-01

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHER RT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER RT . Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER RT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHER RT agree well with DOSXYZnrc. For clinical cases, results from ARCHER RT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU

  12. Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study.

    Science.gov (United States)

    Eslami, Taban; Saeed, Fahad

    2018-04-20

    Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique, which has been regularly used for studying brain’s functional activities in the past few years. A very well-used measure for capturing functional associations in brain is Pearson’s correlation coefficient. Pearson’s correlation is widely used for constructing functional network and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivities among brain regions. The fMRI scanners produce huge number of voxels and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time consuming especially when large number of subjects are being studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson’s correlation coefficient. Based on the symmetric property of Pearson’s correlation, this approach returns N ( N − 1 ) / 2 correlation coefficients located at strictly upper triangle part of the correlation matrix. Storing correlations in a one-dimensional array with the order as proposed in this paper is useful for further usage. Our experiments on real and synthetic fMRI data for different number of voxels and varying length of time series show that the proposed approach outperformed state of the art GPU-based techniques as well as the sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than CPU-based version and about 2 to 3 times faster than two other state of the art GPU-based methods.

  13. Real-time autocorrelator for fluorescence correlation spectroscopy based on graphical-processor-unit architecture: method, implementation, and comparative studies

    Science.gov (United States)

    Laracuente, Nicholas; Grossman, Carl

    2013-03-01

    We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are more suited for emulating hardware autocorrelators than traditional CPU-based software applications by emphasizing parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics pci card in a 3.2 GHz Intel i5 CPU based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College

  14. Fully 3-D list-mode positron emission tomography image reconstruction on a multi-GPU cluster

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Jingyu [Stanford Univ., CA (United States). Dept. of Electrical Engineering; Prevrhal, Sven; Shao, Lingxiong [Philips Healthcare, San Jose, CA (United States); Pratx, Guillem [Stanford Univ., CA (United States). Dept. of Radiation Oncology; Levin, Craig S. [Stanford Univ., CA (United States). Dept. of Radiology, Electrical Engineering, and Physics; Stanford Univ., CA (United States). Molecular Imaging Program at Stanford (MIPS); Stanford Univ., CA (United States). School of Medicine

    2011-07-01

    List-mode processing is an efficient way of dealing with the sparse nature of PET data sets, and is the processing method of choice for time-of-flight (ToF) PET. We present a novel method of computing line projection operations required for list-mode ordered subsets expectation maximization (OSEM) for fully 3-D PET image reconstruction on a graphics processing unit (GPU) using the compute unified device architecture (CUDA) framework. Our method overcomes challenges such as compute thread divergence, and exploits GPU capabilities such as shared memory and atomic operations. When applied to line projection operations for list-mode time-of-flight PET, this new GPU-CUDA reformulation is 188X faster than a single-threaded reference CPU implementation. When embedded in a multi-process environment on a GPU-equipped small cluster, a speedup of 4X was observed over the same configuration but without GPU support. Image quality is preserved with root mean squared (RMS) deviation of 0.05% between CPU and GPU-generated images, which has negligible effect in typical clinical applications. (orig.)

  15. A GPU offloading mechanism for LHCb

    International Nuclear Information System (INIS)

    Badalov, Alexey; Cardona, Xavier Vilasis; Perez, Daniel Hugo Campora; Zvyagin, Alexander; Neufeld, Niko

    2014-01-01

    The current computational infrastructure at LHCb is designed for sequential execution. It is possible to make use of modern multi-core machines by using multi-threaded algorithms and running multiple instances in parallel, but there is no way to make efficient use of specialized massively parallel hardware, such as graphical processing units and Intel Xeon/Phi. We extend the current infrastructure with an out-of-process computational server able to gather data from multiple instances and process them in large batches.

  16. Implementation and optimization of ultrasound signal processing algorithms on mobile GPU

    Science.gov (United States)

    Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong

    2014-03-01

    A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, a mobile GPU becomes powerful to deal with 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes the method to implement ultrasound signal processing on a mobile GPU available in the high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the optimization of shader design and load sharing between vertex and fragment shader was performed. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation method of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing PSNR with MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method has no significant difference with the processing using MATLAB (i.e., PSNRe., 11.31). From the mobile GPU implementation, the frame rates of 57.6 Hz were achieved. The total execution time was 17.4 ms that was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.

  17. FastGCN: a GPU accelerated tool for fast gene co-expression networks.

    Directory of Open Access Journals (Sweden)

    Meimei Liang

    Full Text Available Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.

  18. Transform coding for hardware-accelerated volume rendering.

    Science.gov (United States)

    Fout, Nathaniel; Ma, Kwan-Liu

    2007-01-01

    Hardware-accelerated volume rendering using the GPU is now the standard approach for real-time volume rendering, although limited graphics memory can present a problem when rendering large volume data sets. Volumetric compression in which the decompression is coupled to rendering has been shown to be an effective solution to this problem; however, most existing techniques were developed in the context of software volume rendering, and all but the simplest approaches are prohibitive in a real-time hardware-accelerated volume rendering context. In this paper we present a novel block-based transform coding scheme designed specifically with real-time volume rendering in mind, such that the decompression is fast without sacrificing compression quality. This is made possible by consolidating the inverse transform with dequantization in such a way as to allow most of the reprojection to be precomputed. Furthermore, we take advantage of the freedom afforded by off-line compression in order to optimize the encoding as much as possible while hiding this complexity from the decoder. In this context we develop a new block classification scheme which allows us to preserve perceptually important features in the compression. The result of this work is an asymmetric transform coding scheme that allows very large volumes to be compressed and then decompressed in real-time while rendering on the GPU.

  19. Cucheb: A GPU implementation of the filtered Lanczos procedure

    Science.gov (United States)

    Aurentz, Jared L.; Kalantzis, Vassilis; Saad, Yousef

    2017-11-01

    This paper describes the software package Cucheb, a GPU implementation of the filtered Lanczos procedure for the solution of large sparse symmetric eigenvalue problems. The filtered Lanczos procedure uses a carefully chosen polynomial spectral transformation to accelerate convergence of the Lanczos method when computing eigenvalues within a desired interval. This method has proven particularly effective for eigenvalue problems that arise in electronic structure calculations and density functional theory. We compare our implementation against an equivalent CPU implementation and show that using the GPU can reduce the computation time by more than a factor of 10. Program Summary Program title: Cucheb Program Files doi:http://dx.doi.org/10.17632/rjr9tzchmh.1 Licensing provisions: MIT Programming language: CUDA C/C++ Nature of problem: Electronic structure calculations require the computation of all eigenvalue-eigenvector pairs of a symmetric matrix that lie inside a user-defined real interval. Solution method: To compute all the eigenvalues within a given interval a polynomial spectral transformation is constructed that maps the desired eigenvalues of the original matrix to the exterior of the spectrum of the transformed matrix. The Lanczos method is then used to compute the desired eigenvectors of the transformed matrix, which are then used to recover the desired eigenvalues of the original matrix. The bulk of the operations are executed in parallel using a graphics processing unit (GPU). Runtime: Variable, depending on the number of eigenvalues sought and the size and sparsity of the matrix. Additional comments: Cucheb is compatible with CUDA Toolkit v7.0 or greater.

  20. PIConGPU - How to build one of the fastest GPU particle-in-cell codes in the world

    Energy Technology Data Exchange (ETDEWEB)

    Burau, Heiko; Debus, Alexander; Helm, Anton; Huebl, Axel; Kluge, Thomas; Widera, Rene; Bussmann, Michael; Schramm, Ulrich; Cowan, Thomas [HZDR, Dresden (Germany); Juckeland, Guido; Nagel, Wolfgang [TU Dresden (Germany); ZIH, Dresden (Germany); Schmitt, Felix [NVIDIA (United States)

    2013-07-01

    We present the algorithmic building blocks of PIConGPU, one of the fastest implementations of the particle-in-cell algortihm on GPU clusters. PIConGPU is a highly-scalable, 3D3V electromagnetic PIC code that is used in laser plasma and astrophysical plasma simulations.

  1. Hardware Objects for Java

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Thalinger, Christian; Korsholm, Stephan

    2008-01-01

    Java, as a safe and platform independent language, avoids access to low-level I/O devices or direct memory access. In standard Java, low-level I/O it not a concern; it is handled by the operating system. However, in the embedded domain resources are scarce and a Java virtual machine (JVM) without...... an underlying middleware is an attractive architecture. When running the JVM on bare metal, we need access to I/O devices from Java; therefore we investigate a safe and efficient mechanism to represent I/O devices as first class Java objects, where device registers are represented by object fields. Access...... to those registers is safe as Java’s type system regulates it. The access is also fast as it is directly performed by the bytecodes getfield and putfield. Hardware objects thus provide an object-oriented abstraction of low-level hardware devices. As a proof of concept, we have implemented hardware objects...

  2. Length-Bounded Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection

    Directory of Open Access Journals (Sweden)

    Yi-Shan Lin

    2017-01-01

    Full Text Available Since frequent communication between applications takes place in high speed networks, deep packet inspection (DPI plays an important role in the network application awareness. The signature-based network intrusion detection system (NIDS contains a DPI technique that examines the incoming packet payloads by employing a pattern matching algorithm that dominates the overall inspection performance. Existing studies focused on implementing efficient pattern matching algorithms by parallel programming on software platforms because of the advantages of lower cost and higher scalability. Either the central processing unit (CPU or the graphic processing unit (GPU were involved. Our studies focused on designing a pattern matching algorithm based on the cooperation between both CPU and GPU. In this paper, we present an enhanced design for our previous work, a length-bounded hybrid CPU/GPU pattern matching algorithm (LHPMA. In the preliminary experiment, the performance and comparison with the previous work are displayed, and the experimental results show that the LHPMA can achieve not only effective CPU/GPU cooperation but also higher throughput than the previous method.

  3. Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme

    Directory of Open Access Journals (Sweden)

    M. Huang

    2015-09-01

    Full Text Available The planetary boundary layer (PBL is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds, and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF model of which the Yonsei University (YSU scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU-based implementation. This paper presents our efficient GPU-based design on a WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core as two K40 GPUs are applied.

  4. A fast and accurate image reconstruction using GPU for OpenPET prototype

    International Nuclear Information System (INIS)

    Kinouchi, Shoko; Suga, Mikio; Yamaya, Taiga; Yoshida, Eiji

    2010-01-01

    The OpenPET (positron emission tomography), which have a physically opened space between two detector rings, is our new geometry to enable PET imaging during radiation therapy if the real-time imaging system is realized. In this paper, therefore, we developed a list-mode image reconstruction method using general purpose graphic processing units (GPUs). We used the list-mode dynamic row-action maximum likelihood algorithm (DRAMA). For GPU implementation, the efficiency of acceleration depends on the implementation method which is required to avoid conditional statements. We developed a system model in which each element of system matrix is calculated as the value of detector response function (DRF) of the length between the center of a voxel and a line of response (LOR). The system model was suited to GPU implementations that enable us to calculate each element of the system matrix with reduced number of the conditional statements. We applied the developed method to a small OpenPET prototype, which was developed for a proof-of-concept. We measured the micro-Derenzo phantom placed at the gap. The results showed that the same quality of reconstructed images using GPU as using central processing unit (CPU) were achieved, and calculation speed on the GPU was 35.5 times faster than that on the CPU. (author)

  5. GPU-accelerated Kernel Regression Reconstruction for Freehand 3D Ultrasound Imaging.

    Science.gov (United States)

    Wen, Tiexiang; Li, Ling; Zhu, Qingsong; Qin, Wenjian; Gu, Jia; Yang, Feng; Xie, Yaoqin

    2017-07-01

    Volume reconstruction method plays an important role in improving reconstructed volumetric image quality for freehand three-dimensional (3D) ultrasound imaging. By utilizing the capability of programmable graphics processing unit (GPU), we can achieve a real-time incremental volume reconstruction at a speed of 25-50 frames per second (fps). After incremental reconstruction and visualization, hole-filling is performed on GPU to fill remaining empty voxels. However, traditional pixel nearest neighbor-based hole-filling fails to reconstruct volume with high image quality. On the contrary, the kernel regression provides an accurate volume reconstruction method for 3D ultrasound imaging but with the cost of heavy computational complexity. In this paper, a GPU-based fast kernel regression method is proposed for high-quality volume after the incremental reconstruction of freehand ultrasound. The experimental results show that improved image quality for speckle reduction and details preservation can be obtained with the parameter setting of kernel window size of [Formula: see text] and kernel bandwidth of 1.0. The computational performance of the proposed GPU-based method can be over 200 times faster than that on central processing unit (CPU), and the volume with size of 50 million voxels in our experiment can be reconstructed within 10 seconds.

  6. Fast Simulation of Large-Scale Floods Based on GPU Parallel Computing

    Directory of Open Access Journals (Sweden)

    Qiang Liu

    2018-05-01

    Full Text Available Computing speed is a significant issue of large-scale flood simulations for real-time response to disaster prevention and mitigation. Even today, most of the large-scale flood simulations are generally run on supercomputers due to the massive amounts of data and computations necessary. In this work, a two-dimensional shallow water model based on an unstructured Godunov-type finite volume scheme was proposed for flood simulation. To realize a fast simulation of large-scale floods on a personal computer, a Graphics Processing Unit (GPU-based, high-performance computing method using the OpenACC application was adopted to parallelize the shallow water model. An unstructured data management method was presented to control the data transportation between the GPU and CPU (Central Processing Unit with minimum overhead, and then both computation and data were offloaded from the CPU to the GPU, which exploited the computational capability of the GPU as much as possible. The parallel model was validated using various benchmarks and real-world case studies. The results demonstrate that speed-ups of up to one order of magnitude can be achieved in comparison with the serial model. The proposed parallel model provides a fast and reliable tool with which to quickly assess flood hazards in large-scale areas and, thus, has a bright application prospect for dynamic inundation risk identification and disaster assessment.

  7. Computer hardware fault administration

    Science.gov (United States)

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-09-14

    Computer hardware fault administration carried out in a parallel computer, where the parallel computer includes a plurality of compute nodes. The compute nodes are coupled for data communications by at least two independent data communications networks, where each data communications network includes data communications links connected to the compute nodes. Typical embodiments carry out hardware fault administration by identifying a location of a defective link in the first data communications network of the parallel computer and routing communications data around the defective link through the second data communications network of the parallel computer.

  8. Analysis of impact of general-purpose graphics processor units in supersonic flow modeling

    Science.gov (United States)

    Emelyanov, V. N.; Karpenko, A. G.; Kozelkov, A. S.; Teterina, I. V.; Volkov, K. N.; Yalozo, A. V.

    2017-06-01

    Computational methods are widely used in prediction of complex flowfields associated with off-normal situations in aerospace engineering. Modern graphics processing units (GPU) provide architectures and new programming models that enable to harness their large processing power and to design computational fluid dynamics (CFD) simulations at both high performance and low cost. Possibilities of the use of GPUs for the simulation of external and internal flows on unstructured meshes are discussed. The finite volume method is applied to solve three-dimensional unsteady compressible Euler and Navier-Stokes equations on unstructured meshes with high resolution numerical schemes. CUDA technology is used for programming implementation of parallel computational algorithms. Solutions of some benchmark test cases on GPUs are reported, and the results computed are compared with experimental and computational data. Approaches to optimization of the CFD code related to the use of different types of memory are considered. Speedup of solution on GPUs with respect to the solution on central processor unit (CPU) is compared. Performance measurements show that numerical schemes developed achieve 20-50 speedup on GPU hardware compared to CPU reference implementation. The results obtained provide promising perspective for designing a GPU-based software framework for applications in CFD.

  9. PIConGPU - A highly-scalable particle-in-cell implementation for GPU clusters

    Energy Technology Data Exchange (ETDEWEB)

    Bussmann, Michael; Burau, Heiko; Debus, Alexander; Huebl, Axel; Kluge, Thomas; Pausch, Richard; Schmeisser, Nils; Steiniger, Klaus; Widera, Rene; Wyderka, Nikolai; Schramm, Ulrich; Cowan, Thomas [HZDR, Dresden (Germany); Schneider, Benjamin [HZDR, Dresden (Germany); TU Dresden (Germany); Schmitt, Felix [NVIDIA, Austin, TX (United States); Grottel, Sebastian; Gumhold, Stefan [TU Dresden (Germany); Juckeland, Guido; Angel, Wolfgang [TU Dresden (Germany); ZIH, Dresden (Germany)

    2013-07-01

    PIConGPU can handle large-scale simulations of laser plasma and astrophysical plasma dynamics on GPU clusters with thousands of GPUs. High data throughput allows to conduct large parameter surveys but makes it necessary to rethink data analysis and look for new ways of analyzing large simulation data sets. The speedup seen on GPUs enables scientists to add physical effects to their code that up until recently have been too computationally demanding. We present recent results obtained with PIConGPU, discuss scaling behaviour, the most important building blocks of the code and new physics modules recently added. In addition we give an outlook on data analysis, resiliance and load balancing with PIConGPU.

  10. The VMTG Hardware Description

    CERN Document Server

    Puccio, B

    1998-01-01

    The document describes the hardware features of the CERN Master Timing Generator. This board is the common platform for the transmission of General Timing Machine required by the CERN accelerators. In addition, the paper shows the various jumper options to customise the card which is compliant to the VMEbus standard.

  11. The LASS hardware processor

    International Nuclear Information System (INIS)

    Kunz, P.F.

    1976-01-01

    The problems of data analysis with hardware processors are reviewed and a description is given of a programmable processor. This processor, the 168/E, has been designed for use in the LASS multi-processor system; it has an execution speed comparable to the IBM 370/168 and uses the subset of IBM 370 instructions appropriate to the LASS analysis task. (Auth.)

  12. CERN Neutrino Platform Hardware

    CERN Document Server

    Nelson, Kevin

    2017-01-01

    My summer research was broadly in CERN's neutrino platform hardware efforts. This project had two main components: detector assembly and data analysis work for ICARUS. Specifically, I worked on assembly for the ProtoDUNE project and monitored the safety of ICARUS as it was transported to Fermilab by analyzing the accelerometer data from its move.

  13. RRFC hardware operation manual

    International Nuclear Information System (INIS)

    Abhold, M.E.; Hsue, S.T.; Menlove, H.O.; Walton, G.

    1996-05-01

    The Research Reactor Fuel Counter (RRFC) system was developed to assay the 235 U content in spent Material Test Reactor (MTR) type fuel elements underwater in a spent fuel pool. RRFC assays the 235 U content using active neutron coincidence counting and also incorporates an ion chamber for gross gamma-ray measurements. This manual describes RRFC hardware, including detectors, electronics, and performance characteristics

  14. CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis

    Directory of Open Access Journals (Sweden)

    Federico Raimondo

    2012-01-01

    Full Text Available In recent years, Independent Component Analysis (ICA has become a standard to identify relevant dimensions of the data in neuroscience. ICA is a very reliable method to analyze data but it is, computationally, very costly. The use of ICA for online analysis of the data, used in brain computing interfaces, results are almost completely prohibitive. We show an increase with almost no cost (a rapid video card of speed of ICA by about 25 fold. The EEG data, which is a repetition of many independent signals in multiple channels, is very suitable for processing using the vector processors included in the graphical units. We profiled the implementation of this algorithm and detected two main types of operations responsible of the processing bottleneck and taking almost 80% of computing time: vector-matrix and matrix-matrix multiplications. By replacing function calls to basic linear algebra functions to the standard CUBLAS routines provided by GPU manufacturers, it does not increase performance due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, comparing with the original BLAS and CUBLAS versions, obtains a 25x increase of performance for the ICA calculation.

  15. High-throughput GPU-based LDPC decoding

    Science.gov (United States)

    Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin

    2010-08-01

    Low-density parity-check (LDPC) code is a linear block code known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, WI-FI and 10GBASE-T. LDPC for the needs of reliable and flexible communication links for a wide variety of communication standards and configurations have inspired the demand for high-performance and flexibility computing. Accordingly, finding a fast and reconfigurable developing platform for designing the high-throughput LDPC decoder has become important especially for rapidly changing communication standards and configurations. In this paper, a new graphic-processing-unit (GPU) LDPC decoding platform with the asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.

  16. Operator training and requalification at GPU Nuclear

    International Nuclear Information System (INIS)

    Long, R.L.; Barrett, R.J.; Newton, S.L.

    1982-01-01

    The operator training and requalification programs at GPU Nuclear's Oyster Creek (650 MWe BWR) and Three Mile Island-1 (776 MWe PWR) nuclear plants have undergone significant revisions since the Three Mile Island-2 accident. This paper describes the Training and Education organization, the expanded training facilities, including basic principle trainers and replica simulators, and the present operator training and requalification programs

  17. Synthetic Aperture Beamformation using the GPU

    DEFF Research Database (Denmark)

    Hansen, Jens Munk; Schaa, Dana; Jensen, Jørgen Arendt

    2011-01-01

    A synthetic aperture ultrasound beamformer is implemented for a GPU using the OpenCL framework. The implementation supports beamformation of either RF signals or complex baseband signals. Transmit and receive apodization can be either parametric or dynamic using a fixed F-number, a reference...

  18. The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration

    Science.gov (United States)

    Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.

    2017-03-01

    In order to satisfy the research needs of Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and study the key technology of the planned QiTai radio Telescope (QTT), the receiver group of XAO studied the GPU (Graphics Processing Unit) based real-time FRB searching algorithm which developed from the original FRB searching algorithm based on CPU (Central Processing Unit), and built the FRB real-time searching system. The comparison of the GPU system and the CPU system shows that: on the basis of ensuring the accuracy of the search, the speed of the GPU accelerated algorithm is improved by 35-45 times compared with the CPU algorithm.

  19. Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster

    Energy Technology Data Exchange (ETDEWEB)

    Allada, Veerendra, Benjegerdes, Troy; Bode, Brett

    2009-08-31

    Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device have to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as the workhorse subroutine. In this paper, they study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library are studied. The results have been compared with that of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is a Intel Xeon cluster equipped with NVIDIA Tesla GPUs.

  20. Parallel computing in cluster of GPU applied to a problem of nuclear engineering

    International Nuclear Information System (INIS)

    Moraes, Sergio Ricardo S.; Heimlich, Adino; Resende, Pedro

    2013-01-01

    Cluster computing has been widely used as a low cost alternative for parallel processing in scientific applications. With the use of Message-Passing Interface (MPI) protocol development became even more accessible and widespread in the scientific community. A more recent trend is the use of Graphic Processing Unit (GPU), which is a powerful co-processor able to perform hundreds of instructions in parallel, reaching a capacity of hundreds of times the processing of a CPU. However, a standard PC does not allow, in general, more than two GPUs. Hence, it is proposed in this work development and evaluation of a hybrid low cost parallel approach to the solution to a nuclear engineering typical problem. The idea is to use clusters parallelism technology (MPI) together with GPU programming techniques (CUDA - Compute Unified Device Architecture) to simulate neutron transport through a slab using Monte Carlo method. By using a cluster comprised by four quad-core computers with 2 GPU each, it has been developed programs using MPI and CUDA technologies. Experiments, applying different configurations, from 1 to 8 GPUs has been performed and results were compared with the sequential (non-parallel) version. A speed up of about 2.000 times has been observed when comparing the 8-GPU with the sequential version. Results here presented are discussed and analyzed with the objective of outlining gains and possible limitations of the proposed approach. (author)

  1. Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.

    Directory of Open Access Journals (Sweden)

    Cheng-Ying Chou

    Full Text Available Positron emission tomography (PET is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consisted of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus request dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU, NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches ingeniously integrate the distinguished features of the symmetry properties of the imaging system and GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.

  2. Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster

    International Nuclear Information System (INIS)

    Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

    2009-01-01

    Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device have to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as the workhorse subroutine. In this paper, they study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library are studied. The results have been compared with that of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is a Intel Xeon cluster equipped with NVIDIA Tesla GPUs.

  3. CPU and GPU (Cuda Template Matching Comparison

    Directory of Open Access Journals (Sweden)

    Evaldas Borcovas

    2014-05-01

    Full Text Available Image processing, computer vision or other complicated opticalinformation processing algorithms require large resources. It isoften desired to execute algorithms in real time. It is hard tofulfill such requirements with single CPU processor. NVidiaproposed CUDA technology enables programmer to use theGPU resources in the computer. Current research was madewith Intel Pentium Dual-Core T4500 2.3 GHz processor with4 GB RAM DDR3 (CPU I, NVidia GeForce GT320M CUDAcompliable graphics card (GPU I and Intel Core I5-2500K3.3 GHz processor with 4 GB RAM DDR3 (CPU II, NVidiaGeForce GTX 560 CUDA compatible graphic card (GPU II.Additional libraries as OpenCV 2.1 and OpenCV 2.4.0 CUDAcompliable were used for the testing. Main test were made withstandard function MatchTemplate from the OpenCV libraries.The algorithm uses a main image and a template. An influenceof these factors was tested. Main image and template have beenresized and the algorithm computing time and performancein Gtpix/s have been measured. According to the informationobtained from the research GPU computing using the hardwarementioned earlier is till 24 times faster when it is processing abig amount of information. When the images are small the performanceof CPU and GPU are not significantly different. Thechoice of the template size makes influence on calculating withCPU. Difference in the computing time between the GPUs canbe explained by the number of cores which they have.

  4. Performance of Сellular Automata-based Stream Ciphers in GPU Implementation

    Directory of Open Access Journals (Sweden)

    P. G. Klyucharev

    2016-01-01

    Full Text Available Earlier the author had developed methods to build high-performance generalized cellular automata-based symmetric ciphers, which allow obtaining the encryption algorithms that show extremely high performance in hardware implementation. However, their implementation based on the conventional microprocessors lacks high performance. The mere fact is quite common - it shows a scope of applications for these ciphers. Nevertheless, the use of graphic processors enables achieving an appropriate performance for a software implementation.The article is extension of a series of the articles, which study various aspects to construct and implement cryptographic algorithms based on the generalized cellular automata. The article is aimed at studying the capabilities to implement the GPU-based cryptographic algorithms under consideration.Representing a key generator, the implemented encryption algorithm comprises 2k generalized cellular automata. The cellular automata graphs are Ramanujan’s ones. The cells of produced k gamma streams alternate, thereby allowing the GPU capabilities to be better used.To implement was used OpenCL, as the most universal and platform-independent API. The software written in C ++ was designed so that the user could set various parameters, including the encryption key, the graph structure, the local communication function, various constants, etc. To test were used a variety of graphics processors (NVIDIA GTX 650; NVIDIA GTX 770; AMD R9 280X.Depending on operating conditions, and GPU used, a performance range is from 0.47 to 6.61 Gb / s, which is comparable to the performance of the countertypes.Thus, the article has demonstrated that using the GPU makes it is possible to provide efficient software implementation of stream ciphers based on the generalized cellular automata.This work was supported by the RFBR, the project №16-07-00542.

  5. Hardware Accelerated Simulated Radiography

    International Nuclear Information System (INIS)

    Laney, D; Callahan, S; Max, N; Silva, C; Langer, S; Frank, R

    2005-01-01

    We present the application of hardware accelerated volume rendering algorithms to the simulation of radiographs as an aid to scientists designing experiments, validating simulation codes, and understanding experimental data. The techniques presented take advantage of 32 bit floating point texture capabilities to obtain validated solutions to the radiative transport equation for X-rays. An unsorted hexahedron projection algorithm is presented for curvilinear hexahedra that produces simulated radiographs in the absorption-only regime. A sorted tetrahedral projection algorithm is presented that simulates radiographs of emissive materials. We apply the tetrahedral projection algorithm to the simulation of experimental diagnostics for inertial confinement fusion experiments on a laser at the University of Rochester. We show that the hardware accelerated solution is faster than the current technique used by scientists

  6. Sterilization of space hardware.

    Science.gov (United States)

    Pflug, I. J.

    1971-01-01

    Discussion of various techniques of sterilization of space flight hardware using either destructive heating or the action of chemicals. Factors considered in the dry-heat destruction of microorganisms include the effects of microbial water content, temperature, the physicochemical properties of the microorganism and adjacent support, and nature of the surrounding gas atmosphere. Dry-heat destruction rates of microorganisms on the surface, between mated surface areas, or buried in the solid material of space vehicle hardware are reviewed, along with alternative dry-heat sterilization cycles, thermodynamic considerations, and considerations of final sterilization-process design. Discussed sterilization chemicals include ethylene oxide, formaldehyde, methyl bromide, dimethyl sulfoxide, peracetic acid, and beta-propiolactone.

  7. Hardware Acceleration of Sparse Cognitive Algorithms

    Science.gov (United States)

    2016-05-01

    3 Figure 2: Kernel Run Times on an NVidia GTX750 Using the NVidia Profiler .......................... 5 Figure 3: 2D Implementation...Demonstrated Results Notes: (1) Server class GPU, NVidia Tesla K20c. (2) Consumer grade GPU, GeForce GT 640. Factor GPU Baseline SIMD with PiM ASIC...times on an NVidia GTX750 using the NVidia profiler. Figure 2: Kernel Run Times on an NVidia GTX750 Using the NVidia Profiler In this example

  8. Hardware characteristic and application

    International Nuclear Information System (INIS)

    Gu, Dong Hyeon

    1990-03-01

    The contents of this book are system board on memory, performance, system timer system click and specification, coprocessor such as programing interface and hardware interface, power supply on input and output, protection for DC output, Power Good signal, explanation on 84 keyboard and 101/102 keyboard,BIOS system, 80286 instruction set and 80287 coprocessor, characters, keystrokes and colors, communication and compatibility of IBM personal computer on application direction, multitasking and code for distinction of system.

  9. Array abstractions for GPU programming

    DEFF Research Database (Denmark)

    Dybdal, Martin

    The shift towards massively parallel hardware platforms for highperformance computing tasks has introduced a need for improved programming models that facilitate ease of reasoning for both users and compiler optimization. A promising direction is the field of functional data-parallel programming......, for which functional invariants can be utilized by optimizing compilers to perform large program transformations automatically. However, the previous work in this area allow users only limited ability to reason about the performance of algorithms. For this reason, such languages have yet to see wide...... industrial adoption. We present two programming languages that attempt at both supporting industrial applications and providing reasoning tools for hierarchical data-parallel architectures, such as GPUs. First, we present TAIL, an array based intermediate language and compiler framework for compiling a large...

  10. Expert System analysis of non-fuel assembly hardware and spent fuel disassembly hardware: Its generation and recommended disposal

    International Nuclear Information System (INIS)

    Williamson, D.A.

    1991-01-01

    Almost all of the effort being expended on radioactive waste disposal in the United States is being focused on the disposal of spent Nuclear Fuel, with little consideration for other areas that will have to be disposed of in the same facilities. one area of radioactive waste that has not been addressed adequately because it is considered a secondary part of the waste issue is the disposal of the various Non-Fuel Bearing Components of the reactor core. These hardware components fall somewhat arbitrarily into two categories: Non-Fuel Assembly (NFA) hardware and Spent Fuel Disassembly (SFD) hardware. This work provides a detailed examination of the generation and disposal of NFA hardware and SFD hardware by the nuclear utilities of the United States as it relates to the Civilian Radioactive Waste Management Program. All available sources of data on NFA and SFD hardware are analyzed with particular emphasis given to the Characteristics Data Base developed by Oak Ridge National Laboratory and the characterization work performed by Pacific Northwest Laboratories and Rochester Gas ampersand Electric. An Expert System developed as a portion of this work is used to assist in the prediction of quantities of NFA hardware and SFD hardware that will be generated by the United States' utilities. Finally, the hardware waste management practices of the United Kingdom, France, Germany, Sweden, and Japan are studied for possible application to the disposal of domestic hardware wastes. As a result of this work, a general classification scheme for NFA and SFD hardware was developed. Only NFA and SFD hardware constructed of zircaloy and experiencing a burnup of less than 70,000 MWD/MTIHM and PWR control rods constructed of stainless steel are considered Low-Level Waste. All other hardware is classified as Greater-ThanClass-C waste

  11. Fast, multi-channel real-time processing of signals with microsecond latency using graphics processing units

    Energy Technology Data Exchange (ETDEWEB)

    Rath, N., E-mail: Nikolaus@rath.org; Levesque, J. P.; Mauel, M. E.; Navratil, G. A.; Peng, Q. [Department of Applied Physics and Applied Mathematics, Columbia University, 500 W 120th St, New York, New York 10027 (United States); Kato, S. [Department of Information Engineering, Nagoya University, Nagoya (Japan)

    2014-04-15

    Fast, digital signal processing (DSP) has many applications. Typical hardware options for performing DSP are field-programmable gate arrays (FPGAs), application-specific integrated DSP chips, or general purpose personal computer systems. This paper presents a novel DSP platform that has been developed for feedback control on the HBT-EP tokamak device. The system runs all signal processing exclusively on a Graphics Processing Unit (GPU) to achieve real-time performance with latencies below 8 μs. Signals are transferred into and out of the GPU using PCI Express peer-to-peer direct-memory-access transfers without involvement of the central processing unit or host memory. Tests were performed on the feedback control system of the HBT-EP tokamak using forty 16-bit floating point inputs and outputs each and a sampling rate of up to 250 kHz. Signals were digitized by a D-TACQ ACQ196 module, processing done on an NVIDIA GTX 580 GPU programmed in CUDA, and analog output was generated by D-TACQ AO32CPCI modules.

  12. Fast, multi-channel real-time processing of signals with microsecond latency using graphics processing units

    International Nuclear Information System (INIS)

    Rath, N.; Levesque, J. P.; Mauel, M. E.; Navratil, G. A.; Peng, Q.; Kato, S.

    2014-01-01

    Fast, digital signal processing (DSP) has many applications. Typical hardware options for performing DSP are field-programmable gate arrays (FPGAs), application-specific integrated DSP chips, or general purpose personal computer systems. This paper presents a novel DSP platform that has been developed for feedback control on the HBT-EP tokamak device. The system runs all signal processing exclusively on a Graphics Processing Unit (GPU) to achieve real-time performance with latencies below 8 μs. Signals are transferred into and out of the GPU using PCI Express peer-to-peer direct-memory-access transfers without involvement of the central processing unit or host memory. Tests were performed on the feedback control system of the HBT-EP tokamak using forty 16-bit floating point inputs and outputs each and a sampling rate of up to 250 kHz. Signals were digitized by a D-TACQ ACQ196 module, processing done on an NVIDIA GTX 580 GPU programmed in CUDA, and analog output was generated by D-TACQ AO32CPCI modules

  13. A multi-GPU real-time dose simulation software framework for lung radiotherapy.

    Science.gov (United States)

    Santhanam, A P; Min, Y; Neelakkantan, H; Papp, N; Meeks, S L; Kupelian, P A

    2012-09-01

    Medical simulation frameworks facilitate both the preoperative and postoperative analysis of the patient's pathophysical condition. Of particular importance is the simulation of radiation dose delivery for real-time radiotherapy monitoring and retrospective analyses of the patient's treatment. In this paper, a software framework tailored for the development of simulation-based real-time radiation dose monitoring medical applications is discussed. A multi-GPU-based computational framework coupled with inter-process communication methods is introduced for simulating the radiation dose delivery on a deformable 3D volumetric lung model and its real-time visualization. The model deformation and the corresponding dose calculation are allocated among the GPUs in a task-specific manner and is performed in a pipelined manner. Radiation dose calculations are computed on two different GPU hardware architectures. The integration of this computational framework with a front-end software layer and back-end patient database repository is also discussed. Real-time simulation of the dose delivered is achieved at once every 120 ms using the proposed framework. With a linear increase in the number of GPU cores, the computational time of the simulation was linearly decreased. The inter-process communication time also improved with an increase in the hardware memory. Variations in the delivered dose and computational speedup for variations in the data dimensions are investigated using D70 and D90 as well as gEUD as metrics for a set of 14 patients. Computational speed-up increased with an increase in the beam dimensions when compared with a CPU-based commercial software while the error in the dose calculation was lung model-based radiotherapy is an effective tool for performing both real-time and retrospective analyses.

  14. Collision detection of convex polyhedra on the NVIDIA GPU architecture for the discrete element method

    CSIR Research Space (South Africa)

    Govender, Nicolin

    2015-09-01

    Full Text Available consideration due to the architectural differences between CPU and GPU platforms. This paper describes the DEM algorithms and heuristics that are optimized for the parallel NVIDIA Kepler GPU architecture in detail. This includes a GPU optimized collision...

  15. High performance cellular level agent-based simulation with FLAME for the GPU.

    Science.gov (United States)

    Richmond, Paul; Walker, Dawn; Coakley, Simon; Romano, Daniela

    2010-05-01

    Driven by the availability of experimental data and ability to simulate a biological scale which is of immediate interest, the cellular scale is fast emerging as an ideal candidate for middle-out modelling. As with 'bottom-up' simulation approaches, cellular level simulations demand a high degree of computational power, which in large-scale simulations can only be achieved through parallel computing. The flexible large-scale agent modelling environment (FLAME) is a template driven framework for agent-based modelling (ABM) on parallel architectures ideally suited to the simulation of cellular systems. It is available for both high performance computing clusters (www.flame.ac.uk) and GPU hardware (www.flamegpu.com) and uses a formal specification technique that acts as a universal modelling format. This not only creates an abstraction from the underlying hardware architectures, but avoids the steep learning curve associated with programming them. In benchmarking tests and simulations of advanced cellular systems, FLAME GPU has reported massive improvement in performance over more traditional ABM frameworks. This allows the time spent in the development and testing stages of modelling to be drastically reduced and creates the possibility of real-time visualisation for simple visual face-validation.

  16. GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison

    Science.gov (United States)

    Ma, Chao; Wang, Lirong; Xie, Xiang-Qun

    2012-01-01

    Chemical similarity calculation plays an important role in compound library design, virtual screening, and “lead” optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multi-core GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times superior to the existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than existing GPU-accelerated sparse vector algorithm, when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 minutes to complete the calculation of Tanimoto coefficients between 32M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU. PMID:21692447

  17. Heterogeneous Gpu&Cpu Cluster For High Performance Computing In Cryptography

    Directory of Open Access Journals (Sweden)

    Michał Marks

    2012-01-01

    Full Text Available This paper addresses issues associated with distributed computing systems andthe application of mixed GPU&CPU technology to data encryption and decryptionalgorithms. We describe a heterogenous cluster HGCC formed by twotypes of nodes: Intel processor with NVIDIA graphics processing unit and AMDprocessor with AMD graphics processing unit (formerly ATI, and a novel softwareframework that hides the heterogeneity of our cluster and provides toolsfor solving complex scientific and engineering problems. Finally, we present theresults of numerical experiments. The considered case study is concerned withparallel implementations of selected cryptanalysis algorithms. The main goal ofthe paper is to show the wide applicability of the GPU&CPU technology tolarge scale computation and data processing.

  18. High performance MRI simulations of motion on multi-GPU systems.

    Science.gov (United States)

    Xanthis, Christos G; Venetis, Ioannis E; Aletras, Anthony H

    2014-07-04

    MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid spurious echoes formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer multi-GPU configuration. The incorporation

  19. A GPU-based finite-size pencil beam algorithm with 3D-density correction for radiotherapy dose calculation

    International Nuclear Information System (INIS)

    Gu Xuejun; Jia Xun; Jiang, Steve B; Jelen, Urszula; Li Jinsheng

    2011-01-01

    Targeting at the development of an accurate and efficient dose calculation engine for online adaptive radiotherapy, we have implemented a finite-size pencil beam (FSPB) algorithm with a 3D-density correction method on graphics processing unit (GPU). This new GPU-based dose engine is built on our previously published ultrafast FSPB computational framework (Gu et al 2009 Phys. Med. Biol. 54 6287-97). Dosimetric evaluations against Monte Carlo dose calculations are conducted on ten IMRT treatment plans (five head-and-neck cases and five lung cases). For all cases, there is improvement with the 3D-density correction over the conventional FSPB algorithm and for most cases the improvement is significant. Regarding the efficiency, because of the appropriate arrangement of memory access and the usage of GPU intrinsic functions, the dose calculation for an IMRT plan can be accomplished well within 1 s (except for one case) with this new GPU-based FSPB algorithm. Compared to the previous GPU-based FSPB algorithm without 3D-density correction, this new algorithm, though slightly sacrificing the computational efficiency (∼5-15% lower), has significantly improved the dose calculation accuracy, making it more suitable for online IMRT replanning.

  20. Explicit integration with GPU acceleration for large kinetic networks

    International Nuclear Information System (INIS)

    Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike

    2015-01-01

    We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.

  1. GPU-based fast pencil beam algorithm for proton therapy

    International Nuclear Information System (INIS)

    Fujimoto, Rintaro; Nagamine, Yoshihiko; Kurihara, Tsuneya

    2011-01-01

    Performance of a treatment planning system is an essential factor in making sophisticated plans. The dose calculation is a major time-consuming process in planning operations. The standard algorithm for proton dose calculations is the pencil beam algorithm which produces relatively accurate results, but is time consuming. In order to shorten the computational time, we have developed a GPU (graphics processing unit)-based pencil beam algorithm. We have implemented this algorithm and calculated dose distributions in the case of a water phantom. The results were compared to those obtained by a traditional method with respect to the computational time and discrepancy between the two methods. The new algorithm shows 5-20 times faster performance using the NVIDIA GeForce GTX 480 card in comparison with the Intel Core-i7 920 processor. The maximum discrepancy of the dose distribution is within 0.2%. Our results show that GPUs are effective for proton dose calculations.

  2. GPU-computing in econophysics and statistical physics

    Science.gov (United States)

    Preis, T.

    2011-03-01

    A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction into the field of GPU computing and includes examples. In particular computationally expensive analyses employed in financial market context are coded on a graphics card architecture which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.

  3. Acoustic reverse-time migration using GPU card and POSIX thread based on the adaptive optimal finite-difference scheme and the hybrid absorbing boundary condition

    Science.gov (United States)

    Cai, Xiaohui; Liu, Yang; Ren, Zhiming

    2018-06-01

    Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphic processing unit (GPU) was introduced to improve the efficiency of RTM. In the paper, we develop three ameliorative strategies to implement RTM on GPU card. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on central processing unit (CPU), we study the optimal LS-based FD method on GPU. Second, we develop the CPU-based hybrid absorbing boundary condition (ABC) to the GPU-based one by addressing two issues of the former when introduced to GPU card: time-consuming and chaotic threads. Third, for large-scale data, the combinatorial strategy for optimal checkpointing and efficient boundary storage is introduced for the trade-off between memory and recomputation. To save the time of communication between host and disk, the portable operating system interface (POSIX) thread is utilized to create the other CPU core at the checkpoints. Applications of the three strategies on GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.

  4. GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction

    Energy Technology Data Exchange (ETDEWEB)

    Wu, Meng; Fessler, Jeffrey A. [Michigan Univ., Ann Arbor, MI (United States). Dept. of Electrical Engineering and Computer Science

    2011-07-01

    Iterative 3D image reconstruction methods can improve image quality over conventional filtered back projection (FBP) in X-ray computed tomography. However, high computational costs deter the routine use of iterative reconstruction clinically. The separable footprint method for forward and back-projection simplifies the integrals over a detector cell in a way that is quite accurate and also has a relatively efficient CPU implementation. In this project, we implemented the separable footprints method for both forward and backward projection on a graphics processing unit (GPU) with NVDIA's parallel computing architecture (CUDA). This paper describes our GPU kernels for the separable footprint method and simulation results. (orig.)

  5. MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure

    OpenAIRE

    Kim, Jihoon; Levy, Eric; Ferbrache, Alex; Stepanowsky, Petra; Farcas, Claudiu; Wang, Shuang; Brunner, Stefan; Bath, Tyler; Wu, Yuan; Ohno-Machado, Lucila

    2014-01-01

    Summary: MAGI is a web service for fast MicroRNA-Seq data analysis in a graphics processing unit (GPU) infrastructure. Using just a browser, users have access to results as web reports in just a few hours—>600% end-to-end performance improvement over state of the art. MAGI’s salient features are (i) transfer of large input files in native FASTA with Qualities (FASTQ) format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU ...

  6. Haptic Feedback for the GPU-based Surgical Simulator

    DEFF Research Database (Denmark)

    Sørensen, Thomas Sangild; Mosegaard, Jesper

    2006-01-01

    The GPU has proven to be a powerful processor to compute spring-mass based surgical simulations. It has not previously been shown however, how to effectively implement haptic interaction with a simulation running entirely on the GPU. This paper describes a method to calculate haptic feedback...... with limited performance cost. It allows easy balancing of the GPU workload between calculations of simulation, visualisation, and the haptic feedback....

  7. Multi-GPU based acceleration of a list-mode DRAMA toward real-time OpenPET imaging

    Energy Technology Data Exchange (ETDEWEB)

    Kinouchi, Shoko [Chiba Univ. (Japan); National Institute of Radiological Sciences, Chiba (Japan); Yamaya, Taiga; Yoshida, Eiji; Tashima, Hideaki [National Institute of Radiological Sciences, Chiba (Japan); Kudo, Hiroyuki [Tsukuba Univ., Ibaraki (Japan); Suga, Mikio [Chiba Univ. (Japan)

    2011-07-01

    OpenPET, which has a physical gap between two detector rings, is our new PET geometry. In order to realize future radiation therapy guided by OpenPET, real-time imaging is required. Therefore we developed a list-mode image reconstruction method using general purpose graphic processing units (GPUs). For GPU implementation, the efficiency of acceleration depends on the implementation method which is required to avoid conditional statements. Therefore, in our previous study, we developed a new system model which was suited for the GPU implementation. In this paper, we implemented our image reconstruction method using 4 GPUs to get further acceleration. We applied the developed reconstruction method to a small OpenPET prototype. We obtained calculation times of total iteration using 4 GPUs that were 3.4 times faster than using a single GPU. Compared to using a single CPU, we achieved the reconstruction time speed-up of 142 times using 4 GPUs. (orig.)

  8. GPU seeks new funding for TMI cleanup

    International Nuclear Information System (INIS)

    Utroska, D.

    1982-01-01

    General Public Utilities (GPU) wants approval for annual transfer of money from base rate increases in other accounts to pay for the cleanup at Three Mile Island (TMI) until TMI-1 returns to service or the public utility commission takes further action. This proposal confirms fears of a delay in TMI-1 startup and demonstrates that the January negotiated settlement will produce little funding for TMI-2 cleanup. A review of the settlement terms outlines the three-step process for base rate increases and revenue adjustments after the startup of TMI-1, and points out where controversy and delays due to psychological stress make a new source of money essential. GPU thinks customer funding will motivate other parties to a broad-based cost-sharing agreement

  9. Large Scale Simulations of the Euler Equations on GPU Clusters

    KAUST Repository

    Liebmann, Manfred

    2010-08-01

    The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one billion elements. We investigate communication protocols for the GPU cluster to compensate for the slow Gigabit Ethernet network between the GPU compute nodes and to maintain overall efficiency. A diesel engine intake-port and a nozzle, meshed in different resolutions, give good real world examples for the scalability tests on the GPU cluster. © 2010 IEEE.

  10. Solving global optimization problems on GPU cluster

    Energy Technology Data Exchange (ETDEWEB)

    Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya [Lobachevsky State University of Nizhni Novgorod, Gagarin Avenue 23, 603950 Nizhni Novgorod (Russian Federation)

    2016-06-08

    The paper contains the results of investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by means of reducing to data-independent subproblems with smaller dimension solved in parallel. The new element implemented in the research consists in using several graphic accelerators at different computing nodes. The paper also includes results of solving problems of well-known multiextremal test class GKLS on Lobachevsky supercomputer using tens of thousands of GPU cores.

  11. COMPUTER HARDWARE MARKING

    CERN Multimedia

    Groupe de protection des biens

    2000-01-01

    As part of the campaign to protect CERN property and for insurance reasons, all computer hardware belonging to the Organization must be marked with the words 'PROPRIETE CERN'.IT Division has recently introduced a new marking system that is both economical and easy to use. From now on all desktop hardware (PCs, Macintoshes, printers) issued by IT Division with a value equal to or exceeding 500 CHF will be marked using this new system.For equipment that is already installed but not yet marked, including UNIX workstations and X terminals, IT Division's Desktop Support Service offers the following services free of charge:Equipment-marking wherever the Service is called out to perform other work (please submit all work requests to the IT Helpdesk on 78888 or helpdesk@cern.ch; for unavoidable operational reasons, the Desktop Support Service will only respond to marking requests when these coincide with requests for other work such as repairs, system upgrades, etc.);Training of personnel designated by Division Leade...

  12. The ATLAS Trigger Algorithms for General Purpose Graphics Processor Units

    CERN Document Server

    Tavares Delgado, Ademar; The ATLAS collaboration

    2016-01-01

    The ATLAS Trigger Algorithms for General Purpose Graphics Processor Units Type: Talk Abstract: We present the ATLAS Trigger algorithms developed to exploit General­ Purpose Graphics Processor Units. ATLAS is a particle physics experiment located on the LHC collider at CERN. The ATLAS Trigger system has two levels, hardware-­based Level 1 and the High Level Trigger implemented in software running on a farm of commodity CPU. Performing the trigger event selection within the available farm resources presents a significant challenge that will increase future LHC upgrades. are being evaluated as a potential solution for trigger algorithms acceleration. Key factors determining the potential benefit of this new technology are the relative execution speedup, the number of GPUs required and the relative financial cost of the selected GPU. We have developed a trigger demonstrator which includes algorithms for reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Cal...

  13. Online real-time reconstruction of adaptive TSENSE with commodity CPU / GPU hardware

    DEFF Research Database (Denmark)

    Roujol, Sebastien; de Senneville, Baudouin Denis; Vahalla, Erkki

    2009-01-01

    Adaptive temporal sensitivity encoding (TSENSE) has been suggested as a robust parallel imaging method suitable for MR guidance of interventional procedures. However, in practice, the reconstruction of adaptive TSENSE images obtained with large coil arrays leads to long reconstruction times...... image sizes used in interventional imaging (128 × 96, 16 channels, sensitivity encoding (SENSE) factor 2-4), the pipeline is able to reconstruct adaptive TSENSE images with image latencies below 90 ms at frame rates of up to 40 images/s, rendering the MR performance in practice limited...... by the constraints of the MR acquisition. Its performance is demonstrated by the online reconstruction of in vivo MR images for rapid temperature mapping of the kidney and for cardiac catheterization....

  14. Moving-Target Position Estimation Using GPU-Based Particle Filter for IoT Sensing Applications

    Directory of Open Access Journals (Sweden)

    Seongseop Kim

    2017-11-01

    Full Text Available A particle filter (PF has been introduced for effective position estimation of moving targets for non-Gaussian and nonlinear systems. The time difference of arrival (TDOA method using acoustic sensor array has normally been used to for estimation by concealing the location of a moving target, especially underwater. In this paper, we propose a GPU -based acceleration of target position estimation using a PF and propose an efficient system and software architecture. The proposed graphic processing unit (GPU-based algorithm has more advantages in applying PF signal processing to a target system, which consists of large-scale Internet of Things (IoT-driven sensors because of the parallelization which is scalable. For the TDOA measurement from the acoustic sensor array, we use the generalized cross correlation phase transform (GCC-PHAT method to obtain the correlation coefficient of the signal using Fast Fourier Transform (FFT, and we try to accelerate the calculations of GCC-PHAT based TDOA measurements using FFT with GPU compute unified device architecture (CUDA. The proposed approach utilizes a parallelization method in the target position estimation algorithm using GPU-based PF processing. In addition, it could efficiently estimate sudden movement change of the target using GPU-based parallel computing which also can be used for multiple target tracking. It also provides scalability in extending the detection algorithm according to the increase of the number of sensors. Therefore, the proposed architecture can be applied in IoT sensing applications with a large number of sensors. The target estimation algorithm was verified using MATLAB and implemented using GPU CUDA. We implemented the proposed signal processing acceleration system using target GPU to analyze in terms of execution time. The execution time of the algorithm is reduced by 55% from to the CPU standalone operation in target embedded board, NVIDIA Jetson TX1. Also, to apply large

  15. Foundations of hardware IP protection

    CERN Document Server

    Torres, Lionel

    2017-01-01

    This book provides a comprehensive and up-to-date guide to the design of security-hardened, hardware intellectual property (IP). Readers will learn how IP can be threatened, as well as protected, by using means such as hardware obfuscation/camouflaging, watermarking, fingerprinting (PUF), functional locking, remote activation, hidden transmission of data, hardware Trojan detection, protection against hardware Trojan, use of secure element, ultra-lightweight cryptography, and digital rights management. This book serves as a single-source reference to design space exploration of hardware security and IP protection. · Provides readers with a comprehensive overview of hardware intellectual property (IP) security, describing threat models and presenting means of protection, from integrated circuit layout to digital rights management of IP; · Enables readers to transpose techniques fundamental to digital rights management (DRM) to the realm of hardware IP security; · Introduce designers to the concept of salutar...

  16. Open hardware for open science

    CERN Multimedia

    CERN Bulletin

    2011-01-01

    Inspired by the open source software movement, the Open Hardware Repository was created to enable hardware developers to share the results of their R&D activities. The recently published CERN Open Hardware Licence offers the legal framework to support this knowledge and technology exchange.   Two years ago, a group of electronics designers led by Javier Serrano, a CERN engineer, working in experimental physics laboratories created the Open Hardware Repository (OHR). This project was initiated in order to facilitate the exchange of hardware designs across the community in line with the ideals of “open science”. The main objectives include avoiding duplication of effort by sharing results across different teams that might be working on the same need. “For hardware developers, the advantages of open hardware are numerous. For example, it is a great learning tool for technologies some developers would not otherwise master, and it avoids unnecessary work if someone ha...

  17. ARCHER{sub RT} – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    Energy Technology Data Exchange (ETDEWEB)

    Su, Lin; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George, E-mail: xug2@rpi.edu [Nuclear Engineering Program, Rensselaer Polytechnic Institute, Troy, New York 12180 (United States); Yang, Youming; Bednarz, Bryan [Medical Physics, University of Wisconsin, Madison, Wisconsin 53706 (United States); Sterpin, Edmond [Molecular Imaging, Radiotherapy and Oncology, Université catholique de Louvain, Brussels, Belgium 1348 (Belgium)

    2014-07-15

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHER{sub RT} is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER{sub RT}. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER{sub RT} and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHER{sub RT} agree well with DOSXYZnrc. For clinical cases, results from ARCHER{sub RT} are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to

  18. Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems

    Energy Technology Data Exchange (ETDEWEB)

    Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol

    2013-04-09

    A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J.Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing unit (GPU) is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.

  19. Disgruntled employees challenge GPU on TMI-2 polar crane safety, say load test needed

    International Nuclear Information System (INIS)

    Smock, R.

    1983-01-01

    Workers at the Three Mile Island No. 2 unit have gone public with their complaint that General Public Utilities (GPU) Corp. is ignoring safety at the cleanup site. With the exception of a specific concern over an overhead crane inside the containment building, however, the charges are vague. The polar crane will be used to lift the 170 to 180-ton reactor vessel head later this year, but a plant engineer faults the planned test procedure because it calls for lifting 40-ton missile shields from above the reactor before the crane is tested for strength. If the crane fails when lifting the missile shields, the engineer contends, there could be another loss of coolant. GPU rejected a 50-ton test of the crane because it is not required and because the risk is virtually zero. The utility also argues that additional testing will only increase exposure for the workers. 1 figure

  20. Data assimilation using a GPU accelerated path integral Monte Carlo approach

    Science.gov (United States)

    Quinn, John C.; Abarbanel, Henry D. I.

    2011-09-01

    The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov Chain Monte Carlo method designed to run in parallel on a graphics processing unit (GPU). We demonstrate the application of the method to an example with a transmembrane voltage time series of a simulated neuron as an input, and using a Hodgkin-Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 300, compared to an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.

  1. GPU-accelerated Lattice Boltzmann method for anatomical extraction in patient-specific computational hemodynamics

    Science.gov (United States)

    Yu, H.; Wang, Z.; Zhang, C.; Chen, N.; Zhao, Y.; Sawchuk, A. P.; Dalsing, M. C.; Teague, S. D.; Cheng, Y.

    2014-11-01

    Existing research of patient-specific computational hemodynamics (PSCH) heavily relies on software for anatomical extraction of blood arteries. Data reconstruction and mesh generation have to be done using existing commercial software due to the gap between medical image processing and CFD, which increases computation burden and introduces inaccuracy during data transformation thus limits the medical applications of PSCH. We use lattice Boltzmann method (LBM) to solve the level-set equation over an Eulerian distance field and implicitly and dynamically segment the artery surfaces from radiological CT/MRI imaging data. The segments seamlessly feed to the LBM based CFD computation of PSCH thus explicit mesh construction and extra data management are avoided. The LBM is ideally suited for GPU (graphic processing unit)-based parallel computing. The parallel acceleration over GPU achieves excellent performance in PSCH computation. An application study will be presented which segments an aortic artery from a chest CT dataset and models PSCH of the segmented artery.

  2. Hypergraph partitioning implementation for parallelizing matrix-vector multiplication using CUDA GPU-based parallel computing

    Science.gov (United States)

    Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.

    2017-07-01

    Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).

  3. A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

    Science.gov (United States)

    Bard, Christopher; Dorelli, John C.

    2013-01-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approx. = 126 for a sq 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  4. GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU.

    NARCIS (Netherlands)

    Braak, van den G.J.W.; Nugteren, C.; Mesman, B.; Corporaal, H.; Kaklamanis, C.; Papatheodorou, T.; Spirakis, P.G.

    2012-01-01

    Voting algorithms, such as histogram and Hough transforms, are frequently used algorithms in various domains, such as statistics and image processing. Algorithms in these domains may be accelerated using GPUs. Implementing voting algorithms efficiently on a GPU however is far from trivial due to

  5. A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

    Energy Technology Data Exchange (ETDEWEB)

    Liu, T.; Ding, A.; Ji, W.; Xu, X. G. [Nuclear Engineering and Engineering Physics, Rensselaer Polytechnic Inst., Troy, NY 12180 (United States); Carothers, C. D. [Dept. of Computer Science, Rensselaer Polytechnic Inst. RPI (United States); Brown, F. B. [Los Alamos National Laboratory (LANL) (United States)

    2012-07-01

    Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. The method of porting a regular transport code to GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation becomes different for eigenvalue calculation in that it will be performed on a generation-by-generation basis and the thread coordination should be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculation under simple geometries on a multi-GPU system. The specifics of algorithm design, including thread organization and memory management were described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The result showed that a speedup of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system respectively. The speedup factor was further increased by a factor of {approx}2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)

  6. A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

    International Nuclear Information System (INIS)

    Liu, T.; Ding, A.; Ji, W.; Xu, X. G.; Carothers, C. D.; Brown, F. B.

    2012-01-01

    Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. The method of porting a regular transport code to GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation becomes different for eigenvalue calculation in that it will be performed on a generation-by-generation basis and the thread coordination should be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculation under simple geometries on a multi-GPU system. The specifics of algorithm design, including thread organization and memory management were described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The result showed that a speedup of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system respectively. The speedup factor was further increased by a factor of ∼2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)

  7. Hardware device to physical structure binding and authentication

    Science.gov (United States)

    Hamlet, Jason R.; Stein, David J.; Bauer, Todd M.

    2013-08-20

    Detection and deterrence of device tampering and subversion may be achieved by including a cryptographic fingerprint unit within a hardware device for authenticating a binding of the hardware device and a physical structure. The cryptographic fingerprint unit includes an internal physically unclonable function ("PUF") circuit disposed in or on the hardware device, which generate an internal PUF value. Binding logic is coupled to receive the internal PUF value, as well as an external PUF value associated with the physical structure, and generates a binding PUF value, which represents the binding of the hardware device and the physical structure. The cryptographic fingerprint unit also includes a cryptographic unit that uses the binding PUF value to allow a challenger to authenticate the binding.

  8. Hardware Implementation Of Line Clipping A lgorithm By Using FPGA

    Directory of Open Access Journals (Sweden)

    Amar Dawod

    2013-04-01

    Full Text Available The computer graphics system performance is increasing faster than any other computing application. Algorithms for line clipping against convex polygons and lines have been studied for a long time and many research papers have been published so far. In spite of the latest graphical hardware development and significant increase of performance the clipping is still a bottleneck of any graphical system. So its implementation in hardware is essential for real time applications. In this paper clipping operation is discussed and a hardware implementation of the line clipping algorithm is presented and finally formulated and tested using Field Programmable Gate Arrays (FPGA. The designed hardware unit consists of two parts : the first is positional code generator unit and the second is the clipping unit. Finally it is worth mentioning that the  designed unit is capable of clipping (232524 line segments per second.       

  9. GPU-based RFA simulation for minimally invasive cancer treatment of liver tumours.

    Science.gov (United States)

    Mariappan, Panchatcharam; Weir, Phil; Flanagan, Ronan; Voglreiter, Philip; Alhonnoro, Tuomas; Pollari, Mika; Moche, Michael; Busse, Harald; Futterer, Jurgen; Portugaller, Horst Rupert; Sequeiros, Roberto Blanco; Kolesnik, Marina

    2017-01-01

    Radiofrequency ablation (RFA) is one of the most popular and well-standardized minimally invasive cancer treatments (MICT) for liver tumours, employed where surgical resection has been contraindicated. Less-experienced interventional radiologists (IRs) require an appropriate planning tool for the treatment to help avoid incomplete treatment and so reduce the tumour recurrence risk. Although a few tools are available to predict the ablation lesion geometry, the process is computationally expensive. Also, in our implementation, a few patient-specific parameters are used to improve the accuracy of the lesion prediction. Advanced heterogeneous computing using personal computers, incorporating the graphics processing unit (GPU) and the central processing unit (CPU), is proposed to predict the ablation lesion geometry. The most recent GPU technology is used to accelerate the finite element approximation of Penne's bioheat equation and a three state cell model. Patient-specific input parameters are used in the bioheat model to improve accuracy of the predicted lesion. A fast GPU-based RFA solver is developed to predict the lesion by doing most of the computational tasks in the GPU, while reserving the CPU for concurrent tasks such as lesion extraction based on the heat deposition at each finite element node. The solver takes less than 3 min for a treatment duration of 26 min. When the model receives patient-specific input parameters, the deviation between real and predicted lesion is below 3 mm. A multi-centre retrospective study indicates that the fast RFA solver is capable of providing the IR with the predicted lesion in the short time period before the intervention begins when the patient has been clinically prepared for the treatment.

  10. Sop-GPU: accelerating biomolecular simulations in the centisecond timescale using graphics processors.

    Science.gov (United States)

    Zhmurov, A; Dima, R I; Kholodov, Y; Barsegov, V

    2010-11-01

    Theoretical exploration of fundamental biological processes involving the forced unraveling of multimeric proteins, the sliding motion in protein fibers and the mechanical deformation of biomolecular assemblies under physiological force loads is challenging even for distributed computing systems. Using a C(α)-based coarse-grained self organized polymer (SOP) model, we implemented the Langevin simulations of proteins on graphics processing units (SOP-GPU program). We assessed the computational performance of an end-to-end application of the program, where all the steps of the algorithm are running on a GPU, by profiling the simulation time and memory usage for a number of test systems. The ∼90-fold computational speedup on a GPU, compared with an optimized central processing unit program, enabled us to follow the dynamics in the centisecond timescale, and to obtain the force-extension profiles using experimental pulling speeds (v(f) = 1-10 μm/s) employed in atomic force microscopy and in optical tweezers-based dynamic force spectroscopy. We found that the mechanical molecular response critically depends on the conditions of force application and that the kinetics and pathways for unfolding change drastically even upon a modest 10-fold increase in v(f). This implies that, to resolve accurately the free energy landscape and to relate the results of single-molecule experiments in vitro and in silico, molecular simulations should be carried out under the experimentally relevant force loads. This can be accomplished in reasonable wall-clock time for biomolecules of size as large as 10(5) residues using the SOP-GPU package. © 2010 Wiley-Liss, Inc.

  11. GPU Pro advanced rendering techniques

    CERN Document Server

    Engel, Wolfgang

    2010-01-01

    This book covers essential tools and techniques for programming the graphics processing unit. Brought to you by Wolfgang Engel and the same team of editors who made the ShaderX series a success, this volume covers advanced rendering techniques, engine design, GPGPU techniques, related mathematical techniques, and game postmortems. A special emphasis is placed on handheld programming to account for the increased importance of graphics on mobile devices, especially the iPhone and iPod touch.Example programs and source code can be downloaded from the book's CRC Press web page. 

  12. Hardware Support for Embedded Java

    DEFF Research Database (Denmark)

    Schoeberl, Martin

    2012-01-01

    The general Java runtime environment is resource hungry and unfriendly for real-time systems. To reduce the resource consumption of Java in embedded systems, direct hardware support of the language is a valuable option. Furthermore, an implementation of the Java virtual machine in hardware enables...... worst-case execution time analysis of Java programs. This chapter gives an overview of current approaches to hardware support for embedded and real-time Java....

  13. HARDWARE TROJAN IDENTIFICATION AND DETECTION

    OpenAIRE

    Samer Moein; Fayez Gebali; T. Aaron Gulliver; Abdulrahman Alkandari

    2017-01-01

    ABSTRACT The majority of techniques developed to detect hardware trojans are based on specific attributes. Further, the ad hoc approaches employed to design methods for trojan detection are largely ineffective. Hardware trojans have a number of attributes which can be used to systematically develop detection techniques. Based on this concept, a detailed examination of current trojan detection techniques and the characteristics of existing hardware trojans is presented. This is used to dev...

  14. Hardware assisted hypervisor introspection.

    Science.gov (United States)

    Shi, Jiangyong; Yang, Yuexiang; Tang, Chuan

    2016-01-01

    In this paper, we introduce hypervisor introspection, an out-of-box way to monitor the execution of hypervisors. Similar to virtual machine introspection which has been proposed to protect virtual machines in an out-of-box way over the past decade, hypervisor introspection can be used to protect hypervisors which are the basis of cloud security. Virtual machine introspection tools are usually deployed either in hypervisor or in privileged virtual machines, which might also be compromised. By utilizing hardware support including nested virtualization, EPT protection and #BP, we are able to monitor all hypercalls belongs to the virtual machines of one hypervisor, include that of privileged virtual machine and even when the hypervisor is compromised. What's more, hypercall injection method is used to simulate hypercall-based attacks and evaluate the performance of our method. Experiment results show that our method can effectively detect hypercall-based attacks with some performance cost. Lastly, we discuss our furture approaches of reducing the performance cost and preventing the compromised hypervisor from detecting the existence of our introspector, in addition with some new scenarios to apply our hypervisor introspection system.

  15. LHCb: Hardware Data Injector

    CERN Multimedia

    Delord, V; Neufeld, N

    2009-01-01

    The LHCb High Level Trigger and Data Acquisition system selects about 2 kHz of events out of the 1 MHz of events, which have been selected previously by the first-level hardware trigger. The selected events are consolidated into files and then sent to permanent storage for subsequent analysis on the Grid. The goal of the upgrade of the LHCb readout is to lift the limitation to 1 MHz. This means speeding up the DAQ to 40 MHz. Such a DAQ system will certainly employ 10 Gigabit or technologies and might also need new networking protocols: a customized TCP or proprietary solutions. A test module is being presented, which integrates in the existing LHCb infrastructure. It is a 10-Gigabit traffic generator, flexible enough to generate LHCb's raw data packets using dummy data or simulated data. These data are seen as real data coming from sub-detectors by the DAQ. The implementation is based on an FPGA using 10 Gigabit Ethernet interface. This module is integrated in the experiment control system. The architecture, ...

  16. GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF

    Science.gov (United States)

    Mielikainen, J.; Huang, B.; Huang, A.

    2011-12-01

    The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size fall speeds, distributions and densities. Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language offering programming GPU's directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of CPU. CUDA takes a bottom-up point of view of parallelism is which thread is an atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has

  17. Hardware for soft computing and soft computing for hardware

    CERN Document Server

    Nedjah, Nadia

    2014-01-01

    Single and Multi-Objective Evolutionary Computation (MOEA),  Genetic Algorithms (GAs), Artificial Neural Networks (ANNs), Fuzzy Controllers (FCs), Particle Swarm Optimization (PSO) and Ant colony Optimization (ACO) are becoming omnipresent in almost every intelligent system design. Unfortunately, the application of the majority of these techniques is complex and so requires a huge computational effort to yield useful and practical results. Therefore, dedicated hardware for evolutionary, neural and fuzzy computation is a key issue for designers. With the spread of reconfigurable hardware such as FPGAs, digital as well as analog hardware implementations of such computation become cost-effective. The idea behind this book is to offer a variety of hardware designs for soft computing techniques that can be embedded in any final product. Also, to introduce the successful application of soft computing technique to solve many hard problem encountered during the design of embedded hardware designs. Reconfigurable em...

  18. FY1995 evolvable hardware chip; 1995 nendo shinkasuru hardware chip

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1997-03-01

    This project aims at the development of 'Evolvable Hardware' (EHW) which can adapt its hardware structure to the environment to attain better hardware performance, under the control of genetic algorithms. EHW is a key technology to explore the new application area requiring real-time performance and on-line adaptation. 1. Development of EHW-LSI for function level hardware evolution, which includes 15 DSPs in one chip. 2. Application of the EHW to the practical industrial applications such as data compression, ATM control, digital mobile communication. 3. Two patents : (1) the architecture and the processing method for programmable EHW-LSI. (2) The method of data compression for loss-less data, using EHW. 4. The first international conference for evolvable hardware was held by authors: Intl. Conf. on Evolvable Systems (ICES96). It was determined at ICES96 that ICES will be held every two years between Japan and Europe. So the new society has been established by us. (NEDO)

  19. Better Faster Noise with the GPU

    DEFF Research Database (Denmark)

    Wyvill, Geoff; Frisvad, Jeppe Revall

    Filtered noise [Perlin 1985] has, for twenty years, been a fundamental tool for creating functional texture and it has many other applications; for example, animating water waves or the motion of grass waving in the wind. Perlin noise suffers from a number of defects and there have been many atte...... attempts to create better or faster noise but Perlin’s ‘Gradient Noise’ has consistently proved to be the best compromise between speed and quality. Our objective was to create a better noise cheaply by use of the GPU....

  20. Strategies for regular segmented reductions on GPU

    DEFF Research Database (Denmark)

    Larsen, Rasmus Wriedt; Henriksen, Troels

    2017-01-01

    We present and evaluate an implementation technique for regular segmented reductions on GPUs. Existing techniques tend to be either consistent in performance but relatively inefficient in absolute terms, or optimised for specific workloads and thereby exhibiting bad performance for certain input...... is in the context of the Futhark compiler, the implementation technique is applicable to any library or language that has a need for segmented reductions. We evaluate the technique on four microbenchmarks, two of which we also compare to implementations in the CUB library for GPU programming, as well as on two...

  1. Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

    International Nuclear Information System (INIS)

    Levine, Benjamin G.; Stone, John E.; Kohlmeyer, Axel

    2011-01-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.

  2. Energy- and cost-efficient lattice-QCD computations using graphics processing units

    Energy Technology Data Exchange (ETDEWEB)

    Bach, Matthias

    2014-07-01

    Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large scale experiments like the Large Hadron Collider (LHC) at CERN and in the future at the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD for high energies. Studies from first principles are possible via a discretization onto an Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus - despite the complexity of LQCD applications - it led to the development of several specialized compute platforms and influenced the development of others. However, in recent years General-Purpose computation on Graphics Processing Units (GPGPU) came up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a massmarket product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D kernel for a single GPU, achieving 120GFLOPS. D - the most compute intensive kernel in LQCD simulations - is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system. LQCD

  3. Energy- and cost-efficient lattice-QCD computations using graphics processing units

    International Nuclear Information System (INIS)

    Bach, Matthias

    2014-01-01

    Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large scale experiments like the Large Hadron Collider (LHC) at CERN and in the future at the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD for high energies. Studies from first principles are possible via a discretization onto an Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus - despite the complexity of LQCD applications - it led to the development of several specialized compute platforms and influenced the development of others. However, in recent years General-Purpose computation on Graphics Processing Units (GPGPU) came up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a massmarket product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D kernel for a single GPU, achieving 120GFLOPS. D - the most compute intensive kernel in LQCD simulations - is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system. LQCD

  4. Secure coupling of hardware components

    NARCIS (Netherlands)

    Hoepman, J.H.; Joosten, H.J.M.; Knobbe, J.W.

    2011-01-01

    A method and a system for securing communication between at least a first and a second hardware components of a mobile device is described. The method includes establishing a first shared secret between the first and the second hardware components during an initialization of the mobile device and,

  5. Computer hardware for radiologists: Part I

    Directory of Open Access Journals (Sweden)

    Indrajit I

    2010-01-01

    Full Text Available Computers are an integral part of modern radiology practice. They are used in different radiology modalities to acquire, process, and postprocess imaging data. They have had a dramatic influence on contemporary radiology practice. Their impact has extended further with the emergence of Digital Imaging and Communications in Medicine (DICOM, Picture Archiving and Communication System (PACS, Radiology information system (RIS technology, and Teleradiology. A basic overview of computer hardware relevant to radiology practice is presented here. The key hardware components in a computer are the motherboard, central processor unit (CPU, the chipset, the random access memory (RAM, the memory modules, bus, storage drives, and ports. The personnel computer (PC has a rectangular case that contains important components called hardware, many of which are integrated circuits (ICs. The fiberglass motherboard is the main printed circuit board and has a variety of important hardware mounted on it, which are connected by electrical pathways called "buses". The CPU is the largest IC on the motherboard and contains millions of transistors. Its principal function is to execute "programs". A Pentium® 4 CPU has transistors that execute a billion instructions per second. The chipset is completely different from the CPU in design and function; it controls data and interaction of buses between the motherboard and the CPU. Memory (RAM is fundamentally semiconductor chips storing data and instructions for access by a CPU. RAM is classified by storage capacity, access speed, data rate, and configuration.

  6. Computer hardware for radiologists: Part I

    International Nuclear Information System (INIS)

    Indrajit, IK; Alam, A

    2010-01-01

    Computers are an integral part of modern radiology practice. They are used in different radiology modalities to acquire, process, and postprocess imaging data. They have had a dramatic influence on contemporary radiology practice. Their impact has extended further with the emergence of Digital Imaging and Communications in Medicine (DICOM), Picture Archiving and Communication System (PACS), Radiology information system (RIS) technology, and Teleradiology. A basic overview of computer hardware relevant to radiology practice is presented here. The key hardware components in a computer are the motherboard, central processor unit (CPU), the chipset, the random access memory (RAM), the memory modules, bus, storage drives, and ports. The personnel computer (PC) has a rectangular case that contains important components called hardware, many of which are integrated circuits (ICs). The fiberglass motherboard is the main printed circuit board and has a variety of important hardware mounted on it, which are connected by electrical pathways called “buses”. The CPU is the largest IC on the motherboard and contains millions of transistors. Its principal function is to execute “programs”. A Pentium ® 4 CPU has transistors that execute a billion instructions per second. The chipset is completely different from the CPU in design and function; it controls data and interaction of buses between the motherboard and the CPU. Memory (RAM) is fundamentally semiconductor chips storing data and instructions for access by a CPU. RAM is classified by storage capacity, access speed, data rate, and configuration

  7. AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics

    Science.gov (United States)

    Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.

    2017-05-01

    We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.

  8. GPU-accelerated ray-tracing for real-time treatment planning

    International Nuclear Information System (INIS)

    Heinrich, H; Ziegenhein, P; Kamerling, C P; Oelfke, U; Froening, H

    2014-01-01

    Dose calculation methods in radiotherapy treatment planning require the radiological depth information of the voxels that represent the patient volume to correct for tissue inhomogeneities. This information is acquired by time consuming ray-tracing-based calculations. For treatment planning scenarios with changing geometries and real-time constraints this is a severe bottleneck. We implemented an algorithm for the graphics processing unit (GPU) which implements a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different strategies of accessing memory in kernel implementations as well as strategies for rapid data transfers between main memory and memory of the graphics device. Our study included the overlapping of computations and memory transfers to reduce the overall runtime using Hyper-Q. We tested our approach on a prostate case (9 beams, coplanar). The measured execution times for a complete ray-tracing range from 28 msec for the computations on the GPU to 99 msec when considering data transfers to and from the graphics device. Our GPU-based algorithm performed the ray-tracing in real-time. The strategies efficiently reduce the time consumption of memory accesses and data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine.

  9. GPU accelerated flow solver for direct numerical simulation of turbulent flows

    Energy Technology Data Exchange (ETDEWEB)

    Salvadore, Francesco [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy); Bernardini, Matteo, E-mail: matteo.bernardini@uniroma1.it [Department of Mechanical and Aerospace Engineering, University of Rome ‘La Sapienza’ – via Eudossiana 18, 00184 Rome (Italy); Botti, Michela [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy)

    2013-02-15

    Graphical processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non trivial efforts, even to high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier–Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that a careful optimization work is crucial to get the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.

  10. A cache-friendly sampling strategy for texture-based volume rendering on GPU

    Directory of Open Access Journals (Sweden)

    Junpeng Wang

    2017-06-01

    Full Text Available The texture-based volume rendering is a memory-intensive algorithm. Its performance relies heavily on the performance of the texture cache. However, most existing texture-based volume rendering methods blindly map computational resources to texture memory and result in incoherent memory access patterns, causing low cache hit rates in certain cases. The distance between samples taken by threads of an atomic scheduling unit (e.g. a warp of 32 threads in CUDA of the GPU is a crucial factor that affects the texture cache performance. Based on this fact, we present a new sampling strategy, called Warp Marching, for the ray-casting algorithm of texture-based volume rendering. The effects of different sample organizations and different thread-pixel mappings in the ray-casting algorithm are thoroughly analyzed. Also, a pipeline manner color blending approach is introduced and the power of warp-level GPU operations is leveraged to improve the efficiency of parallel executions on the GPU. In addition, the rendering performance of the Warp Marching is view-independent, and it outperforms existing empty space skipping techniques in scenarios that need to render large dynamic volumes in a low resolution image. Through a series of micro-benchmarking and real-life data experiments, we rigorously analyze our sampling strategies and demonstrate significant performance enhancements over existing sampling methods.

  11. Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

    Science.gov (United States)

    Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

    2015-12-01

    A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.

  12. GPU acceleration of Monte Carlo simulations for polarized photon scattering in anisotropic turbid media.

    Science.gov (United States)

    Li, Pengcheng; Liu, Celong; Li, Xianpeng; He, Honghui; Ma, Hui

    2016-09-20

    In earlier studies, we developed scattering models and the corresponding CPU-based Monte Carlo simulation programs to study the behavior of polarized photons as they propagate through complex biological tissues. Studying the simulation results in high degrees of freedom that created a demand for massive simulation tasks. In this paper, we report a parallel implementation of the simulation program based on the compute unified device architecture running on a graphics processing unit (GPU). Different schemes for sphere-only simulations and sphere-cylinder mixture simulations were developed. Diverse optimizing methods were employed to achieve the best acceleration. The final-version GPU program is hundreds of times faster than the CPU version. Dependence of the performance on input parameters and precision were also studied. It is shown that using single precision in the GPU simulations results in very limited losses in accuracy. Consumer-level graphics cards, even those in laptop computers, are more cost-effective than scientific graphics cards for single-precision computation.

  13. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    Science.gov (United States)

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  14. Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    Directory of Open Access Journals (Sweden)

    Jinwei Wang

    2014-01-01

    Full Text Available The active appearance model (AAM is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA on the Nvidia’s GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  15. On-the-fly generation and rendering of infinite cities on the GPU

    KAUST Repository

    Steinberger, Markus

    2014-05-01

    In this paper, we present a new approach for shape-grammar-based generation and rendering of huge cities in real-time on the graphics processing unit (GPU). Traditional approaches rely on evaluating a shape grammar and storing the geometry produced as a preprocessing step. During rendering, the pregenerated data is then streamed to the GPU. By interweaving generation and rendering, we overcome the problems and limitations of streaming pregenerated data. Using our methods of visibility pruning and adaptive level of detail, we are able to dynamically generate only the geometry needed to render the current view in real-time directly on the GPU. We also present a robust and efficient way to dynamically update a scene\\'s derivation tree and geometry, enabling us to exploit frame-to-frame coherence. Our combined generation and rendering is significantly faster than all previous work. For detailed scenes, we are capable of generating geometry more rapidly than even just copying pregenerated data from main memory, enabling us to render cities with thousands of buildings at up to 100 frames per second, even with the camera moving at supersonic speed. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

  16. An implementation of the direct-forcing immersed boundary method using GPU power

    Directory of Open Access Journals (Sweden)

    Bulent Tutkun

    2017-01-01

    Full Text Available A graphics processing unit (GPU is utilized to apply the direct-forcing immersed boundary method. The code running on the GPU is generated with the help of the Compute Unified Device Architecture (CUDA. The first and second spatial derivatives of the incompressible Navier-Stokes equations are discretized by the sixth-order central compact finite-difference schemes. Two flow fields are simulated. The first test case is the simulated flow around a square cylinder, with the results providing good estimations of the wake region mechanics and vortex shedding. The second test case is the simulated flow around a circular cylinder. This case was selected to better understand the effects of sharp corners on the force coefficients. It was observed that the estimation of the force coefficients did not result in any troubles in the case of a circular cylinder. Additionally, the performance values obtained for the calculation time for the solution of the Poisson equation are compared with the values for other CPUs and GPUs from the literature. Consequently, approximately 3× and 20× speedups are achieved in comparison with GPU (using CUSP library and CPU, respectively. CUSP is an open-source library for sparse linear algebra and graph computations on CUDA.

  17. A GPU Implementation of Local Search Operators for Symmetric Travelling Salesman Problem

    Directory of Open Access Journals (Sweden)

    Juraj Fosin

    2013-06-01

    Full Text Available The Travelling Salesman Problem (TSP is one of the most studied combinatorial optimization problem which is significant in many practical applications in transportation problems. The TSP problem is NP-hard problem and requires large computation power to be solved by the exact algorithms. In the past few years, fast development of general-purpose Graphics Processing Units (GPUs has brought huge improvement in decreasing the applications’ execution time. In this paper, we implement 2-opt and 3-opt local search operators for solving the TSP on the GPU using CUDA. The novelty presented in this paper is a new parallel iterated local search approach with 2-opt and 3-opt operators for symmetric TSP, optimized for the execution on GPUs. With our implementation large TSP problems (up to 85,900 cities can be solved using the GPU. We will show that our GPU implementation can be up to 20x faster without losing quality for all TSPlib problems as well as for our CRO TSP problem.

  18. GPU-Based 3D Cone-Beam CT Image Reconstruction for Large Data Volume

    Directory of Open Access Journals (Sweden)

    Xing Zhao

    2009-01-01

    Full Text Available Currently, 3D cone-beam CT image reconstruction speed is still a severe limitation for clinical application. The computational power of modern graphics processing units (GPUs has been harnessed to provide impressive acceleration of 3D volume image reconstruction. For extra large data volume exceeding the physical graphic memory of GPU, a straightforward compromise is to divide data volume into blocks. Different from the conventional Octree partition method, a new partition scheme is proposed in this paper. This method divides both projection data and reconstructed image volume into subsets according to geometric symmetries in circular cone-beam projection layout, and a fast reconstruction for large data volume can be implemented by packing the subsets of projection data into the RGBA channels of GPU, performing the reconstruction chunk by chunk and combining the individual results in the end. The method is evaluated by reconstructing 3D images from computer-simulation data and real micro-CT data. Our results indicate that the GPU implementation can maintain original precision and speed up the reconstruction process by 110–120 times for circular cone-beam scan, as compared to traditional CPU implementation.

  19. A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing

    Science.gov (United States)

    Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.

    2014-01-01

    Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to own large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF that their computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055

  20. A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing

    Directory of Open Access Journals (Sweden)

    Ginés D. Guerrero

    2014-01-01

    Full Text Available Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO. This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to own large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF that their computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor.

  1. On-the-fly generation and rendering of infinite cities on the GPU

    KAUST Repository

    Steinberger, Markus; Kenzel, Michael; Kainz, Bernhard K.; Wonka, Peter; Schmalstieg, Dieter

    2014-01-01

    In this paper, we present a new approach for shape-grammar-based generation and rendering of huge cities in real-time on the graphics processing unit (GPU). Traditional approaches rely on evaluating a shape grammar and storing the geometry produced as a preprocessing step. During rendering, the pregenerated data is then streamed to the GPU. By interweaving generation and rendering, we overcome the problems and limitations of streaming pregenerated data. Using our methods of visibility pruning and adaptive level of detail, we are able to dynamically generate only the geometry needed to render the current view in real-time directly on the GPU. We also present a robust and efficient way to dynamically update a scene's derivation tree and geometry, enabling us to exploit frame-to-frame coherence. Our combined generation and rendering is significantly faster than all previous work. For detailed scenes, we are capable of generating geometry more rapidly than even just copying pregenerated data from main memory, enabling us to render cities with thousands of buildings at up to 100 frames per second, even with the camera moving at supersonic speed. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

  2. Visualizing whole-brain DTI tractography with GPU-based Tuboids and LoD management.

    Science.gov (United States)

    Petrovic, Vid; Fallon, James; Kuester, Falko

    2007-01-01

    Diffusion Tensor Imaging (DTI) of the human brain, coupled with tractography techniques, enable the extraction of large-collections of three-dimensional tract pathways per subject. These pathways and pathway bundles represent the connectivity between different brain regions and are critical for the understanding of brain related diseases. A flexible and efficient GPU-based rendering technique for DTI tractography data is presented that addresses common performance bottlenecks and image-quality issues, allowing interactive render rates to be achieved on commodity hardware. An occlusion query-based pathway LoD management system for streamlines/streamtubes/tuboids is introduced that optimizes input geometry, vertex processing, and fragment processing loads, and helps reduce overdraw. The tuboid, a fully-shaded streamtube impostor constructed entirely on the GPU from streamline vertices, is also introduced. Unlike full streamtubes and other impostor constructs, tuboids require little to no preprocessing or extra space over the original streamline data. The supported fragment processing levels of detail range from texture-based draft shading to full raycast normal computation, Phong shading, environment mapping, and curvature-correct text labeling. The presented text labeling technique for tuboids provides adaptive, aesthetically pleasing labels that appear attached to the surface of the tubes. Furthermore, an occlusion query aggregating and scheduling scheme for tuboids is described that reduces the query overhead. Results for a tractography dataset are presented, and demonstrate that LoD-managed tuboids offer benefits over traditional streamtubes both in performance and appearance.

  3. Fast GPU-based spot extraction for energy-dispersive X-ray Laue diffraction

    International Nuclear Information System (INIS)

    Alghabi, F.; Schipper, U.; Kolb, A.; Send, S.; Abboud, A.; Pashniak, N.; Pietsch, U.

    2014-01-01

    This paper describes a novel method for fast online analysis of X-ray Laue spots taken by means of an energy-dispersive X-ray 2D detector. Current pnCCD detectors typically operate at some 100 Hz (up to a maximum of 400 Hz) and have a resolution of 384 × 384 pixels, future devices head for even higher pixel counts and frame rates. The proposed online data analysis is based on a computer utilizing multiple Graphics Processing Units (GPUs), which allow for fast and parallel data processing. Our multi-GPU based algorithm is compliant with the rules of stream-based data processing, for which GPUs are optimized. The paper's main contribution is therefore an alternative algorithm for the determination of spot positions and energies over the full sequence of pnCCD data frames. Furthermore, an improved background suppression algorithm is presented.The resulting system is able to process data at the maximum acquisition rate of 400 Hz. We present a detailed analysis of the spot positions and energies deduced from a prior (single-core) CPU-based and the novel GPU-based data processing, showing that the parallel computed results using the GPU implementation are at least of the same quality as prior CPU-based results. Furthermore, the GPU-based algorithm is able to speed up the data processing by a factor of 7 (in comparison to single-core CPU-based algorithm) which effectively makes the detector system more suitable for online data processing

  4. GPU-accelerated depth map generation for X-ray simulations of complex CAD geometries

    Science.gov (United States)

    Grandin, Robert J.; Young, Gavin; Holland, Stephen D.; Krishnamurthy, Adarsh

    2018-04-01

    Interactive x-ray simulations of complex computer-aided design (CAD) models can provide valuable insights for better interpretation of the defect signatures such as porosity from x-ray CT images. Generating the depth map along a particular direction for the given CAD geometry is the most compute-intensive step in x-ray simulations. We have developed a GPU-accelerated method for real-time generation of depth maps of complex CAD geometries. We preprocess complex components designed using commercial CAD systems using a custom CAD module and convert them into a fine user-defined surface tessellation. Our CAD module can be used by different simulators as well as handle complex geometries, including those that arise from complex castings and composite structures. We then make use of a parallel algorithm that runs on a graphics processing unit (GPU) to convert the finely-tessellated CAD model to a voxelized representation. The voxelized representation can enable heterogeneous modeling of the volume enclosed by the CAD model by assigning heterogeneous material properties in specific regions. The depth maps are generated from this voxelized representation with the help of a GPU-accelerated ray-casting algorithm. The GPU-accelerated ray-casting method enables interactive (> 60 frames-per-second) generation of the depth maps of complex CAD geometries. This enables arbitrarily rotation and slicing of the CAD model, leading to better interpretation of the x-ray images by the user. In addition, the depth maps can be used to aid directly in CT reconstruction algorithms.

  5. GPU-based simulation of optical propagation through turbulence for active and passive imaging

    Science.gov (United States)

    Monnier, Goulven; Duval, François-Régis; Amram, Solène

    2014-10-01

    IMOTEP is a GPU-based (Graphical Processing Units) software relying on a fast parallel implementation of Fresnel diffraction through successive phase screens. Its applications include active imaging, laser telemetry and passive imaging through turbulence with anisoplanatic spatial and temporal fluctuations. Thanks to parallel implementation on GPU, speedups ranging from 40X to 70X are achieved. The present paper gives a brief overview of IMOTEP models, algorithms, implementation and user interface. It then focuses on major improvements recently brought to the anisoplanatic imaging simulation method. Previously, we took advantage of the computational power offered by the GPU to develop a simulation method based on large series of deterministic realisations of the PSF distorted by turbulence. The phase screen propagation algorithm, by reproducing higher moments of the incident wavefront distortion, provides realistic PSFs. However, we first used a coarse gaussian model to fit the numerical PSFs and characterise there spatial statistics through only 3 parameters (two-dimensional displacements of centroid and width). Meanwhile, this approach was unable to reproduce the effects related to the details of the PSF structure, especially the "speckles" leading to prominent high-frequency content in short-exposure images. To overcome this limitation, we recently implemented a new empirical model of the PSF, based on Principal Components Analysis (PCA), ought to catch most of the PSF complexity. The GPU implementation allows estimating and handling efficiently the numerous (up to several hundreds) principal components typically required under the strong turbulence regime. A first demanding computational step involves PCA, phase screen propagation and covariance estimates. In a second step, realistic instantaneous images, fully accounting for anisoplanatic effects, are quickly generated. Preliminary results are presented.

  6. Hardware-accelerated Point Generation and Rendering of Point-based Impostors

    DEFF Research Database (Denmark)

    Bærentzen, Jakob Andreas

    2005-01-01

    This paper presents a novel scheme for generating points from triangle models. The method is fast and lends itself well to implementation using graphics hardware. The triangle to point conversion is done by rendering the models, and the rendering may be performed procedurally or by a black box API....... I describe the technique in detail and discuss how the generated point sets can easily be used as impostors for the original triangle models used to create the points. Since the points reside solely in GPU memory, these impostors are fairly efficient. Source code is available online....

  7. Efficient lossy compression implementations of hyperspectral images: tools, hardware platforms, and comparisons

    Science.gov (United States)

    García, Aday; Santos, Lucana; López, Sebastián.; Callicó, Gustavo M.; Lopez, Jose F.; Sarmiento, Roberto

    2014-05-01

    Efficient onboard satellite hyperspectral image compression represents a necessity and a challenge for current and future space missions. Therefore, it is mandatory to provide hardware implementations for this type of algorithms in order to achieve the constraints required for onboard compression. In this work, we implement the Lossy Compression for Exomars (LCE) algorithm on an FPGA by means of high-level synthesis (HSL) in order to shorten the design cycle. Specifically, we use CatapultC HLS tool to obtain a VHDL description of the LCE algorithm from C-language specifications. Two different approaches are followed for HLS: on one hand, introducing the whole C-language description in CatapultC and on the other hand, splitting the C-language description in functional modules to be implemented independently with CatapultC, connecting and controlling them by an RTL description code without HLS. In both cases the goal is to obtain an FPGA implementation. We explain the several changes applied to the original Clanguage source code in order to optimize the results obtained by CatapultC for both approaches. Experimental results show low area occupancy of less than 15% for a SRAM-based Virtex-5 FPGA and a maximum frequency above 80 MHz. Additionally, the LCE compressor was implemented into an RTAX2000S antifuse-based FPGA, showing an area occupancy of 75% and a frequency around 53 MHz. All these serve to demonstrate that the LCE algorithm can be efficiently executed on an FPGA onboard a satellite. A comparison between both implementation approaches is also provided. The performance of the algorithm is finally compared with implementations on other technologies, specifically a graphics processing unit (GPU) and a single-threaded CPU.

  8. FARGO3D: A NEW GPU-ORIENTED MHD CODE

    Energy Technology Data Exchange (ETDEWEB)

    Benitez-Llambay, Pablo [Instituto de Astronomía Teórica y Experimental, Observatorio Astronónomico, Universidad Nacional de Córdoba. Laprida 854, X5000BGR, Córdoba (Argentina); Masset, Frédéric S., E-mail: pbllambay@oac.unc.edu.ar, E-mail: masset@icf.unam.mx [Instituto de Ciencias Físicas, Universidad Nacional Autónoma de México (UNAM), Apdo. Postal 48-3,62251-Cuernavaca, Morelos (Mexico)

    2016-03-15

    We present the FARGO3D code, recently publicly released. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet–disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of the magnetic field to machine accuracy, coupled to a method of characteristics for the evaluation of electromotive forces and Lorentz forces. Orbital advection is implemented, and an N-body solver is included to simulate planets or stars interacting with the gas. We present our implementation in detail and present a number of widely known tests for comparison purposes. One strength of FARGO3D is that it can run on either graphical processing units (GPUs) or central processing units (CPUs), achieving large speed-up with respect to CPU cores. We describe our implementation choices, which allow a user with no prior knowledge of GPU programming to develop new routines for CPUs, and have them translated automatically for GPUs.

  9. Balancing Energy and Performance in Dense Linear System Solvers for Hybrid ARM+GPU platforms

    Directory of Open Access Journals (Sweden)

    Juan P. Silva

    2016-04-01

    Full Text Available The high performance computing community has traditionally focused uniquely on the reduction of execution time, though in the last years, the optimization of energy consumption has become a main issue. A reduction of energy usage without a degradation of performance requires the adoption of energy-efficient hardware platforms accompanied by the development of energy-aware algorithms and computational kernels. The solution of linear systems is a key operation for many scientific and engineering problems. Its relevance has motivated an important amount of work, and consequently, it is possible to find high performance solvers for a wide variety of hardware platforms. In this work, we aim to develop a high performance and energy-efficient linear system solver. In particular, we develop two solvers for a low-power CPU-GPU platform, the NVIDIA Jetson TK1. These solvers implement the Gauss-Huard algorithm yielding an efficient usage of the target hardware as well as an efficient memory access. The experimental evaluation shows that the novel proposal reports important savings in both time and energy-consumption when compared with the state-of-the-art solvers of the platform.

  10. GPU Linear algebra extensions for GNU/Octave

    International Nuclear Information System (INIS)

    Bosi, L B; Mariotti, M; Santocchia, A

    2012-01-01

    Octave is one of the most widely used open source tools for numerical analysis and liner algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector- vector, vector matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program GPU kernels, which are bound within the GNU/octave framework. We report the project implementation design and some preliminary results about performance.

  11. Work-Efficient Parallel Skyline Computation for the GPU

    DEFF Research Database (Denmark)

    Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira

    2015-01-01

    offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses...... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-effciency and respectable throughput, rather than...

  12. Multi-GPU parallel algorithm design and analysis for improved inversion of probability tomography with gravity gradiometry data

    Science.gov (United States)

    Hou, Zhenlong; Huang, Danian

    2017-09-01

    In this paper, we make a study on the inversion of probability tomography (IPT) with gravity gradiometry data at first. The space resolution of the results is improved by multi-tensor joint inversion, depth weighting matrix and the other methods. Aiming at solving the problems brought by the big data in the exploration, we present the parallel algorithm and the performance analysis combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) based on Graphics Processing Unit (GPU) accelerating. In the test of the synthetic model and real data from Vinton Dome, we get the improved results. It is also proved that the improved inversion algorithm is effective and feasible. The performance of parallel algorithm we designed is better than the other ones with CUDA. The maximum speedup could be more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger scale of data and the new analysis method is practical.

  13. Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

    Science.gov (United States)

    Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

    2014-03-11

    Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.

  14. GPU-accelerated FDTD modeling of radio-frequency field-tissue interactions in high-field MRI.

    Science.gov (United States)

    Chi, Jieru; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart

    2011-06-01

    The analysis of high-field RF field-tissue interactions requires high-performance finite-difference time-domain (FDTD) computing. Conventional CPU-based FDTD calculations offer limited computing performance in a PC environment. This study presents a graphics processing unit (GPU)-based parallel-computing framework, producing substantially boosted computing efficiency (with a two-order speedup factor) at a PC-level cost. Specific details of implementing the FDTD method on a GPU architecture have been presented and the new computational strategy has been successfully applied to the design of a novel 8-element transceive RF coil system at 9.4 T. Facilitated by the powerful GPU-FDTD computing, the new RF coil array offers optimized fields (averaging 25% improvement in sensitivity, and 20% reduction in loop coupling compared with conventional array structures of the same size) for small animal imaging with a robust RF configuration. The GPU-enabled acceleration paves the way for FDTD to be applied for both detailed forward modeling and inverse design of MRI coils, which were previously impractical.

  15. Evaluation of speedup of Monte Carlo calculations of two simple reactor physics problems coded for the GPU/CUDA environment

    International Nuclear Information System (INIS)

    Ding, Aiping; Liu, Tianyu; Liang, Chao; Ji, Wei; Shephard, Mark S.; Xu, X George; Brown, Forrest B.

    2011-01-01

    Monte Carlo simulation is ideally suited for solving Boltzmann neutron transport equation in inhomogeneous media. However, routine applications require the computation time to be reduced to hours and even minutes in a desktop system. The interest in adopting GPUs for Monte Carlo acceleration is rapidly mounting, fueled partially by the parallelism afforded by the latest GPU technologies and the challenge to perform full-size reactor core analysis on a routine basis. In this study, Monte Carlo codes for a fixed-source neutron transport problem and an eigenvalue/criticality problem were developed for CPU and GPU environments, respectively, to evaluate issues associated with computational speedup afforded by the use of GPUs. The results suggest that a speedup factor of 30 in Monte Carlo radiation transport of neutrons is within reach using the state-of-the-art GPU technologies. However, for the eigenvalue/criticality problem, the speedup was 8.5. In comparison, for a task of voxelizing unstructured mesh geometry that is more parallel in nature, the speedup of 45 was obtained. It was observed that, to date, most attempts to adopt GPUs for Monte Carlo acceleration were based on naïve implementations and have not yielded the level of anticipated gains. Successful implementation of Monte Carlo schemes for GPUs will likely require the development of an entirely new code. Given the prediction that future-generation GPU products will likely bring exponentially improved computing power and performances, innovative hardware and software solutions may make it possible to achieve full-core Monte Carlo calculation within one hour using a desktop computer system in a few years. (author)

  16. GPU-based large-scale visualization

    KAUST Repository

    Hadwiger, Markus; Krueger, Jens; Beyer, Johanna; Bruckner, Stefan

    2013-01-01

    and how to render and process gigapixel images using scalable, display-aware techniques. We will describe custom virtual texturing architectures as well as recent hardware developments in this area. We will also describe client/server systems

  17. NDAS Hardware Translation Layer Development

    Science.gov (United States)

    Nazaretian, Ryan N.; Holladay, Wendy T.

    2011-01-01

    The NASA Data Acquisition System (NDAS) project is aimed to replace all DAS software for NASA s Rocket Testing Facilities. There must be a software-hardware translation layer so the software can properly talk to the hardware. Since the hardware from each test stand varies, drivers for each stand have to be made. These drivers will act more like plugins for the software. If the software is being used in E3, then the software should point to the E3 driver package. If the software is being used at B2, then the software should point to the B2 driver package. The driver packages should also be filled with hardware drivers that are universal to the DAS system. For example, since A1, A2, and B2 all use the Preston 8300AU signal conditioners, then the driver for those three stands should be the same and updated collectively.

  18. Hardware for dynamic quantum computing.

    Science.gov (United States)

    Ryan, Colm A; Johnson, Blake R; Ristè, Diego; Donovan, Brian; Ohki, Thomas A

    2017-10-01

    We describe the hardware, gateware, and software developed at Raytheon BBN Technologies for dynamic quantum information processing experiments on superconducting qubits. In dynamic experiments, real-time qubit state information is fed back or fed forward within a fraction of the qubits' coherence time to dynamically change the implemented sequence. The hardware presented here covers both control and readout of superconducting qubits. For readout, we created a custom signal processing gateware and software stack on commercial hardware to convert pulses in a heterodyne receiver into qubit state assignments with minimal latency, alongside data taking capability. For control, we developed custom hardware with gateware and software for pulse sequencing and steering information distribution that is capable of arbitrary control flow in a fraction of superconducting qubit coherence times. Both readout and control platforms make extensive use of field programmable gate arrays to enable tailored qubit control systems in a reconfigurable fabric suitable for iterative development.

  19. The performances of R GPU implementations of the GMRES method

    Directory of Open Access Journals (Sweden)

    Bogdan Oancea

    2018-03-01

    Full Text Available Although the performance of commodity computers has improved drastically with the introduction of multicore processors and GPU computing, the standard R distribution is still based on single-threaded model of computation, using only a small fraction of the computational power available now for most desktops and laptops. Modern statistical software packages rely on high performance implementations of the linear algebra routines there are at the core of several important leading edge statistical methods. In this paper we present a GPU implementation of the GMRES iterative method for solving linear systems. We compare the performance of this implementation with a pure single threaded version of the CPU. We also investigate the performance of our implementation using different GPU packages available now for R such as gmatrix, gputools or gpuR which are based on CUDA or OpenCL frameworks.

  20. Synergia CUDA: GPU-accelerated accelerator modeling package

    International Nuclear Information System (INIS)

    Lu, Q; Amundson, J

    2014-01-01

    Synergia is a parallel, 3-dimensional space-charge particle-in-cell accelerator modeling code. We present our work porting the purely MPI-based version of the code to a hybrid of CPU and GPU computing kernels. The hybrid code uses the CUDA platform in the same framework as the pure MPI solution. We have implemented a lock-free collaborative charge-deposition algorithm for the GPU, as well as other optimizations, including local communication avoidance for GPUs, a customized FFT, and fine-tuned memory access patterns. On a small GPU cluster (up to 4 Tesla C1070 GPUs), our benchmarks exhibit both superior peak performance and better scaling than a CPU cluster with 16 nodes and 128 cores. We also compare the code performance on different GPU architectures, including C1070 Tesla and K20 Kepler.

  1. GPU credit reduced, tie to TMI-1 cheating discounted

    International Nuclear Information System (INIS)

    Utroska, D.

    1981-01-01

    The recent reduction of credit available to General Public Utilities (GPU) Nuclear may be linked to a cheating incident involving two reactor operators at the Three Mile Island-1 (TMI-1) reactor. The incident caused the Nuclear Regulatory Commission to reopen the managerial portion of the restart hearings and may delay the restart. The delay and the lower credit line will worsen GPU's financial position. Banks claim that misgivings about TMI-1 influence them more than the cheating, although GPU had been gradually improving its financial situation since the TMI-2 accident. The new agreement gives GPU $150 million in immediate credit, but lowers the interim ceiling from $292 million to $200 million. A spokesman from the Office of Management and Budget acknowledges that administration plans to limit the federal role to research and development softened under political pressure

  2. GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

    International Nuclear Information System (INIS)

    Takaishi, Tetsuya

    2015-01-01

    The realized stochastic volatility (RSV) model that utilizes the realized volatility as additional information has been proposed to infer volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on GPU (GTX 760) and CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve the similar speedup with CUDA Fortran

  3. FY1995 evolvable hardware chip; 1995 nendo shinkasuru hardware chip

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1997-03-01

    This project aims at the development of 'Evolvable Hardware' (EHW) which can adapt its hardware structure to the environment to attain better hardware performance, under the control of genetic algorithms. EHW is a key technology to explore the new application area requiring real-time performance and on-line adaptation. 1. Development of EHW-LSI for function level hardware evolution, which includes 15 DSPs in one chip. 2. Application of the EHW to the practical industrial applications such as data compression, ATM control, digital mobile communication. 3. Two patents : (1) the architecture and the processing method for programmable EHW-LSI. (2) The method of data compression for loss-less data, using EHW. 4. The first international conference for evolvable hardware was held by authors: Intl. Conf. on Evolvable Systems (ICES96). It was determined at ICES96 that ICES will be held every two years between Japan and Europe. So the new society has been established by us. (NEDO)

  4. a method of gravity and seismic sequential inversion and its GPU implementation

    Science.gov (United States)

    Liu, G.; Meng, X.

    2011-12-01

    In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on correlation imaging algorithm; for the seismic inversion, we use the full waveform inversion. The link between the density and velocity is an empirical formula called Gardner equation, for large volumes of data, we use the GPU to accelerate the computation. For the gravity inversion method , we introduce a method based on correlation imaging algorithm,it is also a interative method, first we calculate the correlation imaging of the observed gravity anomaly, it is some value between -1 and +1, then we multiply this value with a little density ,this value become the initial density model. We get a forward reuslt with this initial model and also calculate the correaltion imaging of the misfit of observed data and the forward data, also multiply the correaltion imaging result a little density and add it to the initial model, then do the same procedure above , at last ,we can get a inversion density model. For the seismic inveron method ,we use a mothod base on the linearity of acoustic wave equation written in the frequency domain,with a intial velociy model, we can get a good velocity result. In the sequential inversion of gravity and seismic , we need a link formula to convert between density and velocity ,in our method , we use the Gardner equation. Driven by the insatiable market demand for real time, high-definition 3D images, the programmable NVIDIA Graphic Processing Unit (GPU) as co-processor of CPU has been developed for high performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA designed to overcome the challenge of using traditional general purpose GPU while maintaining a low learn curve for programmers familiar with standard programming languages such as C. In our inversion processing

  5. A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

    Science.gov (United States)

    Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun

    2015-09-01

    Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by

  6. A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).

    Science.gov (United States)

    Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

    2015-10-07

    Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by

  7. A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

    International Nuclear Information System (INIS)

    Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

    2015-01-01

    Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon–electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783–97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48–0.53% for the electron beam cases and 0.15–0.17% for the photon beam cases. In terms of efficiency, goMC was ∼4–16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was

  8. GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.

    Directory of Open Access Journals (Sweden)

    Ahmed Shamsul Arefin

    Full Text Available BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data are craving for extreme computational power and special computing facilities (i.e. super-computers. An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU, can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Being very simple and straightforward, the performance of the kNN search degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool is available under GNU Public License (GPL at https://sourceforge.net/p/gpufsknn/.

  9. Fast magnetic field computation in fusion technology using GPU technology

    Energy Technology Data Exchange (ETDEWEB)

    Chiariello, Andrea Gaetano [Ass. EURATOM/ENEA/CREATE, Dipartimento di Ingegneria Industriale e dell’Informazione, Seconda Università di Napoli, Via Roma 29, Aversa (CE) (Italy); Formisano, Alessandro, E-mail: Alessandro.Formisano@unina2.it [Ass. EURATOM/ENEA/CREATE, Dipartimento di Ingegneria Industriale e dell’Informazione, Seconda Università di Napoli, Via Roma 29, Aversa (CE) (Italy); Martone, Raffaele [Ass. EURATOM/ENEA/CREATE, Dipartimento di Ingegneria Industriale e dell’Informazione, Seconda Università di Napoli, Via Roma 29, Aversa (CE) (Italy)

    2013-10-15

    Highlights: ► The paper deals with high accuracy numerical simulations of high field magnets. ► The porting of existing codes of High Performance Computing architectures allowed to obtain a relevant speedup while not reducing computational accuracy. ► Some examples of applications, referred to ITER-like magnets, are reported. -- Abstract: One of the main issues in the simulation of Tokamaks functioning is the reliable and accurate computation of actual field maps in the plasma chamber. In this paper a tool able to accurately compute magnetic field maps produced by active coils of any 3D shape, wound with high number of conductors, is presented. Under linearity assumption, the coil winding is modeled by means of “sticks”, following each conductor's shape, and the contribution of each stick is computed using high speed Graphic Computing Units (GPU's). Relevant speed enhancements with respect to standard parallel computing environment are achieved in this way.

  10. GALARIO: a GPU accelerated library for analysing radio interferometer observations

    Science.gov (United States)

    Tazzari, Marco; Beaujean, Frederik; Testi, Leonardo

    2018-06-01

    We present GALARIO, a computational library that exploits the power of modern graphical processing units (GPUs) to accelerate the analysis of observations from radio interferometers like Atacama Large Millimeter and sub-millimeter Array or the Karl G. Jansky Very Large Array. GALARIO speeds up the computation of synthetic visibilities from a generic 2D model image or a radial brightness profile (for axisymmetric sources). On a GPU, GALARIO is 150 faster than standard PYTHON and 10 times faster than serial C++ code on a CPU. Highly modular, easy to use, and to adopt in existing code, GALARIO comes as two compiled libraries, one for Nvidia GPUs and one for multicore CPUs, where both have the same functions with identical interfaces. GALARIO comes with PYTHON bindings but can also be directly used in C or C++. The versatility and the speed of GALARIO open new analysis pathways that otherwise would be prohibitively time consuming, e.g. fitting high-resolution observations of large number of objects, or entire spectral cubes of molecular gas emission. It is a general tool that can be applied to any field that uses radio interferometer observations. The source code is available online at http://github.com/mtazzari/galario under the open source GNU Lesser General Public License v3.

  11. Accelerated finite element elastodynamic simulations using the GPU

    Energy Technology Data Exchange (ETDEWEB)

    Huthwaite, Peter, E-mail: p.huthwaite@imperial.ac.uk

    2014-01-15

    An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source.

  12. Accelerated finite element elastodynamic simulations using the GPU

    International Nuclear Information System (INIS)

    Huthwaite, Peter

    2014-01-01

    An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source

  13. Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

    Science.gov (United States)

    Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

    2015-01-01

    This paper evaluates the potential of embedded Graphic Processing Units in the Nvidias Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and full fledge GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.

  14. Processamento da rede neocognitron para reconhecimento facial em ambiente de alto desempenho GPU

    OpenAIRE

    Gustavo Poli Lameirão da Silva

    2007-01-01

    Neste trabalho é apresentada a implementação da Rede Neural Neocognitron, usando uma arquitetura de computação de alto desempenho baseada em GPU (Graphics Processing Unit). O Neocognitron é uma rede neural artificial, proposta por Fukushima e colaboradores, constituída de vários estágios de camadas de neurônios, organizados em matrizes bidimensionais denominadas planos celulares. Para o processamento de alto desempenho da aplicação de reconhecimento facial usando neocognitron foi utilizado o ...

  15. Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System.

    Science.gov (United States)

    Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun

    2015-01-01

    The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.

  16. Development of Hardware Dual Modality Tomography System

    Directory of Open Access Journals (Sweden)

    R. M. Zain

    2009-06-01

    Full Text Available The paper describes the hardware development and performance of the Dual Modality Tomography (DMT system. DMT consists of optical and capacitance sensors. The optical sensors consist of 16 LEDs and 16 photodiodes. The Electrical Capacitance Tomography (ECT electrode design use eight electrode plates as the detecting sensor. The digital timing and the control unit have been developing in order to control the light projection of optical emitters, switching the capacitance electrodes and to synchronize the operation of data acquisition. As a result, the developed system is able to provide a maximum 529 set data per second received from the signal conditioning circuit to the computer.

  17. cellGPU: Massively parallel simulations of dynamic vertex models

    Science.gov (United States)

    Sussman, Daniel M.

    2017-10-01

    Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation

  18. Raspberry Pi hardware projects 1

    CERN Document Server

    Robinson, Andrew

    2013-01-01

    Learn how to take full advantage of all of Raspberry Pi's amazing features and functions-and have a blast doing it! Congratulations on becoming a proud owner of a Raspberry Pi, the credit-card-sized computer! If you're ready to dive in and start finding out what this amazing little gizmo is really capable of, this ebook is for you. Taken from the forthcoming Raspberry Pi Projects, Raspberry Pi Hardware Projects 1 contains three cool hardware projects that let you have fun with the Raspberry Pi while developing your Raspberry Pi skills. The authors - PiFace inventor, Andrew Robinson and Rasp

  19. A survey and measurement study of GPU DVFS on energy conservation

    Directory of Open Access Journals (Sweden)

    Xinxin Mei

    2017-05-01

    Full Text Available Energy efficiency has become one of the top design criteria for current computing systems. The dynamic voltage and frequency scaling (DVFS has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while the GPU DVFS is still at a certain early age. This paper aims at exploring the impact of GPU DVFS on the application performance and power consumption, and furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations, and then summarize recent research works on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends on not only the GPU architectures, but also the characteristic of GPU applications.

  20. Accelerating Dense Linear Algebra on the GPU

    DEFF Research Database (Denmark)

    Sørensen, Hans Henrik Brandenborg

    and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subroutines (BLAS) library and are therefore of great importance in many scientific applications. The target hardware is the most recent NVIDIA Tesla 20-series (Fermi...

  1. Parallel fuzzy connected image segmentation on GPU

    OpenAIRE

    Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.

    2011-01-01

    Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm impleme...

  2. Comparison of two accelerators for Monte Carlo radiation transport calculations, Nvidia Tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor: A case study for X-ray CT imaging dose calculation

    International Nuclear Information System (INIS)

    Liu, T.; Xu, X.G.; Carothers, C.D.

    2015-01-01

    Highlights: • A new Monte Carlo photon transport code ARCHER-CT for CT dose calculations is developed to execute on the GPU and coprocessor. • ARCHER-CT is verified against MCNP. • The GPU code on an Nvidia M2090 GPU is 5.15–5.81 times faster than the parallel CPU code on an Intel X5650 6-core CPU. • The coprocessor code on an Intel Xeon Phi 5110p coprocessor is 3.30–3.38 times faster than the CPU code. - Abstract: Hardware accelerators are currently becoming increasingly important in boosting high performance computing systems. In this study, we tested the performance of two accelerator models, Nvidia Tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor, using a new Monte Carlo photon transport package called ARCHER-CT we have developed for fast CT imaging dose calculation. The package contains three components, ARCHER-CT CPU , ARCHER-CT GPU and ARCHER-CT COP designed to be run on the multi-core CPU, GPU and coprocessor architectures respectively. A detailed GE LightSpeed Multi-Detector Computed Tomography (MDCT) scanner model and a family of voxel patient phantoms are included in the code to calculate absorbed dose to radiosensitive organs under user-specified scan protocols. The results from ARCHER agree well with those from the production code Monte Carlo N-Particle eXtended (MCNPX). It is found that all the code components are significantly faster than the parallel MCNPX run on 12 MPI processes, and that the GPU and coprocessor codes are 5.15–5.81 and 3.30–3.38 times faster than the parallel ARCHER-CT CPU , respectively. The M2090 GPU performs better than the 5110p coprocessor in our specific test. Besides, the heterogeneous computation mode in which the CPU and the hardware accelerator work concurrently can increase the overall performance by 13–18%

  3. Semi-automatic tool to ease the creation and optimization of GPU programs

    DEFF Research Database (Denmark)

    Jepsen, Jacob

    2014-01-01

    We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form...... of the transformations can be performed automatically, which makes the tool usable for both novices and experts in GPU programming....

  4. Performance analysis and acceleration of explicit integration for large kinetic networks using batched GPU computations

    Energy Technology Data Exchange (ETDEWEB)

    Shyles, Daniel [University of Tennessee (UT); Dongarra, Jack J. [University of Tennessee, Knoxville (UTK); Guidry, Mike W. [ORNL; Tomov, Stanimire Z. [ORNL; Billings, Jay Jay [ORNL; Brock, Benjamin A. [ORNL; Haidar Ahmad, Azzam A. [ORNL

    2016-09-01

    Abstract—We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms that solve efficiently N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that similarly to many other scientific applications, ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small-size. As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, that we call batched routine, in order to saturate the hardware with enough work.

  5. Hardware standardization for embedded systems

    International Nuclear Information System (INIS)

    Sharma, M.K.; Kalra, Mohit; Patil, M.B.; Mohanty, Ashutos; Ganesh, G.; Biswas, B.B.

    2010-01-01

    Reactor Control Division (RCnD) has been one of the main designers of safety and safety related systems for power reactors. These systems have been built using in-house developed hardware. Since the present set of hardware was designed long ago, a need was felt to design a new family of hardware boards. A Working Group on Electronics Hardware Standardization (WG-EHS) was formed with an objective to develop a family of boards, which is general purpose enough to meet the requirements of the system designers/end users. RCnD undertook the responsibility of design, fabrication and testing of boards for embedded systems. VME and a proprietary I/O bus were selected as the two system buses. The boards have been designed based on present day technology and components. The intelligence of these boards has been implemented on FPGA/CPLD using VHDL. This paper outlines the various boards that have been developed with a brief description. (author)

  6. Commodity hardware and software summary

    International Nuclear Information System (INIS)

    Wolbers, S.

    1997-04-01

    A review is given of the talks and papers presented in the Commodity Hardware and Software Session at the CHEP97 conference. An examination of the trends leading to the consideration of PC's for HEP is given, and a status of the work that is being done at various HEP labs and Universities is given

  7. Accelerating large-scale phase-field simulations with GPU

    Directory of Open Access Journals (Sweden)

    Xiaoming Shi

    2017-10-01

    Full Text Available A new package for accelerating large-scale phase-field simulations was developed by using GPU based on the semi-implicit Fourier method. The package can solve a variety of equilibrium equations with different inhomogeneity including long-range elastic, magnetostatic, and electrostatic interactions. Through using specific algorithm in Compute Unified Device Architecture (CUDA, Fourier spectral iterative perturbation method was integrated in GPU package. The Allen-Cahn equation, Cahn-Hilliard equation, and phase-field model with long-range interaction were solved based on the algorithm running on GPU respectively to test the performance of the package. From the comparison of the calculation results between the solver executed in single CPU and the one on GPU, it was found that the speed on GPU is enormously elevated to 50 times faster. The present study therefore contributes to the acceleration of large-scale phase-field simulations and provides guidance for experiments to design large-scale functional devices.

  8. Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU

    Directory of Open Access Journals (Sweden)

    Yong Xia

    2015-01-01

    Full Text Available Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation and the other is the diffusion term of the monodomain model (partial differential equation. Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.

  9. A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions

    Science.gov (United States)

    Liang, Yihao; Xing, Xiangjun; Li, Yaohang

    2017-06-01

    In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures, and adopts the sequential updating scheme of Metropolis algorithm. It makes no approximation in the computation of energy, and reaches a remarkable 440-fold speedup, compared with the serial implementation on CPU. We further use this method to simulate primitive model electrolytes, and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.

  10. GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration

    International Nuclear Information System (INIS)

    Sharp, G C; Kandasamy, N; Singh, H; Folkert, M

    2007-01-01

    This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup-up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration

  11. GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES

    Energy Technology Data Exchange (ETDEWEB)

    Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal [Department of Astronomy, University of Arizona, 933 N. Cherry Ave., Tucson, AZ 85721 (United States)

    2013-11-01

    We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.

  12. A GPU-based incompressible Navier-Stokes solver on moving overset grids

    Science.gov (United States)

    Chandar, Dominic D. J.; Sitaraman, Jayanarayanan; Mavriplis, Dimitri J.

    2013-07-01

    In pursuit of obtaining high fidelity solutions to the fluid flow equations in a short span of time, graphics processing units (GPUs) which were originally intended for gaming applications are currently being used to accelerate computational fluid dynamics (CFD) codes. With a high peak throughput of about 1 TFLOPS on a PC, GPUs seem to be favourable for many high-resolution computations. One such computation that involves a lot of number crunching is computing time accurate flow solutions past moving bodies. The aim of the present paper is thus to discuss the development of a flow solver on unstructured and overset grids and its implementation on GPUs. In its present form, the flow solver solves the incompressible fluid flow equations on unstructured/hybrid/overset grids using a fully implicit projection method. The resulting discretised equations are solved using a matrix-free Krylov solver using several GPU kernels such as gradient, Laplacian and reduction. Some of the simple arithmetic vector calculations are implemented using the CU++: An Object Oriented Framework for Computational Fluid Dynamics Applications using Graphics Processing Units, Journal of Supercomputing, 2013, doi:10.1007/s11227-013-0985-9 approach where GPU kernels are automatically generated at compile time. Results are presented for two- and three-dimensional computations on static and moving grids.

  13. A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling

    Directory of Open Access Journals (Sweden)

    Roger V Hoang

    2013-10-01

    Full Text Available Computational Neuroscience is an emerging field that provides unique opportunities to studycomplex brain structures through realistic neural simulations. However, as biological details are added tomodels, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs are now being utilized to accelerate simulations due to their ability to perform computations in parallel. As such, they haveshown significant improvement in execution time compared to Central Processing Units (CPUs. Most neural simulators utilize either multiple CPUs or a single GPU for better performance, but still show limitations in execution time when biological details are not sacrificed. Therefore, we present a novel CPU/GPU simulation environment for large-scale biological networks,the NeoCortical Simulator version 6 (NCS6. NCS6 is a free, open-source, parallelizable, and scalable simula-tor, designed to run on clusters of multiple machines, potentially with high performance computing devicesin each of them. It has built-in leaky-integrate-and-fire (LIF and Izhikevich (IZH neuron models, but usersalso have the capability to design their own plug-in interface for different neuron types as desired. NCS6is currently able to simulate one million cells and 100 million synapses in quasi real time by distributing dataacross these heterogeneous clusters of CPUs and GPUs.

  14. MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL.

    Science.gov (United States)

    Hua, Guan-Jie; Hung, Che-Lun; Lin, Chun-Yuan; Wu, Fu-Che; Chan, Yu-Wei; Tang, Chuan Yi

    2017-01-01

    A phylogenetic tree is a visual diagram of the relationship between a set of biological species. The scientists usually use it to analyze many characteristics of the species. The distance-matrix methods, such as Unweighted Pair Group Method with Arithmetic Mean and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa. These methods have the computational performance issue. Although several new methods with high-performance hardware and frameworks have been proposed, the issue still exists. In this work, a novel parallel Unweighted Pair Group Method with Arithmetic Mean approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from extremely large set of sequences. The experimental results present that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphic cards achieves approximately 3-fold to 7-fold speedup over the implementation of Unweighted Pair Group Method with Arithmetic Mean on a modern CPU and a single GPU, respectively.

  15. GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications

    International Nuclear Information System (INIS)

    Lemaréchal, Yannick; Bert, Julien; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris; Falconnet, Claire; Després, Philippe; Valeri, Antoine

    2015-01-01

    In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled 125 I seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400  × 250  × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms 10 −6 simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications. (paper)

  16. A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction

    International Nuclear Information System (INIS)

    Zheng, Shawn Q.; Branlund, Eric; Kesthelyi, Bettina; Braunfeld, Michael B.; Cheng, Yifan; Sedat, John W.; Agard, David A.

    2011-01-01

    Full resolution electron microscopic tomographic (EMT) reconstruction of large-scale tilt series requires significant computing power. The desire to perform multiple cycles of iterative reconstruction and realignment dramatically increases the pressing need to improve reconstruction performance. This has motivated us to develop a distributed multi-GPU (graphics processing unit) system to provide the required computing power for rapid constrained, iterative reconstructions of very large three-dimensional (3D) volumes. The participating GPUs reconstruct segments of the volume in parallel, and subsequently, the segments are assembled to form the complete 3D volume. Owing to its power and versatility, the CUDA (NVIDIA, USA) platform was selected for GPU implementation of the EMT reconstruction. For a system containing 10 GPUs provided by 5 GTX295 cards, 10 cycles of SIRT reconstruction for a tomogram of 4096 2 x512 voxels from an input tilt series containing 122 projection images of 4096 2 pixels (single precision float) takes a total of 1845 s of which 1032 s are for computation with the remainder being the system overhead. The same system takes only 39 s total to reconstruct 1024 2 x256 voxels from 122 1024 2 pixel projections. While the system overhead is non-trivial, performance analysis indicates that adding extra GPUs to the system would lead to steadily enhanced overall performance. Therefore, this system can be easily expanded to generate superior computing power for very large tomographic reconstructions and especially to empower iterative cycles of reconstruction and realignment. -- Highlights: → A distributed multi-GPU system has been developed for electron microscopic tomography (EMT). → This system allows for rapid constrained, iterative reconstruction of very large volumes. → This system can be easily expanded to generate superior computing power for large-scale iterative EMT realignment.

  17. Fully 3D iterative scatter-corrected OSEM for HRRT PET using a GPU

    Energy Technology Data Exchange (ETDEWEB)

    Kim, Kyung Sang; Ye, Jong Chul, E-mail: kssigari@kaist.ac.kr, E-mail: jong.ye@kaist.ac.kr [Bio-Imaging and Signal Processing Lab., Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), 335 Gwahak-no, Yuseong-gu, Daejon 305-701 (Korea, Republic of)

    2011-08-07

    Accurate scatter correction is especially important for high-resolution 3D positron emission tomographies (PETs) such as high-resolution research tomograph (HRRT) due to large scatter fraction in the data. To address this problem, a fully 3D iterative scatter-corrected ordered subset expectation maximization (OSEM) in which a 3D single scatter simulation (SSS) is alternatively performed with a 3D OSEM reconstruction was recently proposed. However, due to the computational complexity of both SSS and OSEM algorithms for a high-resolution 3D PET, it has not been widely used in practice. The main objective of this paper is, therefore, to accelerate the fully 3D iterative scatter-corrected OSEM using a graphics processing unit (GPU) and verify its performance for an HRRT. We show that to exploit the massive thread structures of the GPU, several algorithmic modifications are necessary. For SSS implementation, a sinogram-driven approach is found to be more appropriate compared to a detector-driven approach, as fast linear interpolation can be performed in the sinogram domain through the use of texture memory. Furthermore, a pixel-driven backprojector and a ray-driven projector can be significantly accelerated by assigning threads to voxels and sinograms, respectively. Using Nvidia's GPU and compute unified device architecture (CUDA), the execution time of a SSS is less than 6 s, a single iteration of OSEM with 16 subsets takes 16 s, and a single iteration of the fully 3D scatter-corrected OSEM composed of a SSS and six iterations of OSEM takes under 105 s for the HRRT geometry, which corresponds to acceleration factors of 125x and 141x for OSEM and SSS, respectively. The fully 3D iterative scatter-corrected OSEM algorithm is validated in simulations using Geant4 application for tomographic emission and in actual experiments using an HRRT.

  18. A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction

    Energy Technology Data Exchange (ETDEWEB)

    Zheng, Shawn Q.; Branlund, Eric; Kesthelyi, Bettina; Braunfeld, Michael B.; Cheng, Yifan; Sedat, John W. [The Howard Hughes Medical Institute and the W.M. Keck Advanced Microscopy Laboratory, Department of Biochemistry and Biophysics, University of California, San Francisco, 600, 16th Street, Room S412D, CA 94158-2517 (United States); Agard, David A., E-mail: agard@msg.ucsf.edu [The Howard Hughes Medical Institute and the W.M. Keck Advanced Microscopy Laboratory, Department of Biochemistry and Biophysics, University of California, San Francisco, 600, 16th Street, Room S412D, CA 94158-2517 (United States)

    2011-07-15

    Full resolution electron microscopic tomographic (EMT) reconstruction of large-scale tilt series requires significant computing power. The desire to perform multiple cycles of iterative reconstruction and realignment dramatically increases the pressing need to improve reconstruction performance. This has motivated us to develop a distributed multi-GPU (graphics processing unit) system to provide the required computing power for rapid constrained, iterative reconstructions of very large three-dimensional (3D) volumes. The participating GPUs reconstruct segments of the volume in parallel, and subsequently, the segments are assembled to form the complete 3D volume. Owing to its power and versatility, the CUDA (NVIDIA, USA) platform was selected for GPU implementation of the EMT reconstruction. For a system containing 10 GPUs provided by 5 GTX295 cards, 10 cycles of SIRT reconstruction for a tomogram of 4096{sup 2}x512 voxels from an input tilt series containing 122 projection images of 4096{sup 2} pixels (single precision float) takes a total of 1845 s of which 1032 s are for computation with the remainder being the system overhead. The same system takes only 39 s total to reconstruct 1024{sup 2}x256 voxels from 122 1024{sup 2} pixel projections. While the system overhead is non-trivial, performance analysis indicates that adding extra GPUs to the system would lead to steadily enhanced overall performance. Therefore, this system can be easily expanded to generate superior computing power for very large tomographic reconstructions and especially to empower iterative cycles of reconstruction and realignment. -- Highlights: {yields} A distributed multi-GPU system has been developed for electron microscopic tomography (EMT). {yields} This system allows for rapid constrained, iterative reconstruction of very large volumes. {yields} This system can be easily expanded to generate superior computing power for large-scale iterative EMT realignment.

  19. GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications

    Science.gov (United States)

    Lemaréchal, Yannick; Bert, Julien; Falconnet, Claire; Després, Philippe; Valeri, Antoine; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris

    2015-07-01

    In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled 125I seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400  × 250  × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms 10-6 simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications.

  20. Web-based Tsunami Early Warning System with instant Tsunami Propagation Calculations in the GPU Cloud

    Science.gov (United States)

    Hammitzsch, M.; Spazier, J.; Reißland, S.

    2014-12-01

    Usually, tsunami early warning and mitigation systems (TWS or TEWS) are based on several software components deployed in a client-server based infrastructure. The vast majority of systems importantly include desktop-based clients with a graphical user interface (GUI) for the operators in early warning centers. However, in times of cloud computing and ubiquitous computing the use of concepts and paradigms, introduced by continuously evolving approaches in information and communications technology (ICT), have to be considered even for early warning systems (EWS). Based on the experiences and the knowledge gained in three research projects - 'German Indonesian Tsunami Early Warning System' (GITEWS), 'Distant Early Warning System' (DEWS), and 'Collaborative, Complex, and Critical Decision-Support in Evolving Crises' (TRIDEC) - new technologies are exploited to implement a cloud-based and web-based prototype to open up new prospects for EWS. This prototype, named 'TRIDEC Cloud', merges several complementary external and in-house cloud-based services into one platform for automated background computation with graphics processing units (GPU), for web-mapping of hazard specific geospatial data, and for serving relevant functionality to handle, share, and communicate threat specific information in a collaborative and distributed environment. The prototype in its current version addresses tsunami early warning and mitigation. The integration of GPU accelerated tsunami simulation computations have been an integral part of this prototype to foster early warning with on-demand tsunami predictions based on actual source parameters. However, the platform is meant for researchers around the world to make use of the cloud-based GPU computation to analyze other types of geohazards and natural hazards and react upon the computed situation picture with a web-based GUI in a web browser at remote sites. The current website is an early alpha version for demonstration purposes to give the

  1. Hardware and Software Design of FPGA-based PCIe Gen3 interface for APEnet+ network interconnect system

    International Nuclear Information System (INIS)

    Ammendola, R.; Biagioni, A.; Frezza, O.; Cicero, F. Lo; Lonardo, A; Martinelli, M.; Paolucci, P. S.; Pastorelli, E.; Simula, F.; Tosoratto, L.; Vicini, P.; Rossetti, D.

    2015-01-01

    In the attempt to develop an interconnection architecture optimized for hybrid HPC systems dedicated to scientific computing, we designed APEnet+, a point-to-point, low-latency and high-performance network controller supporting 6 fully bidirectional off-board links over a 3D torus topology.The first release of APEnet+ (named V4) was a board based on a 40 nm Altera FPGA, integrating 6 channels at 34 Gbps of raw bandwidth per direction and a PCIe Gen2 x8 host interface. It has been the first-of-its-kind device to implement an RDMA protocol to directly read/write data from/to Fermi and Kepler NVIDIA GPUs using NVIDIA peer-to-peer and GPUDirect RDMA protocols, obtaining real zero-copy GPU-to-GPU transfers over the network.The latest generation of APEnet+ systems (now named V5) implements a PCIe Gen3 x8 host interface on a 28 nm Altera Stratix V FPGA, with multi-standard fast transceivers (up to 14.4 Gbps) and an increased amount of configurable internal resources and hardware IP cores to support main interconnection standard protocols.Herein we present the APEnet+ V5 architecture, the status of its hardware and its system software design. Both its Linux Device Driver and the low-level libraries have been redeveloped to support the PCIe Gen3 protocol, introducing optimizations and solutions based on hardware/software co-design. (paper)

  2. Hardware and Software Design of FPGA-based PCIe Gen3 interface for APEnet+ network interconnect system

    Science.gov (United States)

    Ammendola, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Paolucci, P. S.; Pastorelli, E.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

    2015-12-01

    In the attempt to develop an interconnection architecture optimized for hybrid HPC systems dedicated to scientific computing, we designed APEnet+, a point-to-point, low-latency and high-performance network controller supporting 6 fully bidirectional off-board links over a 3D torus topology. The first release of APEnet+ (named V4) was a board based on a 40 nm Altera FPGA, integrating 6 channels at 34 Gbps of raw bandwidth per direction and a PCIe Gen2 x8 host interface. It has been the first-of-its-kind device to implement an RDMA protocol to directly read/write data from/to Fermi and Kepler NVIDIA GPUs using NVIDIA peer-to-peer and GPUDirect RDMA protocols, obtaining real zero-copy GPU-to-GPU transfers over the network. The latest generation of APEnet+ systems (now named V5) implements a PCIe Gen3 x8 host interface on a 28 nm Altera Stratix V FPGA, with multi-standard fast transceivers (up to 14.4 Gbps) and an increased amount of configurable internal resources and hardware IP cores to support main interconnection standard protocols. Herein we present the APEnet+ V5 architecture, the status of its hardware and its system software design. Both its Linux Device Driver and the low-level libraries have been redeveloped to support the PCIe Gen3 protocol, introducing optimizations and solutions based on hardware/software co-design.

  3. Research on GPU acceleration for Monte Carlo criticality calculation

    International Nuclear Information System (INIS)

    Xu, Q.; Yu, G.; Wang, K.

    2013-01-01

    The Monte Carlo (MC) neutron transport method can be naturally parallelized by multi-core architectures due to the dependency between particles during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular way of parallelism in the field of scientific supercomputing. Thus, this work focuses on the GPU acceleration method for the Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The 'neutron transport step' is introduced to increase the GPU thread occupancy. In order to test the sensitivity of the MC code's complexity, a 1D one-group code and a 3D multi-group general purpose code are respectively transplanted to GPUs, and the acceleration effects are compared. The result of numerical experiments shows considerable acceleration effect of the 'neutron transport step' strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs. (authors)

  4. 2D to 3D conversion implemented in different hardware

    Science.gov (United States)

    Ramos-Diaz, Eduardo; Gonzalez-Huitron, Victor; Ponomaryov, Volodymyr I.; Hernandez-Fragoso, Araceli

    2015-02-01

    Conversion of available 2D data for release in 3D content is a hot topic for providers and for success of the 3D applications, in general. It naturally completely relies on virtual view synthesis of a second view given by original 2D video. Disparity map (DM) estimation is a central task in 3D generation but still follows a very difficult problem for rendering novel images precisely. There exist different approaches in DM reconstruction, among them manually and semiautomatic methods that can produce high quality DMs but they demonstrate hard time consuming and are computationally expensive. In this paper, several hardware implementations of designed frameworks for an automatic 3D color video generation based on 2D real video sequence are proposed. The novel framework includes simultaneous processing of stereo pairs using the following blocks: CIE L*a*b* color space conversions, stereo matching via pyramidal scheme, color segmentation by k-means on an a*b* color plane, and adaptive post-filtering, DM estimation using stereo matching between left and right images (or neighboring frames in a video), adaptive post-filtering, and finally, the anaglyph 3D scene generation. Novel technique has been implemented on DSP TMS320DM648, Matlab's Simulink module over a PC with Windows 7, and using graphic card (NVIDIA Quadro K2000) demonstrating that the proposed approach can be applied in real-time processing mode. The time values needed, mean Similarity Structural Index Measure (SSIM) and Bad Matching Pixels (B) values for different hardware implementations (GPU, Single CPU, and DSP) are exposed in this paper.

  5. Hardware-Accelerated Simulated Radiography

    International Nuclear Information System (INIS)

    Laney, D; Callahan, S; Max, N; Silva, C; Langer, S.; Frank, R

    2005-01-01

    We present the application of hardware accelerated volume rendering algorithms to the simulation of radiographs as an aid to scientists designing experiments, validating simulation codes, and understanding experimental data. The techniques presented take advantage of 32-bit floating point texture capabilities to obtain solutions to the radiative transport equation for X-rays. The hardware accelerated solutions are accurate enough to enable scientists to explore the experimental design space with greater efficiency than the methods currently in use. An unsorted hexahedron projection algorithm is presented for curvilinear hexahedral meshes that produces simulated radiographs in the absorption-only regime. A sorted tetrahedral projection algorithm is presented that simulates radiographs of emissive materials. We apply the tetrahedral projection algorithm to the simulation of experimental diagnostics for inertial confinement fusion experiments on a laser at the University of Rochester

  6. The principles of computer hardware

    CERN Document Server

    Clements, Alan

    2000-01-01

    Principles of Computer Hardware, now in its third edition, provides a first course in computer architecture or computer organization for undergraduates. The book covers the core topics of such a course, including Boolean algebra and logic design; number bases and binary arithmetic; the CPU; assembly language; memory systems; and input/output methods and devices. It then goes on to cover the related topics of computer peripherals such as printers; the hardware aspects of the operating system; and data communications, and hence provides a broader overview of the subject. Its readable, tutorial-based approach makes it an accessible introduction to the subject. The book has extensive in-depth coverage of two microprocessors, one of which (the 68000) is widely used in education. All chapters in the new edition have been updated. Major updates include: powerful software simulations of digital systems to accompany the chapters on digital design; a tutorial-based introduction to assembly language, including many exam...

  7. Computer hardware for radiologists: Part 2

    Directory of Open Access Journals (Sweden)

    Indrajit I

    2010-01-01

    Full Text Available Computers are an integral part of modern radiology equipment. In the first half of this two-part article, we dwelt upon some fundamental concepts regarding computer hardware, covering components like motherboard, central processing unit (CPU, chipset, random access memory (RAM, and memory modules. In this article, we describe the remaining computer hardware components that are of relevance to radiology. "Storage drive" is a term describing a "memory" hardware used to store data for later retrieval. Commonly used storage drives are hard drives, floppy drives, optical drives, flash drives, and network drives. The capacity of a hard drive is dependent on many factors, including the number of disk sides, number of tracks per side, number of sectors on each track, and the amount of data that can be stored in each sector. "Drive interfaces" connect hard drives and optical drives to a computer. The connections of such drives require both a power cable and a data cable. The four most popular "input/output devices" used commonly with computers are the printer, monitor, mouse, and keyboard. The "bus" is a built-in electronic signal pathway in the motherboard to permit efficient and uninterrupted data transfer. A motherboard can have several buses, including the system bus, the PCI express bus, the PCI bus, the AGP bus, and the (outdated ISA bus. "Ports" are the location at which external devices are connected to a computer motherboard. All commonly used peripheral devices, such as printers, scanners, and portable drives, need ports. A working knowledge of computers is necessary for the radiologist if the workflow is to realize its full potential and, besides, this knowledge will prepare the radiologist for the coming innovations in the ′ever increasing′ digital future.

  8. Computer hardware for radiologists: Part 2

    International Nuclear Information System (INIS)

    Indrajit, IK; Alam, A

    2010-01-01

    Computers are an integral part of modern radiology equipment. In the first half of this two-part article, we dwelt upon some fundamental concepts regarding computer hardware, covering components like motherboard, central processing unit (CPU), chipset, random access memory (RAM), and memory modules. In this article, we describe the remaining computer hardware components that are of relevance to radiology. “Storage drive” is a term describing a “memory” hardware used to store data for later retrieval. Commonly used storage drives are hard drives, floppy drives, optical drives, flash drives, and network drives. The capacity of a hard drive is dependent on many factors, including the number of disk sides, number of tracks per side, number of sectors on each track, and the amount of data that can be stored in each sector. “Drive interfaces” connect hard drives and optical drives to a computer. The connections of such drives require both a power cable and a data cable. The four most popular “input/output devices” used commonly with computers are the printer, monitor, mouse, and keyboard. The “bus” is a built-in electronic signal pathway in the motherboard to permit efficient and uninterrupted data transfer. A motherboard can have several buses, including the system bus, the PCI express bus, the PCI bus, the AGP bus, and the (outdated) ISA bus. “Ports” are the location at which external devices are connected to a computer motherboard. All commonly used peripheral devices, such as printers, scanners, and portable drives, need ports. A working knowledge of computers is necessary for the radiologist if the workflow is to realize its full potential and, besides, this knowledge will prepare the radiologist for the coming innovations in the ‘ever increasing’ digital future

  9. GPU Nuclear Corporation's radiation exposure management system

    International Nuclear Information System (INIS)

    Slobodien, M.J.; Bovino, A.A.; Perry, O.R.; Hildebrand, J.E.

    1984-01-01

    GPU Nuclear Corporation has developed a central main frame (IBM 3081) based radiation exposure management system which provides real time and batch transactions for three separate reactor facilities. The structure and function of the data base are discussed. The system's main features include real time on-line radiation work permit generation and personnel exposure tracking; dose accountability as a function of system and component, job type, worker classification, and work location; and personnel dosemeter (TLD and self-reading pocket dosemeters) data processing. The system also carries the qualifications of all radiation workers including RWP training, respiratory protection training, results of respirator fit tests and medical exams. A warning system is used to prevent non-qualified persons from entering controlled areas. The main frame system is interfaced with a variety of mini and micro computer systems for dosemetry, statistical and graphics applications. These are discussed. Some unique dosemetry features which are discussed include assessment of dose for up to 140 parts of the body with dose evaluations at 7,300 and 1000 mg/cm 2 for each part, tracking of MPC hours on a 7 day rolling schedule; automatic pairing of TLD and self-reading pocket dosemeter values, creation and updating of NRC Forms 4 and 5, generation of NRC required 20.407 and Reg Guide 1.16 reports. As of July 1983, over 20 remote on-line stations were in use with plans to add 20-30 more by May 1984. The system provides response times for on-line activities of 2-7 seconds and 23 1/2 hours per day ''up time''. Examples of the various on-line and batch transactions are described

  10. GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model (version 2.52)

    Science.gov (United States)

    Alvanos, Michail; Christoudias, Theodoros

    2017-10-01

    This paper presents an application of GPU accelerators in Earth system modeling. We focus on atmospheric chemical kinetics, one of the most computationally intensive tasks in climate-chemistry model simulations. We developed a software package that automatically generates CUDA kernels to numerically integrate atmospheric chemical kinetics in the global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC), used to study climate change and air quality scenarios. A source-to-source compiler outputs a CUDA-compatible kernel by parsing the FORTRAN code generated by the Kinetic PreProcessor (KPP) general analysis tool. All Rosenbrock methods that are available in the KPP numerical library are supported.Performance evaluation, using Fermi and Pascal CUDA-enabled GPU accelerators, shows achieved speed-ups of 4. 5 × and 20. 4 × , respectively, of the kernel execution time. A node-to-node real-world production performance comparison shows a 1. 75 × speed-up over the non-accelerated application using the KPP three-stage Rosenbrock solver. We provide a detailed description of the code optimizations used to improve the performance including memory optimizations, control code simplification, and reduction of idle time. The accuracy and correctness of the accelerated implementation are evaluated by comparing to the CPU-only code of the application. The median relative difference is found to be less than 0.000000001 % when comparing the output of the accelerated kernel the CPU-only code.The approach followed, including the computational workload division, and the developed GPU solver code can potentially be used as the basis for hardware acceleration of numerous geoscientific models that rely on KPP for atmospheric chemical kinetics applications.

  11. GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC Earth system model (version 2.52

    Directory of Open Access Journals (Sweden)

    M. Alvanos

    2017-10-01

    Full Text Available This paper presents an application of GPU accelerators in Earth system modeling. We focus on atmospheric chemical kinetics, one of the most computationally intensive tasks in climate–chemistry model simulations. We developed a software package that automatically generates CUDA kernels to numerically integrate atmospheric chemical kinetics in the global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC, used to study climate change and air quality scenarios. A source-to-source compiler outputs a CUDA-compatible kernel by parsing the FORTRAN code generated by the Kinetic PreProcessor (KPP general analysis tool. All Rosenbrock methods that are available in the KPP numerical library are supported.Performance evaluation, using Fermi and Pascal CUDA-enabled GPU accelerators, shows achieved speed-ups of 4. 5 ×  and 20. 4 × , respectively, of the kernel execution time. A node-to-node real-world production performance comparison shows a 1. 75 ×  speed-up over the non-accelerated application using the KPP three-stage Rosenbrock solver. We provide a detailed description of the code optimizations used to improve the performance including memory optimizations, control code simplification, and reduction of idle time. The accuracy and correctness of the accelerated implementation are evaluated by comparing to the CPU-only code of the application. The median relative difference is found to be less than 0.000000001 % when comparing the output of the accelerated kernel the CPU-only code.The approach followed, including the computational workload division, and the developed GPU solver code can potentially be used as the basis for hardware acceleration of numerous geoscientific models that rely on KPP for atmospheric chemical kinetics applications.

  12. A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures

    Energy Technology Data Exchange (ETDEWEB)

    Neylon, J., E-mail: jneylon@mednet.ucla.edu; Sheng, K.; Yu, V.; Low, D. A.; Kupelian, P.; Santhanam, A. [Department of Radiation Oncology, University of California Los Angeles, 200 Medical Plaza, #B265, Los Angeles, California 90095 (United States); Chen, Q. [Department of Radiation Oncology, University of Virginia, 1300 Jefferson Park Avenue, Charlottesville, California 22908 (United States)

    2014-10-15

    Purpose: Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. Methods: The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria

  13. A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures

    International Nuclear Information System (INIS)

    Neylon, J.; Sheng, K.; Yu, V.; Low, D. A.; Kupelian, P.; Santhanam, A.; Chen, Q.

    2014-01-01

    Purpose: Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. Methods: The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria

  14. Hunting for hardware changes in data centres

    International Nuclear Information System (INIS)

    Coelho dos Santos, M; Steers, I; Szebenyi, I; Xafi, A; Barring, O; Bonfillou, E

    2012-01-01

    With many servers and server parts the environment of warehouse sized data centres is increasingly complex. Server life-cycle management and hardware failures are responsible for frequent changes that need to be managed. To manage these changes better a project codenamed “hardware hound” focusing on hardware failure trending and hardware inventory has been started at CERN. By creating and using a hardware oriented data set - the inventory - with detailed information on servers and their parts as well as tracking changes to this inventory, the project aims at, for example, being able to discover trends in hardware failure rates.

  15. NLSEmagic: Nonlinear Schrödinger equation multi-dimensional Matlab-based GPU-accelerated integrators using compact high-order schemes

    Science.gov (United States)

    Caplan, R. M.

    2013-04-01

    We present a simple to use, yet powerful code package called NLSEmagic to numerically integrate the nonlinear Schrödinger equation in one, two, and three dimensions. NLSEmagic is a high-order finite-difference code package which utilizes graphic processing unit (GPU) parallel architectures. The codes running on the GPU are many times faster than their serial counterparts, and are much cheaper to run than on standard parallel clusters. The codes are developed with usability and portability in mind, and therefore are written to interface with MATLAB utilizing custom GPU-enabled C codes with the MEX-compiler interface. The packages are freely distributed, including user manuals and set-up files. Catalogue identifier: AEOJ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOJ_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 124453 No. of bytes in distributed program, including test data, etc.: 4728604 Distribution format: tar.gz Programming language: C, CUDA, MATLAB. Computer: PC, MAC. Operating system: Windows, MacOS, Linux. Has the code been vectorized or parallelized?: Yes. Number of processors used: Single CPU, number of GPU processors dependent on chosen GPU card (max is currently 3072 cores on GeForce GTX 690). Supplementary material: Setup guide, Installation guide. RAM: Highly dependent on dimensionality and grid size. For typical medium-large problem size in three dimensions, 4GB is sufficient. Keywords: Nonlinear Schröodinger Equation, GPU, high-order finite difference, Bose-Einstien condensates. Classification: 4.3, 7.7. Nature of problem: Integrate solutions of the time-dependent one-, two-, and three-dimensional cubic nonlinear Schrödinger equation. Solution method: The integrators utilize a fully-explicit fourth-order Runge-Kutta scheme in time

  16. A parallel finite element procedure for contact-impact problems using edge-based smooth triangular element and GPU

    Science.gov (United States)

    Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang

    2018-04-01

    The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.

  17. Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

    Science.gov (United States)

    Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

    2016-04-01

    Advances in graphics processing units' technology towards encompassing parallel architectures [1], comprised of thousands of cores and multiples of parallel threads, provide the foundation in terms of hardware for the rapid processing of various parallel applications regarding seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another can be performed in parallel narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel comparatively, are the: fuzzy k-means clustering with expert knowledge [7] in assigning overall clusters' number; density-based clustering [8]; and a selves-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and

  18. ACL2 Meets the GPU: Formalizing a CUDA-based Parallelizable All-Pairs Shortest Path Algorithm in ACL2

    Directory of Open Access Journals (Sweden)

    David S. Hardin

    2013-04-01

    Full Text Available As Graphics Processing Units (GPUs have gained in capability and GPU development environments have matured, developers are increasingly turning to the GPU to off-load the main host CPU of numerically-intensive, parallelizable computations. Modern GPUs feature hundreds of cores, and offer programming niceties such as double-precision floating point, and even limited recursion. This shift from CPU to GPU, however, raises the question: how do we know that these new GPU-based algorithms are correct? In order to explore this new verification frontier, we formalized a parallelizable all-pairs shortest path (APSP algorithm for weighted graphs, originally coded in NVIDIA's CUDA language, in ACL2. The ACL2 specification is written using a single-threaded object (stobj and tail recursion, as the stobj/tail recursion combination yields the most straightforward translation from imperative programming languages, as well as efficient, scalable executable specifications within ACL2 itself. The ACL2 version of the APSP algorithm can process millions of vertices and edges with little to no garbage generation, and executes at one-sixth the speed of a host-based version of APSP coded in C – a very respectable result for a theorem prover. In addition to formalizing the APSP algorithm (which uses Dijkstra's shortest path algorithm at its core, we have also provided capability that the original APSP code lacked, namely shortest path recovery. Path recovery is accomplished using a secondary ACL2 stobj implementing a LIFO stack, which is proven correct. To conclude the experiment, we ported the ACL2 version of the APSP kernels back to C, resulting in a less than 5% slowdown, and also performed a partial back-port to CUDA, which, surprisingly, yielded a slight performance increase.

  19. A GPU-accelerated semi-implicit fractional-step method for numerical solutions of incompressible Navier-Stokes equations

    Science.gov (United States)

    Ha, Sanghyun; Park, Junshin; You, Donghyun

    2018-01-01

    Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the programing model of Compute Unified Device Architecture (CUDA), which are critical to the bandwidth-bound nature of the present method are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grids. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.

  20. GPU-based real-time triggering in the NA62 experiment

    CERN Document Server

    Ammendola, R.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P.S.; Pastorelli, E.; Piandani, R.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.

    2016-01-01

    Over the last few years the GPGPU (General-Purpose computing on Graphics Processing Units) paradigm represented a remarkable development in the world of computing. Computing for High-Energy Physics is no exception: several works have demonstrated the effectiveness of the integration of GPU-based systems in high level trigger of different experiments. On the other hand the use of GPUs in the low level trigger systems, characterized by stringent real-time constraints, such as tight time budget and high throughput, poses several challenges. In this paper we focus on the low level trigger in the CERN NA62 experiment, investigating the use of real-time computing on GPUs in this synchronous system. Our approach aimed at harvesting the GPU computing power to build in real-time refined physics-related trigger primitives for the RICH detector, as the the knowledge of Cerenkov rings parameters allows to build stringent conditions for data selection at trigger level. Latencies of all components of the trigger chain have...

  1. Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters

    Science.gov (United States)

    Esler, Kenneth

    2011-03-01

    Quantum Monte Carlo (QMC) has proved to be an invaluable tool for predicting the properties of matter from fundamental principles, combining very high accuracy with extreme parallel scalability. By solving the many-body Schrödinger equation through a stochastic projection, it achieves greater accuracy than mean-field methods and better scaling with system size than quantum chemical methods, enabling scientific discovery across a broad spectrum of disciplines. In recent years, graphics processing units (GPUs) have provided a high-performance and low-cost new approach to scientific computing, and GPU-based supercomputers are now among the fastest in the world. The multiple forms of parallelism afforded by QMC algorithms make the method an ideal candidate for acceleration in the many-core paradigm. We present the results of porting the QMCPACK code to run on GPU clusters using the NVIDIA CUDA platform. Using mixed precision on GPUs and MPI for intercommunication, we observe typical full-application speedups of approximately 10x to 15x relative to quad-core CPUs alone, while reproducing the double-precision CPU results within statistical error. We discuss the algorithm modifications necessary to achieve good performance on this heterogeneous architecture and present the results of applying our code to molecules and bulk materials. Supported by the U.S. DOE under Contract No. DOE-DE-FG05-08OR23336 and by the NSF under No. 0904572.

  2. ODYSSEY: A PUBLIC GPU-BASED CODE FOR GENERAL RELATIVISTIC RADIATIVE TRANSFER IN KERR SPACETIME

    Energy Technology Data Exchange (ETDEWEB)

    Pu, Hung-Yi [Institute of Astronomy and Astrophysics, Academia Sinica, 11F of Astronomy-Mathematics Building, AS/NTU No. 1, Taipei 10617, Taiwan (China); Yun, Kiyun; Yoon, Suk-Jin [Department of Astronomy and Center for Galaxy Evolution Research, Yonsei University, Seoul 120-749 (Korea, Republic of); Younsi, Ziri [Institut für Theoretische Physik, Max-von-Laue-Straße 1, D-60438 Frankfurt am Main (Germany)

    2016-04-01

    General relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime are an essential tool for determining the images, spectra, and light curves from matter in the vicinity of black holes. Such studies are especially important for ongoing and upcoming millimeter/submillimeter very long baseline interferometry observations of the supermassive black holes at the centers of Sgr A* and M87. To this end we introduce Odyssey, a graphics processing unit (GPU) based code for ray tracing and radiative transfer in the Kerr spacetime. On a single GPU, the performance of Odyssey can exceed 1 ns per photon, per Runge–Kutta integration step. Odyssey is publicly available, fast, accurate, and flexible enough to be modified to suit the specific needs of new users. Along with a Graphical User Interface powered by a video-accelerated display architecture, we also present an educational software tool, Odyssey-Edu, for showing in real time how null geodesics around a Kerr black hole vary as a function of black hole spin and angle of incidence onto the black hole.

  3. GPU acceleration of Eulerian-Lagrangian particle-laden turbulent flow simulations

    Science.gov (United States)

    Richter, David; Sweet, James; Thain, Douglas

    2017-11-01

    The Lagrangian point-particle approximation is a popular numerical technique for representing dispersed phases whose properties can substantially deviate from the local fluid. In many cases, particularly in the limit of one-way coupled systems, large numbers of particles are desired; this may be either because many physical particles are present (e.g. LES of an entire cloud), or because the use of many particles increases statistical convergence (e.g. high-order statistics). Solving the trajectories of very large numbers of particles can be problematic in traditional MPI implementations, however, and this study reports the benefits of using graphical processing units (GPUs) to integrate the particle equations of motion while preserving the original MPI version of the Eulerian flow solver. It is found that GPU acceleration becomes cost effective around one million particles, and performance enhancements of up to 15x can be achieved when O(108) particles are computed on the GPU rather than the CPU cluster. Optimizations and limitations will be discussed, as will prospects for expanding to two- and four-way coupled systems. ONR Grant No. N00014-16-1-2472.

  4. GPU-Based Block-Wise Nonlocal Means Denoising for 3D Ultrasound Images

    Directory of Open Access Journals (Sweden)

    Liu Li

    2013-01-01

    Full Text Available Speckle suppression plays an important role in improving ultrasound (US image quality. While lots of algorithms have been proposed for 2D US image denoising with remarkable filtering quality, there is relatively less work done on 3D ultrasound speckle suppression, where the whole volume data rather than just one frame needs to be considered. Then, the most crucial problem with 3D US denoising is that the computational complexity increases tremendously. The nonlocal means (NLM provides an effective method for speckle suppression in US images. In this paper, a programmable graphic-processor-unit- (GPU- based fast NLM filter is proposed for 3D ultrasound speckle reduction. A Gamma distribution noise model, which is able to reliably capture image statistics for Log-compressed ultrasound images, was used for the 3D block-wise NLM filter on basis of Bayesian framework. The most significant aspect of our method was the adopting of powerful data-parallel computing capability of GPU to improve the overall efficiency. Experimental results demonstrate that the proposed method can enormously accelerate the algorithm.

  5. Papaya Tree Detection with UAV Images Using a GPU-Accelerated Scale-Space Filtering Method

    Directory of Open Access Journals (Sweden)

    Hao Jiang

    2017-07-01

    Full Text Available The use of unmanned aerial vehicles (UAV can allow individual tree detection for forest inventories in a cost-effective way. The scale-space filtering (SSF algorithm is commonly used and has the capability of detecting trees of different crown sizes. In this study, we made two improvements with regard to the existing method and implementations. First, we incorporated SSF with a Lab color transformation to reduce over-detection problems associated with the original luminance image. Second, we ported four of the most time-consuming processes to the graphics processing unit (GPU to improve computational efficiency. The proposed method was implemented using PyCUDA, which enabled access to NVIDIA’s compute unified device architecture (CUDA through high-level scripting of the Python language. Our experiments were conducted using two images captured by the DJI Phantom 3 Professional and a most recent NVIDIA GPU GTX1080. The resulting accuracy was high, with an F-measure larger than 0.94. The speedup achieved by our parallel implementation was 44.77 and 28.54 for the first and second test image, respectively. For each 4000 × 3000 image, the total runtime was less than 1 s, which was sufficient for real-time performance and interactive application.

  6. GPU Enhancement of the Trigger to Extend Physics Reach at the LHC

    CERN Document Server

    Lujan, P.; Hunt, A.; Jindal, P.; LeGresley, P.

    2014-01-01

    At the Large Hadron Collider (LHC), the trigger systems for the detectors must be able to process a very large amount of data in a very limited amount of time, so that the nominal collision rate of 40 MHz can be reduced to a data rate that can be stored and processed in a reasonable amount of time. This need for high performance places very stringent requirements on the complexity of the algorithms that can be used for identifying events of interest in the trigger system, which potentially limits the ability to trigger on signatures of various new physics models. In this paper, we present an alternative tracking algorithm, based on the Hough transform, which avoids many of the problems associated with the standard combinatorial track finding currently used. The Hough transform is also well-adapted for Graphics Processing Unit (GPU)-based computing, and such GPU-based systems could be easily integrated into the existing High-Level Trigger (HLT). This algorithm offers the ability to trigger on topological signa...

  7. Multi–GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL

    Directory of Open Access Journals (Sweden)

    Jan Masek

    2016-06-01

    Full Text Available Using modern Graphic Processing Units (GPUs becomes very useful for computing complex and time consuming processes. GPUs provide high–performance computation capabilities with a good price. This paper deals with a multi–GPU OpenCL and CUDA implementations of k–Nearest Neighbor (k–NN algorithm. This work compares performances of OpenCLand CUDA implementations where each of them is suitable for different number of used attributes. The proposed CUDA algorithm achieves acceleration up to 880x in comparison witha single thread CPU version. The common k-NN was modified to be faster when the lower number of k neighbors is set. The performance of algorithm was verified with two GPUs dual-core NVIDIA GeForce GTX 690 and CPU Intel Core i7 3770 with 4.1 GHz frequency. The results of speed up were measured for one GPU, two GPUs, three and four GPUs. We performed several tests with data sets containing up to 4 million elements with various number of attributes.

  8. Implementation of GPU parallel equilibrium reconstruction for plasma control in EAST

    Energy Technology Data Exchange (ETDEWEB)

    Huang, Yao, E-mail: yaohuang@ipp.ac.cn [Institute of Plasma Physics, Chinese Academy of Sciences, Hefei (China); Xiao, B.J. [Institute of Plasma Physics, Chinese Academy of Sciences, Hefei (China); School of Nuclear Science & Technology, University of Science & Technology of China (China); Luo, Z.P.; Yuan, Q.P.; Pei, X.F. [Institute of Plasma Physics, Chinese Academy of Sciences, Hefei (China); Yue, X.N. [School of Nuclear Science & Technology, University of Science & Technology of China (China)

    2016-11-15

    Highlights: • We described parallel equilibrium reconstruction code P-EFIT running on GPU was integrated with EAST plasma control system. • Compared with RT-EFIT used in EAST, P-EFIT has better spatial resolution and full algorithm of EFIT per iteration. • With the data interface through RFM, 65 × 65 spatial grids P-EFIT can satisfy the accuracy and time feasibility requirements for plasma control. • Successful control using ISOFLUX/P-EFIT was established in the dedicated experiment during the EAST 2014 campaign. • This work is a stepping-stone towards versatile ISOFLUX/P-EFIT control, such as real-time equilibrium reconstruction with more diagnostics. - Abstract: Implementation of P-EFIT code for plasma control in EAST is described. P-EFIT is based on the EFIT framework, but built with the CUDA™ architecture to take advantage of massively parallel Graphical Processing Unit (GPU) cores to significantly accelerate the computation. 65 × 65 grid size P-EFIT can complete one reconstruction iteration in 300 μs, with one iteration strategy, it can satisfy the needs of real-time plasma shape control. Data interface between P-EFIT and PCS is realized and developed by transferring data through RFM. First application of P-EFIT to discharge control in EAST is described.

  9. Large-Scale Multi-Dimensional Document Clustering on GPU Clusters

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL

    2010-01-01

    Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteennode GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.

  10. Reconstruction of the neutron spectrum using an artificial neural network in CPU and GPU

    International Nuclear Information System (INIS)

    Hernandez D, V. M.; Moreno M, A.; Ortiz L, M. A.; Vega C, H. R.; Alonso M, O. E.

    2016-10-01

    The increase in computing power in personal computers has been increasing, computers now have several processors in the CPU and in addition multiple CUDA cores in the graphics processing unit (GPU); both systems can be used individually or combined to perform scientific computation without resorting to processor or supercomputing arrangements. The Bonner sphere spectrometer is the most commonly used multi-element system for neutron detection purposes and its associated spectrum. Each sphere-detector combination gives a particular response that depends on the energy of the neutrons, and the total set of these responses is known like the responses matrix Rφ(E). Thus, the counting rates obtained with each sphere and the neutron spectrum is related to the Fredholm equation in its discrete version. For the reconstruction of the spectrum has a system of poorly conditioned equations with an infinite number of solutions and to find the appropriate solution, it has been proposed the use of artificial intelligence through neural networks with different platforms CPU and GPU. (Author)

  11. GPU-Accelerated Foreground Segmentation and Labeling for Real-Time Video Surveillance

    Directory of Open Access Journals (Sweden)

    Wei Song

    2016-09-01

    Full Text Available Real-time and accurate background modeling is an important researching topic in the fields of remote monitoring and video surveillance. Meanwhile, effective foreground detection is a preliminary requirement and decision-making basis for sustainable energy management, especially in smart meters. The environment monitoring results provide a decision-making basis for energy-saving strategies. For real-time moving object detection in video, this paper applies a parallel computing technology to develop a feedback foreground–background segmentation method and a parallel connected component labeling (PCCL algorithm. In the background modeling method, pixel-wise color histograms in graphics processing unit (GPU memory is generated from sequential images. If a pixel color in the current image does not locate around the peaks of its histogram, it is segmented as a foreground pixel. From the foreground segmentation results, a PCCL algorithm is proposed to cluster the foreground pixels into several groups in order to distinguish separate blobs. Because the noisy spot and sparkle in the foreground segmentation results always contain a small quantity of pixels, the small blobs are removed as noise in order to refine the segmentation results. The proposed GPU-based image processing algorithms are implemented using the compute unified device architecture (CUDA toolkit. The testing results show a significant enhancement in both speed and accuracy.

  12. GPU-accelerated denoising of 3D magnetic resonance images

    Energy Technology Data Exchange (ETDEWEB)

    Howison, Mark; Wes Bethel, E.

    2014-05-29

    The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.

  13. GPU accelerated CT reconstruction for clinical use: quality driven performance

    Science.gov (United States)

    Vaz, Michael S.; Sneyders, Yuri; McLin, Matthew; Ricker, Alan; Kimpe, Tom

    2007-03-01

    We present performance and quality analysis of GPU accelerated FDK filtered backprojection for cone beam computed tomography (CBCT) reconstruction. Our implementation of the FDK CT reconstruction algorithm does not compromise fidelity at any stage and yields a result that is within 1 HU of a reference C++ implementation. Our streaming implementation is able to perform reconstruction as the images are acquired; it addresses low latency as well as fast throughput, which are key considerations for a "real-time" design. Further, it is scaleable to multiple GPUs for increased performance. The implementation does not place any constraints on image acquisition; it works effectively for arbitrary angular coverage with arbitrary angular spacing. As such, this GPU accelerated CT reconstruction solution may easily be used with scanners that are already deployed. We are able to reconstruct a 512 x 512 x 340 volume from 625 projections, each sized 1024 x 768, in less than 50 seconds. The quoted 50 second timing encompasses the entire reconstruction using bilinear interpolation and includes filtering on the CPU, uploading the filtered projections to the GPU, and also downloading the reconstructed volume from GPU memory to system RAM.

  14. GPU based contouring method on grid DEM data

    Science.gov (United States)

    Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong

    2017-08-01

    This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. The previous contouring approaches often use CPU to construct a finite element mesh from the raw DEM data, and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be heavily CPU-costing and time-consuming. Meanwhile the generated contours would be unsmooth if the raw data is sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then in the same frame, segments are extracted from the triangles by the geometry shader, and translated to the CPU-side with an internal order in the GPU's transform feedback stage. Finally we propose a "Grid Sorting" algorithm to achieve the continuous contour lines by travelling the segments only once. Our method makes use of multiple stages of GPU pipeline for computation, which can generate smooth contour lines, and is significantly faster than the previous CPU approaches. The algorithm can be easily implemented with OpenGL 3.3 API or higher on consumer-level PCs.

  15. Large Scale Simulations of the Euler Equations on GPU Clusters

    KAUST Repository

    Liebmann, Manfred; Douglas, Craig C.; Haase, Gundolf; Horvá th, Zoltá n

    2010-01-01

    The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one

  16. STEM image simulation with hybrid CPU/GPU programming

    International Nuclear Information System (INIS)

    Yao, Y.; Ge, B.H.; Shen, X.; Wang, Y.G.; Yu, R.C.

    2016-01-01

    STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. - Highlights: • STEM image simulation is achieved by hybrid CPU/GPU programming under parallel algorithm architecture to speed up the calculation in the personal computer (PC). • In order to fully utilize the calculation power of the PC, the simulation is performed by GPU core and multi-CPU cores at the same time so efficiency is improved significantly. • GaSb and artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. The results reveal some unintuitive phenomena about the contrast variation with the atom numbers.

  17. GPU accelerated likelihoods for stereo-based articulated tracking

    DEFF Research Database (Denmark)

    Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny

    2010-01-01

    than a traditional CPU implementation. We explain the non-intuitive steps required to attain an optimized GPU implementation, where the dominant part is to hide the memory latency effectively. Benchmarks show that computations which previously required several minutes, are now performed in few seconds....

  18. STEM image simulation with hybrid CPU/GPU programming

    Energy Technology Data Exchange (ETDEWEB)

    Yao, Y., E-mail: yaoyuan@iphy.ac.cn; Ge, B.H.; Shen, X.; Wang, Y.G.; Yu, R.C.

    2016-07-15

    STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. - Highlights: • STEM image simulation is achieved by hybrid CPU/GPU programming under parallel algorithm architecture to speed up the calculation in the personal computer (PC). • In order to fully utilize the calculation power of the PC, the simulation is performed by GPU core and multi-CPU cores at the same time so efficiency is improved significantly. • GaSb and artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. The results reveal some unintuitive phenomena about the contrast variation with the atom numbers.

  19. The GPU implementation of micro - Doppler period estimation

    Science.gov (United States)

    Yang, Liyuan; Wang, Junling; Bi, Ran

    2018-03-01

    Aiming at the problem that the computational complexity and the deficiency of real-time of the wideband radar echo signal, a program is designed to improve the performance of real-time extraction of micro-motion feature in this paper based on the CPU-GPU heterogeneous parallel structure. Firstly, we discuss the principle of the micro-Doppler effect generated by the rolling of the scattering points on the orbiting satellite, analyses how to use Kalman filter to compensate the translational motion of tumbling satellite and how to use the joint time-frequency analysis and inverse Radon transform to extract the micro-motion features from the echo after compensation. Secondly, the advantages of GPU in terms of real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a program flow based on GPU to extract the micro-motion feature from the radar echo signal of rolling satellite is designed. At the end of the article the results of extraction are given to verify the correctness of the program and algorithm.

  20. Qualification of software and hardware

    International Nuclear Information System (INIS)

    Gossner, S.; Schueller, H.; Gloee, G.

    1987-01-01

    The qualification of on-line process control equipment is subdivided into three areas: 1) materials and structural elements; 2) on-line process-control components and devices; 3) electrical systems (reactor protection and confinement system). Microprocessor-aided process-control equipment are difficult to verify for failure-free function owing to the complexity of the functional structures of the hardware and to the variety of the software feasible for microprocessors. Hence, qualification will make great demands on the inspecting expert. (DG) [de

  1. Mobile Ultrasound Plane Wave Beamforming on iPhone or iPad using Metal- based GPU Processing

    Science.gov (United States)

    Hewener, Holger J.; Tretbar, Steffen H.

    Mobile and cost effective ultrasound devices are being used in point of care scenarios or the drama room. To reduce the costs of such devices we already presented the possibilities of consumer devices like the Apple iPad for full signal processing of raw data for ultrasound image generation. Using technologies like plane wave imaging to generate a full image with only one excitation/reception event the acquisition times and power consumption of ultrasound imaging can be reduced for low power mobile devices based on consumer electronics realizing the transition from FPGA or ASIC based beamforming into more flexible software beamforming. The massive parallel beamforming processing can be done with the Apple framework "Metal" for advanced graphics and general purpose GPU processing for the iOS platform. We were able to integrate the beamforming reconstruction into our mobile ultrasound processing application with imaging rates up to 70 Hz on iPad Air 2 hardware.

  2. Comparison of 2 accelerators of Monte Carlo radiation transport calculations, NVIDIA tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor: a case study for X-ray CT Imaging Dose calculation

    International Nuclear Information System (INIS)

    Liu, T.; Xu, X.G.; Carothers, C.D.

    2013-01-01

    Hardware accelerators are currently becoming increasingly important in boosting high performance computing systems. In this study, we tested the performance of two accelerator models, NVIDIA Tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor, using a new Monte Carlo photon transport package called ARCHER-CT we have developed for fast CT imaging dose calculation. The package contains three code variants, ARCHER-CT(CPU), ARCHER-CT(GPU) and ARCHER-CT(COP) to run in parallel on the multi-core CPU, GPU and coprocessor architectures respectively. A detailed GE LightSpeed Multi-Detector Computed Tomography (MDCT) scanner model and a family of voxel patient phantoms were included in the code to calculate absorbed dose to radiosensitive organs under specified scan protocols. The results from ARCHER agreed well with those from the production code Monte Carlo N-Particle eXtended (MCNPX). It was found that all the code variants were significantly faster than the parallel MCNPX running on 12 MPI processes, and that the GPU and coprocessor performed equally well, being 2.89-4.49 and 3.01-3.23 times faster than the parallel ARCHER-CT(CPU) running with 12 hyper-threads. (authors)

  3. Surface moisture measurement system hardware acceptance test report

    Energy Technology Data Exchange (ETDEWEB)

    Ritter, G.A., Westinghouse Hanford

    1996-05-28

    This document summarizes the results of the hardware acceptance test for the Surface Moisture Measurement System (SMMS). This test verified that the mechanical and electrical features of the SMMS functioned as designed and that the unit is ready for field service. The bulk of hardware testing was performed at the 306E Facility in the 300 Area and the Fuels and Materials Examination Facility in the 400 Area. The SMMS was developed primarily in support of Tank Waste Remediation System (TWRS) Safety Programs for moisture measurement in organic and ferrocyanide watch list tanks.

  4. Door Hardware and Installations; Carpentry: 901894.

    Science.gov (United States)

    Dade County Public Schools, Miami, FL.

    The curriculum guide outlines a course designed to provide instruction in the selection, preparation, and installation of hardware for door assemblies. The course is divided into five blocks of instruction (introduction to doors and hardware, door hardware, exterior doors and jambs, interior doors and jambs, and a quinmester post-test) totaling…

  5. Fast Simulation of Dynamic Ultrasound Images Using the GPU.

    Science.gov (United States)

    Storve, Sigurd; Torp, Hans

    2017-10-01

    Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.

  6. Avaliação de desempenho e consumo energético para configurações de Wavefront pools de uma GPU AMD

    Directory of Open Access Journals (Sweden)

    Ariel Gustavo Zuquello

    2016-07-01

    Full Text Available O uso de sistemas heterogêneos CPU-GPU para atender à crescente demanda por aplicações com grande paralelismo de dados resulta na necessidade de estudar e avaliar tais arquiteturas para melhorá-las continuamente. Neste artigo foram feitas simulações da execução de uma suíte de benchmark em uma GPU AMD ATI RadeonTM HD 7970, de modo a avaliar o impacto sobre o desempenho e o consumo energético quando alterado o número de Wavefront Pools presentes em cada compute unit da GPU, que é 4 por padrão. O resultado mais significante evidencia um aumento de velocidade de cerca de 5,7% para a configuração com duas Wavefront Pools em conjunto com um aumento no consumo de energia de cerca de 5,1%. Todavia, as outras configurações avaliadas também representam opções para diferentes tipos de necessidades, conforme a categoria de demanda computacional.Palavras-chave: Sistemas heterogêneos. Simulações. Desempenho.Performance evaluation and energy consumption for settings of Wavefront pools of a GPU AMDAbstractThe use of CPU-GPU heterogeneous systems to meet the growing demand for applications with large data parallelism results in the need to study and evaluate these architectures in order to improve them continuously. In this paper we made simulations of running a benchmark suite on an AMD GPU ATI RadeonTM HD 7970 in order to assess the impact on performance and power consumption when tuning the number of Wavefront Pools present in each GPU compute unit, which is 4 by default. The most significant result shows a speedup of about 5.7% for configuration with two Wavefront Pools in conjunction with an increase of about 5.1% in the energy consumption. However, the other evaluated configuration also represent options for different kinds of needs, according to   the  computational demand.Keyworks: Heterogeneous systems. Simulation. Performance.

  7. SU-E-T-180: Fano Cavity Test of Proton Transport in Monte Carlo Codes Running On GPU and Xeon Phi

    International Nuclear Information System (INIS)

    Sterpin, E; Sorriaux, J; Souris, K; Lee, J; Vynckier, S; Schuemann, J; Paganetti, H; Jia, X; Jiang, S

    2014-01-01

    Purpose: In proton dose calculation, clinically compatible speeds are now achieved with Monte Carlo codes (MC) that combine 1) adequate simplifications in the physics of transport and 2) the use of hardware architectures enabling massive parallel computing (like GPUs). However, the uncertainties related to the transport algorithms used in these codes must be kept minimal. Such algorithms can be checked with the so-called “Fano cavity test”. We implemented the test in two codes that run on specific hardware: gPMC on an nVidia GPU and MCsquare on an Intel Xeon Phi (60 cores). Methods: gPMC and MCsquare are designed for transporting protons in CT geometries. Both codes use the method of fictitious interaction to sample the step-length for each transport step. The considered geometry is a water cavity (2×2×0.2 cm 3 , 0.001 g/cm 3 ) in a 10×10×50 cm 3 water phantom (1 g/cm 3 ). CPE in the cavity is established by generating protons over the phantom volume with a uniform momentum (energy E) and a uniform intensity per unit mass I. Assuming no nuclear reactions and no generation of other secondaries, the computed cavity dose should equal IE, according to Fano's theorem. Both codes were tested for initial proton energies of 50, 100, and 200 MeV. Results: For all energies, gPMC and MCsquare are within 0.3 and 0.2 % of the theoretical value IE, respectively (0.1% standard deviation). Single-precision computations (instead of double) increased the error by about 0.1% in MCsquare. Conclusion: Despite the simplifications in the physics of transport, both gPMC and MCsquare successfully pass the Fano test. This ensures optimal accuracy of the codes for clinical applications within the uncertainties on the underlying physical models. It also opens the path to other applications of these codes, like the simulation of ion chamber response

  8. SU-E-T-180: Fano Cavity Test of Proton Transport in Monte Carlo Codes Running On GPU and Xeon Phi

    Energy Technology Data Exchange (ETDEWEB)

    Sterpin, E; Sorriaux, J; Souris, K; Lee, J; Vynckier, S [Universite catholique de Louvain, Brussels, Brussels (Belgium); Schuemann, J; Paganetti, H [Massachusetts General Hospital, Boston, MA (United States); Jia, X; Jiang, S [The University of Texas Southwestern Medical Ctr, Dallas, TX (United States)

    2014-06-01

    Purpose: In proton dose calculation, clinically compatible speeds are now achieved with Monte Carlo codes (MC) that combine 1) adequate simplifications in the physics of transport and 2) the use of hardware architectures enabling massive parallel computing (like GPUs). However, the uncertainties related to the transport algorithms used in these codes must be kept minimal. Such algorithms can be checked with the so-called “Fano cavity test”. We implemented the test in two codes that run on specific hardware: gPMC on an nVidia GPU and MCsquare on an Intel Xeon Phi (60 cores). Methods: gPMC and MCsquare are designed for transporting protons in CT geometries. Both codes use the method of fictitious interaction to sample the step-length for each transport step. The considered geometry is a water cavity (2×2×0.2 cm{sup 3}, 0.001 g/cm{sup 3}) in a 10×10×50 cm{sup 3} water phantom (1 g/cm{sup 3}). CPE in the cavity is established by generating protons over the phantom volume with a uniform momentum (energy E) and a uniform intensity per unit mass I. Assuming no nuclear reactions and no generation of other secondaries, the computed cavity dose should equal IE, according to Fano's theorem. Both codes were tested for initial proton energies of 50, 100, and 200 MeV. Results: For all energies, gPMC and MCsquare are within 0.3 and 0.2 % of the theoretical value IE, respectively (0.1% standard deviation). Single-precision computations (instead of double) increased the error by about 0.1% in MCsquare. Conclusion: Despite the simplifications in the physics of transport, both gPMC and MCsquare successfully pass the Fano test. This ensures optimal accuracy of the codes for clinical applications within the uncertainties on the underlying physical models. It also opens the path to other applications of these codes, like the simulation of ion chamber response.

  9. High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA

    NARCIS (Netherlands)

    Belleman, R.G.; Bédorf, J.; Portegies Zwart, S.F.

    2008-01-01

    We present the results of gravitational direct N-body simulations using the graphics processing unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in "Compute Unified Device Architecture" (CUDA) using the GPU to

  10. A High Performance QDWH-SVD Solver using Hardware Accelerators

    KAUST Repository

    Sukkari, Dalal E.; Ltaief, Hatem; Keyes, David E.

    2015-01-01

    few digits of accuracy, compared to the full double precision floating point arithmetic. We further leverage the single GPU QDWH-SVD implementation by introducing the first multi-GPU SVD solver to study the scalability of the QDWH-SVD framework.

  11. Hardware Support for Dynamic Languages

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; Karlsson, Sven; Probst, Christian W.

    2011-01-01

    In recent years, dynamic programming languages have enjoyed increasing popularity. For example, JavaScript has become one of the most popular programming languages on the web. As the complexity of web applications is growing, compute-intensive workloads are increasingly handed off to the client...... side. While a lot of effort is put in increasing the performance of web browsers, we aim for multicore systems with dedicated cores to effectively support dynamic languages. We have designed Tinuso, a highly flexible core for experimentation that is optimized for high performance when implemented...... on FPGA. We composed a scalable multicore configuration where we study how hardware support for software speculation can be used to increase the performance of dynamic languages....

  12. Blaze-DEMGPU: Modular high performance DEM framework for the GPU architecture

    Directory of Open Access Journals (Sweden)

    Nicolin Govender

    2016-01-01

    Full Text Available Blaze-DEMGPU is a modular GPU based discrete element method (DEM framework that supports polyhedral shaped particles. The high level performance is attributed to the light weight and Single Instruction Multiple Data (SIMD that the GPU architecture offers. Blaze-DEMGPU offers suitable algorithms to conduct DEM simulations on the GPU and these algorithms can be extended and modified. Since a large number of scientific simulations are particle based, many of the algorithms and strategies for GPU implementation present in Blaze-DEMGPU can be applied to other fields. Blaze-DEMGPU will make it easier for new researchers to use high performance GPU computing as well as stimulate wider GPU research efforts by the DEM community.

  13. Dynamic Resolution in GPU-Accelerated Volume Rendering to Autostereoscopic Multiview Lenticular Displays

    Directory of Open Access Journals (Sweden)

    Daniel Ruijters

    2008-09-01

    Full Text Available The generation of multiview stereoscopic images of large volume rendered data demands an enormous amount of calculations. We propose a method for hardware accelerated volume rendering of medical data sets to multiview lenticular displays, offering interactive manipulation throughout. The method is based on buffering GPU-accelerated direct volume rendered visualizations of the individual views from their respective focal spot positions, and composing the output signal for the multiview lenticular screen in a second pass. This compositing phase is facilitated by the fact that the view assignment per subpixel is static, and therefore can be precomputed. We decoupled the resolution of the individual views from the resolution of the composited signal, and adjust the resolution on-the-fly, depending on the available processing resources, in order to maintain interactive refresh rates. The optimal resolution for the volume rendered views is determined by means of an analysis of the lattice of the output signal for the lenticular screen in the Fourier domain.

  14. Accelerating Molecular Dynamic Simulation on Graphics Processing Units

    Science.gov (United States)

    Friedrichs, Mark S.; Eastman, Peter; Vaidyanathan, Vishal; Houston, Mike; Legrand, Scott; Beberg, Adam L.; Ensign, Daniel L.; Bruns, Christopher M.; Pande, Vijay S.

    2009-01-01

    We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. PMID:19191337

  15. A Generic High-performance GPU-based Library for PDE solvers

    DEFF Research Database (Denmark)

    Glimberg, Stefan Lemvig; Engsig-Karup, Allan Peter

    , the privilege of high-performance parallel computing is now in principle accessible for many scientific users, no matter their economic resources. Though being highly effective units, GPUs and parallel architectures in general, pose challenges for software developers to utilize their efficiency. Sequential...... legacy codes are not always easily parallelized and the time spent on conversion might not pay o in the end. We present a highly generic C++ library for fast assembling of partial differential equation (PDE) solvers, aiming at utilizing the computational resources of GPUs. The library requires a minimum...... of GPU computing knowledge, while still oering the possibility to customize user-specic solvers at kernel level if desired. Spatial dierential operators are based on matrix free exible order nite dierence approximations. These matrix free operators minimize both memory consumption and main memory access...

  16. Gpufit: An open-source toolkit for GPU-accelerated curve fitting.

    Science.gov (United States)

    Przybylski, Adrian; Thiel, Björn; Keller-Findeisen, Jan; Stock, Bernd; Bates, Mark

    2017-11-16

    We present a general purpose, open-source software library for estimation of non-linear parameters by the Levenberg-Marquardt algorithm. The software, Gpufit, runs on a Graphics Processing Unit (GPU) and executes computations in parallel, resulting in a significant gain in performance. We measured a speed increase of up to 42 times when comparing Gpufit with an identical CPU-based algorithm, with no loss of precision or accuracy. Gpufit is designed such that it is easily incorporated into existing applications or adapted for new ones. Multiple software interfaces, including to C, Python, and Matlab, ensure that Gpufit is accessible from most programming environments. The full source code is published as an open source software repository, making its function transparent to the user and facilitating future improvements and extensions. As a demonstration, we used Gpufit to accelerate an existing scientific image analysis package, yielding significantly improved processing times for super-resolution fluorescence microscopy datasets.

  17. GPU acceleration for digitally reconstructed radiographs using bindless texture objects and CUDA/OpenGL interoperability.

    Science.gov (United States)

    Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I

    2015-01-01

    This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the giant computing power of the current commodity graphics processors to accelerate the generation of high resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. The benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048(2) and 4096(2) at interactive and semi interactive frame-rates using an NVIDIA GeForce 970 GTX device.

  18. GPU-accelerated few-view CT reconstruction using the OSC and TV techniques

    Energy Technology Data Exchange (ETDEWEB)

    Matenine, Dmitri [Montreal Univ., QC (Canada). Dept. de Physique; Hissoiny, Sami [Ecole Polytechnique de Montreal, QC (Canada). Dept. de Genie Informatique et Genie Logiciel; Despres, Philippe [Centre Hospitalier Univ. de Quebec, QC (Canada). Dept. de Radio-Oncologie

    2011-07-01

    The present work proposes a promising iterative reconstruction technique designed specifically for X-ray transmission computed tomography (CT). The main objective is to reduce diagnostic radiation dose through the reduction of the number of CT projections, while preserving image quality. The second objective is to provide a fast implementation compatible with clinical activities. The proposed tomographic reconstruction technique is a combination of the Ordered Subsets Convex (OSC) algorithm and the Total Variation minimization (TV) regularization technique. The results in terms of image quality and computational speed are discussed. Using this technique, it was possible to obtain reconstructed slices of relatively good quality with as few as 100 projections, leading to potential dose reduction factors of up to an order of magnitude depending on the application. The algorithm was implemented on a Graphical Processing Unit (GPU) and yielded reconstruction times of approximately 185 ms per slice. (orig.)

  19. Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

    International Nuclear Information System (INIS)

    Babich, Ronald; Clark, Michael; Joo, Balint

    2010-01-01

    Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the '9g' cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.

  20. Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

    Energy Technology Data Exchange (ETDEWEB)

    Ronald Babich, Michael Clark, Balint Joo

    2010-11-01

    Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.