National Aeronautics and Space Administration — The objective of this project was to use GPU enabled computing to accelerate the analyses of heat transfer and thermal effects. Graphical processing unit (GPU)...
Exploring Graphics Processing Unit (GPU Resource Sharing Efficiency for High Performance Computing
Directory of Open Access Journals (Sweden)
Teng Li
2013-11-01
Full Text Available The increasing incorporation of Graphics Processing Units (GPUs as accelerators has been one of the forefront High Performance Computing (HPC trends and provides unprecedented performance; however, the prevalent adoption of the Single-Program Multiple-Data (SPMD programming model brings with it challenges of resource underutilization. In other words, under SPMD, every CPU needs GPU capability available to it. However, since CPUs generally outnumber GPUs, the asymmetric resource distribution gives rise to overall computing resource underutilization. In this paper, we propose to efficiently share the GPU under SPMD and formally define a series of GPU sharing scenarios. We provide performance-modeling analysis for each sharing scenario with accurate experimentation validation. With the modeling basis, we further conduct experimental studies to explore potential GPU sharing efficiency improvements from multiple perspectives. Both further theoretical and experimental GPU sharing performance analysis and results are presented. Our results not only demonstrate the significant performance gain for SPMD programs with the proposed efficient GPU sharing, but also the further improved sharing efficiency with the optimization techniques based on our accurate modeling.
GPU computing and applications
See, Simon
2015-01-01
This book presents a collection of state of the art research on GPU Computing and Application. The major part of this book is selected from the work presented at the 2013 Symposium on GPU Computing and Applications held in Nanyang Technological University, Singapore (Oct 9, 2013). Three major domains of GPU application are covered in the book including (1) Engineering design and simulation; (2) Biomedical Sciences; and (3) Interactive & Digital Media. The book also addresses the fundamental issues in GPU computing with a focus on big data processing. Researchers and developers in GPU Computing and Applications will benefit from this book. Training professionals and educators can also benefit from this book to learn the possible application of GPU technology in various areas.
GPU Computing Gems Emerald Edition
Hwu, Wen-mei W
2011-01-01
".the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010 Graphics processing units (GPUs) can do much more than render graphics. Scientists and researchers increasingly look to GPUs to improve the efficiency and performance of computationally-intensive experiments across a range of disciplines. GPU Computing Gems: Emerald Edition brings their techniques to you, showcasing GPU-based solutions including: Black hole simulations with CUDA GPU-accelerated computation and interactive display of
Distributed GPU Computing in GIScience
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Transactions on, 9(3), 378-394. 2. Li, J., Jiang, Y., Yang, C., Huang, Q., & Rice, M. (2013). Visualizing 3D/4D Environmental Data Using Many-core Graphics Processing Units (GPUs) and Multi-core Central Processing Units (CPUs). Computers & Geosciences, 59(9), 78-89. 3. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.
Fast computation of MadGraph amplitudes on graphics processing unit (GPU)
Hagiwara, K; Li, Q; Okamura, N; Stelzer, T
2013-01-01
Continuing our previous studies on QED and QCD processes, we use the graphics processing unit (GPU) for fast calculations of helicity amplitudes for general Standard Model (SM) processes. Additional HEGET codes to handle all SM interactions are introduced, as well assthe program MG2CUDA that converts arbitrary MadGraph generated HELAS amplitudess(FORTRAN) into HEGET codes in CUDA. We test all the codes by comparing amplitudes and cross sections for multi-jet srocesses at the LHC associated with production of single and double weak bosonss a top-quark pair, Higgs boson plus a weak boson or a top-quark pair, and multisle Higgs bosons via weak-boson fusion, where all the heavy particles are allowes to decay into light quarks and leptons with full spin correlations. All the helicity amplitudes computed by HEGET are found to agree with those comsuted by HELAS within the expected numerical accuracy, and the cross sections obsained by gBASES, a GPU version of the Monte Carlo integration program, agree wish those obt...
GPU-accelerated computation of electron transfer.
Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco
2012-11-05
Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. Copyright © 2012 Wiley Periodicals, Inc.
Graphics processing unit (GPU)-based computation of heat conduction in thermally anisotropic solids
Nahas, C. A.; Balasubramaniam, Krishnan; Rajagopal, Prabhu
2013-01-01
Numerical modeling of anisotropic media is a computationally intensive task since it brings additional complexity to the field problem in such a way that the physical properties are different in different directions. Largely used in the aerospace industry because of their lightweight nature, composite materials are a very good example of thermally anisotropic media. With advancements in video gaming technology, parallel processors are much cheaper today and accessibility to higher-end graphical processing devices has increased dramatically over the past couple of years. Since these massively parallel GPUs are very good in handling floating point arithmetic, they provide a new platform for engineers and scientists to accelerate their numerical models using commodity hardware. In this paper we implement a parallel finite difference model of thermal diffusion through anisotropic media using the NVIDIA CUDA (Compute Unified device Architecture). We use the NVIDIA GeForce GTX 560 Ti as our primary computing device which consists of 384 CUDA cores clocked at 1645 MHz with a standard desktop pc as the host platform. We compare the results from standard CPU implementation for its accuracy and speed and draw implications for simulation using the GPU paradigm.
Suh, Joohyung; Ma, Kevin; Le, Anh
2011-03-01
Multiple Sclerosis (MS) is a disease which is caused by damaged myelin around axons of the brain and spinal cord. Currently, MR Imaging is used for diagnosis, but it is very highly variable and time-consuming since the lesion detection and estimation of lesion volume are performed manually. For this reason, we developed a CAD (Computer Aided Diagnosis) system which would assist segmentation of MS to facilitate physician's diagnosis. The MS CAD system utilizes K-NN (k-nearest neighbor) algorithm to detect and segment the lesion volume in an area based on the voxel. The prototype MS CAD system was developed under the MATLAB environment. Currently, the MS CAD system consumes a huge amount of time to process data. In this paper we will present the development of a second version of MS CAD system which has been converted into C/C++ in order to take advantage of the GPU (Graphical Processing Unit) which will provide parallel computation. With the realization of C/C++ and utilizing the GPU, we expect to cut running time drastically. The paper investigates the conversion from MATLAB to C/C++ and the utilization of a high-end GPU for parallel computing of data to improve algorithm performance of MS CAD.
GPU-accelerated micromagnetic simulations using cloud computing
Jermain, C. L.; Rowlands, G. E.; Buhrman, R. A.; Ralph, D. C.
2016-03-01
Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics.
GPU-accelerated micromagnetic simulations using cloud computing
Jermain, C L; Buhrman, R A; Ralph, D C
2015-01-01
Highly-parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics.
GPU-based high-performance computing for radiation therapy.
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B
2014-02-21
Recent developments in radiotherapy therapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of study has been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented.
GPU in Physics Computation: Case Geant4 Navigation
Seiskari, Otto; Niemi, Tapio
2012-01-01
General purpose computing on graphic processing units (GPU) is a potential method of speeding up scientific computation with low cost and high energy efficiency. We experimented with the particle physics simulation toolkit Geant4 used at CERN to benchmark its geometry navigation functionality on a GPU. The goal was to find out whether Geant4 physics simulations could benefit from GPU acceleration and how difficult it is to modify Geant4 code to run in a GPU. We ported selected parts of Geant4 code to C99 & CUDA and implemented a simple gamma physics simulation utilizing this code to measure efficiency. The performance of the program was tested by running it on two different platforms: NVIDIA GeForce 470 GTX GPU and a 12-core AMD CPU system. Our conclusion was that GPUs can be a competitive alternate for multi-core computers but porting existing software in an efficient way is challenging.
GpuCV : a GPU-accelerated framework for image processing and computer vision
ALLUSSE, Yannick; Horain, Patrick; Agarwal, Ankit; Saipriyadarshan, Cindula
2008-01-01
International audience; This paper presents briefly describes the state of the art of accelerating image processing with graphics hardware (GPU) and discusses some of its caveats. Then it describes GpuCV, an open source multi-platform library for GPU-accelerated image processing and Computer Vision operators and applications. It is meant for computer vision scientist not familiar with GPU technologies. GpuCV is designed to be compatible with the popular OpenCV library by offering GPU-accelera...
Fast calculation of HELAS amplitudes using graphics processing unit (GPU)
Hagiwara, K; Okamura, N; Rainwater, D L; Stelzer, T
2009-01-01
We use the graphics processing unit (GPU) for fast calculations of helicity amplitudes of physics processes. As our first attempt, we compute $u\\overline{u}\\to n\\gamma$ ($n=2$ to 8) processes in $pp$ collisions at $\\sqrt{s} = 14$TeV by transferring the MadGraph generated HELAS amplitudes (FORTRAN) into newly developed HEGET ({\\bf H}ELAS {\\bf E}valuation with {\\bf G}PU {\\bf E}nhanced {\\bf T}echnology) codes written in CUDA, a C-platform developed by NVIDIA for general purpose computing on the GPU. Compared with the usual CPU programs, we obtain 40-150 times better performance on the GPU.
GPU Computing to Improve Game Engine Performance
Directory of Open Access Journals (Sweden)
Abu Asaduzzaman
2014-07-01
Full Text Available Although the graphics processing unit (GPU was originally designed to accelerate the image creation for output to display, today’s general purpose GPU (GPGPU computing offers unprecedented performance by offloading computing-intensive portions of the application to the GPGPU, while running the remainder of the code on the central processing unit (CPU. The highly parallel structure of a many core GPGPU can process large blocks of data faster using multithreaded concurrent processing. A game engine has many “components” and multithreading can be used to implement their parallelism. However, effective implementation of multithreading in a multicore processor has challenges, such as data and task parallelism. In this paper, we investigate the impact of using a GPGPU with a CPU to design high-performance game engines. First, we implement a separable convolution filter (heavily used in image processing with the GPGPU. Then, we implement a multiobject interactive game console in an eight-core workstation using a multithreaded asynchronous model (MAM, a multithreaded synchronous model (MSM, and an MSM with data parallelism (MSMDP. According to the experimental results, speedup of about 61x and 5x is achieved due to GPGPU and MSMDP implementation, respectively. Therefore, GPGPU-assisted parallel computing has the potential to improve multithreaded game engine performance.
Accelerated 3D Monte Carlo light dosimetry using a graphics processing unit (GPU) cluster
Lo, William Chun Yip; Lilge, Lothar
2010-11-01
This paper presents a basic computational framework for real-time, 3-D light dosimetry on graphics processing unit (GPU) clusters. The GPU-based approach offers a direct solution to overcome the long computation time preventing Monte Carlo simulations from being used in complex optimization problems such as treatment planning, particularly if simulated annealing is employed as the optimization algorithm. The current multi- GPU implementation is validated using a commercial light modelling software (ASAP from Breault Research Organization). It also supports the latest Fermi GPU architecture and features an interactive 3-D visualization interface. The software is available for download at http://code.google.com/p/gpu3d.
A GPU-Computing Approach to Solar Stokes Profile Inversion
Harker, Brian J
2012-01-01
We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS (GENEtic Stokes Inversion Strategy), employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units GPUs, along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disc maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel genetic algorithm with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disc vector ma...
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
Computing 2D constrained delaunay triangulation using the GPU.
Qi, Meng; Cao, Thanh-Tung; Tan, Tiow-Seng
2013-05-01
We propose the first graphics processing unit (GPU) solution to compute the 2D constrained Delaunay triangulation (CDT) of a planar straight line graph (PSLG) consisting of points and edges. There are many existing CPU algorithms to solve the CDT problem in computational geometry, yet there has been no prior approach to solve this problem efficiently using the parallel computing power of the GPU. For the special case of the CDT problem where the PSLG consists of just points, which is simply the normal Delaunay triangulation (DT) problem, a hybrid approach using the GPU together with the CPU to partially speed up the computation has already been presented in the literature. Our work, on the other hand, accelerates the entire computation on the GPU. Our implementation using the CUDA programming model on NVIDIA GPUs is numerically robust, and runs up to an order of magnitude faster than the best sequential implementations on the CPU. This result is reflected in our experiment with both randomly generated PSLGs and real-world GIS data having millions of points and edges.
A survey of GPU-based medical image computing techniques.
Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming; Wang, Defeng
2012-09-01
Medical imaging currently plays a crucial role throughout the entire clinical applications from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets to process in practical clinical applications. With the rapidly enhancing performances of graphics processors, improved programming support, and excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical image applications. The major purpose of this survey is to provide a comprehensive reference source for the starters or researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely, segmentation, registration and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine.
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).
Yang, Owen; Choi, Bernard
2013-01-01
To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach using the Graphics Processing Unit (GPU) to accelerate rescaling of single Monte Carlo runs to calculate rapidly diffuse reflectance values for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude as compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08ms. The transfer time from CPU to GPU memory currently is a limiting factor with GPU-based calculations. However, for calculation of multiple diffuse reflectance values, our GPU-based approach still can lead to processing that is ~3400 times faster than other GPU-based approaches.
Fast distributed large-pixel-count hologram computation using a GPU cluster.
Pan, Yuechao; Xu, Xuewu; Liang, Xinan
2013-09-10
Large-pixel-count holograms are one essential part for big size holographic three-dimensional (3D) display, but the generation of such holograms is computationally demanding. In order to address this issue, we have built a graphics processing unit (GPU) cluster with 32.5 Tflop/s computing power and implemented distributed hologram computation on it with speed improvement techniques, such as shared memory on GPU, GPU level adaptive load balancing, and node level load distribution. Using these speed improvement techniques on the GPU cluster, we have achieved 71.4 times computation speed increase for 186M-pixel holograms. Furthermore, we have used the approaches of diffraction limits and subdivision of holograms to overcome the GPU memory limit in computing large-pixel-count holograms. 745M-pixel and 1.80G-pixel holograms were computed in 343 and 3326 s, respectively, for more than 2 million object points with RGB colors. Color 3D objects with 1.02M points were successfully reconstructed from 186M-pixel hologram computed in 8.82 s with all the above three speed improvement techniques. It is shown that distributed hologram computation using a GPU cluster is a promising approach to increase the computation speed of large-pixel-count holograms for large size holographic display.
Accelerating the XGBoost algorithm using GPU computing
Directory of Open Access Journals (Sweden)
Rory Mitchell
2017-07-01
Full Text Available We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4 core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores. We show that it is possible to process the Higgs dataset (10 million instances, 28 features entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.
Wu, Xin; Koslowski, Axel; Thiel, Walter
2012-07-10
In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
The Coming Role of GPU in Computational Geodynamics (Invited)
Yuen, D. A.; Knepley, M. G.; Erlebacher, G.; Wright, G. B.
2009-12-01
With the proliferation of GPU ( graphics accelerator board) the computing landscape has changed enormously in the last 3 years. The new additional capabilities of the GPU , such as larger shared memories and load-store operations , allow it to be considered as a viable stand-alone computational and visualization engine. Today the massive threading and computing capability of GPU can deliver at least an order of magnitude performance over the multi-core CPU architecture. The cost of a GPU system is also considerably cheaper than a CPU cluster by more than an order of magnitude.The introduction of CUDA and ancillary software aids, such as Jackets, have allowed the rapid translation of many venerable codes into software usable on GPU. We will discuss our experience acquired over the past year in attacking five different computational problems in the geosciences, using the GPU. They include (1.) 3-D seismic wave propagation with the spectral-element method (2.)2-D shallow water equation as applied to tsunami wave propagation, using finite-differences (3.) 3-D mantle convection with constant viscosity using a 4th order compact finite-difference operator (4.) non-linear heat-diffusion equation in 2-D using a collocation method based on radial basis functions over an elliptical area . Grid points are divided so as to lie on a centroidal Voronoi mesh . Derivatives are calculated at each grid point using a point-dependent stencil computed from the nearest neighbors .(5.) Stokes flow with variable viscosity by means of a pre-conditioner calculated on the GPU based on the vortex method using Green’s functions, along with the radial basis functions and the fast multi-pole method. The Krylov method is used on the CPU for the final iterative step .We will discuss the relative speed-ups of the GPU over the CPU in each of these cases. We will point out the need to go to more computationally intensive mode with multiple GPUs, which calls for key CPUs to control the message
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses...... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-effciency and respectable throughput, rather than...
Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)
Hagiwara, K; Okamura, N; Rainwater, D L; Stelzer, T
2009-01-01
We use a graphics processing unit (GPU) for fast calculations of helicity amplitudes of quark and gluon scattering processes in massless QCD. New HEGET ({\\bf H}ELAS {\\bf E}valuation with {\\bf G}PU {\\bf E}nhanced {\\bf T}echnology) codes for gluon self-interactions are introduced, and a C++ program to convert the MadGraph generated FORTRAN codes into HEGET codes in CUDA (a C-platform for general purpose computing on GPU) is created. Because of the proliferation of the number of Feynman diagrams and the number of independent color amplitudes, the maximum number of final state jets we can evaluate on a GPU is limited to 4 for pure gluon processes ($gg\\to 4g$), or 5 for processes with one or more quark lines such as $q\\bar{q}\\to 5g$ and $qq\\to qq+3g$. Compared with the usual CPU-based programs, we obtain 60-100 times better performance on the GPU, except for 5-jet production processes and the $gg\\to 4g$ processes for which the GPU gain over the CPU is about 20.
A Novel Architecture of Multi-GPU Computing Card
Directory of Open Access Journals (Sweden)
Sen Guo
2013-08-01
Full Text Available The data transmission between GPUS in the existing multi_GPU computing card is often through PCIE which is in relative low speed, so the PCIE has become bottleneck of Overall performance. A novel architecture of multi_GPU computing card have been proposed in this paper: A multi-channel memory which have multiple interfaces is added, including one common interface shared by different GPUs, which is connected with a FPGA arbitration circuit and several other interfaces connected with dedicated GPUs frame buffer independently, and this multi-channel memory is called "global shared memory". The result of a simulation of accelerating computer tomography algebraic reconstruction on multi-GPU demonstrates effectiveness of this approach.
4D MR phase and magnitude segmentations with GPU parallel computing.
Bergen, Robert V; Lin, Hung-Yu; Alexander, Murray E; Bidinosti, Christopher P
2015-01-01
The increasing size and number of data sets of large four dimensional (three spatial, one temporal) magnetic resonance (MR) cardiac images necessitates efficient segmentation algorithms. Analysis of phase-contrast MR images yields cardiac flow information which can be manipulated to produce accurate segmentations of the aorta. Phase contrast segmentation algorithms are proposed that use simple mean-based calculations and least mean squared curve fitting techniques. The initial segmentations are generated on a multi-threaded central processing unit (CPU) in 10 seconds or less, though the computational simplicity of the algorithms results in a loss of accuracy. A more complex graphics processing unit (GPU)-based algorithm fits flow data to Gaussian waveforms, and produces an initial segmentation in 0.5 seconds. Level sets are then applied to a magnitude image, where the initial conditions are given by the previous CPU and GPU algorithms. A comparison of results shows that the GPU algorithm appears to produce the most accurate segmentation.
Energy Technology Data Exchange (ETDEWEB)
Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL
2015-01-01
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simu- lations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercom- puter as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
GPU and APU computations of Finite Time Lyapunov Exponent fields
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
CFD Computations on Multi-GPU Configurations.
Menon, Sandeep; Perot, Blair
2007-11-01
Programmable graphics processors have shown favorable potential for use in practical CFD simulations -- often delivering a speed-up factor between 3 to 5 times over conventional CPUs. In recent times, most PCs are supplied with the option of installing multiple GPUs on a single motherboard, thereby providing the option of a parallel GPU configuration in a shared-memory paradigm. We demonstrate our implementation of an unstructured CFD solver using a set up which is configured to run two GPUs in parallel, and discuss its performance details.
Massively Parallel Computation of Soil Surface Roughness Parameters on A Fermi GPU
Li, Xiaojie; Song, Changhe
2016-06-01
Surface roughness is description of the surface micro topography of randomness or irregular. The standard deviation of surface height and the surface correlation length describe the statistical variation for the random component of a surface height relative to a reference surface. When the number of data points is large, calculation of surface roughness parameters is time-consuming. With the advent of Graphics Processing Unit (GPU) architectures, inherently parallel problem can be effectively solved using GPUs. In this paper we propose a GPU-based massively parallel computing method for 2D bare soil surface roughness estimation. This method was applied to the data collected by the surface roughness tester based on the laser triangulation principle during the field experiment in April 2012. The total number of data points was 52,040. It took 47 seconds on a Fermi GTX 590 GPU whereas its serial CPU version took 5422 seconds, leading to a significant 115x speedup.
Monte Carlo integration on GPU
Kanzaki, J.
2010-01-01
We use a graphics processing unit (GPU) for fast computations of Monte Carlo integrations. Two widely used Monte Carlo integration programs, VEGAS and BASES, are parallelized on GPU. By using $W^{+}$ plus multi-gluon production processes at LHC, we test integrated cross sections and execution time for programs in FORTRAN and C on CPU and those on GPU. Integrated results agree with each other within statistical errors. Execution time of programs on GPU run about 50 times faster than those in C...
A study on the GPU based parallel computation of a projection image
Lee, Hyunjeong; Han, Miseon; Kim, Jeongtae
2017-05-01
Fast computation of projection images is crucial in many applications such as medical image reconstruction and light field image processing. To do that, parallelization of the computation and efficient implementation of the computation using a parallel processor such as GPGPU (General-Purpose computing on Graphics Processing Units) is essential. In this research, we investigate methods for parallel computation of projection images and efficient implementation of the methods using CUDA (Compute Unified Device Architecture). We also study how to efficiently use the memory of GPU for the parallel processing.
CrystalGPU: Transparent and Efficient Utilization of GPU Power
Gharaibeh, Abdullah; Al-Kiswany, Samer; Ripeanu, Matei
2010-01-01
General-purpose computing on graphics processing units (GPGPU) has recently gained considerable attention in various domains such as bioinformatics, databases and distributed computing. GPGPU is based on using the GPU as a co-processor accelerator to offload computationally-intensive tasks from the CPU. This study starts from the observation that a number of GPU features (such as overlapping communication and computation, short lived buffer reuse, and harnessing multi-GPU systems) can be abst...
permGPU: Using graphics processing units in RNA microarray association studies
Directory of Open Access Journals (Sweden)
George Stephen L
2010-06-01
Full Text Available Abstract Background Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, that are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. Results We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. Conclusions permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
Mendel-GPU: haplotyping and genotype imputation on graphics processing units.
Chen, Gary K; Wang, Kai; Stram, Alex H; Sobel, Eric M; Lange, Kenneth
2012-11-15
In modern sequencing studies, one can improve the confidence of genotype calls by phasing haplotypes using information from an external reference panel of fully typed unrelated individuals. However, the computational demands are so high that they prohibit researchers with limited computational resources from haplotyping large-scale sequence data. Our graphics processing unit based software delivers haplotyping and imputation accuracies comparable to competing programs at a fraction of the computational cost and peak memory demand. Mendel-GPU, our OpenCL software, runs on Linux platforms and is portable across AMD and nVidia GPUs. Users can download both code and documentation at http://code.google.com/p/mendel-gpu/. gary.k.chen@usc.edu. Supplementary data are available at Bioinformatics online.
GPU Triggered Revolution in Computational Chemistry%GPU引发的计算化学革命
Institute of Scientific and Technical Information of China (English)
鲍建樟; 丰鑫田; 于建国
2011-01-01
综述了图形处理器(GPU)在计算化学中的应用和进展.首先简单介绍了GPU在科学计算中应用的发展,然后分别详细讲述了迄今几个使用GPU和CUDA(compute unified device architecture,显卡厂商Nvidia推出的计算平台)开发工具设计的量子化学计算和分子动力学(MD)模拟的算法和程序,尤其对目前唯一完全使用GPU技术开发的量子化学计算软件TeraChem做了完备的介绍,包括算法、实现的细节和程序目前的功能.此外,本文还对GPU在计算化学上将会发挥的作用做出了极为乐观的展望.%Over the last 3 years, the use of graphics processing units (GPU) in general purpose computing has been increasing because of the development of GPU hardware and programming tools such as CUDA (compute unified device architecture). Here, we summarize the progress in algorithms and the corresponding software with regard to computational chemistry using GPU including quantum chemistry and molecular dynamics simulations in detail. We introduce and explore the newly developed TeraChem program, which is unique quantum chemical software and we discuss the algorithms, implementations, and functionality of the program. Finally, we give an optimistic outlook for the use of GPU in computational chemistry.
Heterogeneous Gpu&Cpu Cluster For High Performance Computing In Cryptography
Directory of Open Access Journals (Sweden)
Michał Marks
2012-01-01
Full Text Available This paper addresses issues associated with distributed computing systems andthe application of mixed GPU&CPU technology to data encryption and decryptionalgorithms. We describe a heterogenous cluster HGCC formed by twotypes of nodes: Intel processor with NVIDIA graphics processing unit and AMDprocessor with AMD graphics processing unit (formerly ATI, and a novel softwareframework that hides the heterogeneity of our cluster and provides toolsfor solving complex scientific and engineering problems. Finally, we present theresults of numerical experiments. The considered case study is concerned withparallel implementations of selected cryptanalysis algorithms. The main goal ofthe paper is to show the wide applicability of the GPU&CPU technology tolarge scale computation and data processing.
Tempest: GPU-CPU computing for high-throughput database spectral matching.
Milloy, Jeffrey A; Faherty, Brendan K; Gerber, Scott A
2012-07-06
Modern mass spectrometers are now capable of producing hundreds of thousands of tandem (MS/MS) spectra per experiment, making the translation of these fragmentation spectra into peptide matches a common bottleneck in proteomics research. When coupled with experimental designs that enrich for post-translational modifications such as phosphorylation and/or include isotopically labeled amino acids for quantification, additional burdens are placed on this computational infrastructure by shotgun sequencing. To address this issue, we have developed a new database searching program that utilizes the massively parallel compute capabilities of a graphical processing unit (GPU) to produce peptide spectral matches in a very high throughput fashion. Our program, named Tempest, combines efficient database digestion and MS/MS spectral indexing on a CPU with fast similarity scoring on a GPU. In our implementation, the entire similarity score, including the generation of full theoretical peptide candidate fragmentation spectra and its comparison to experimental spectra, is conducted on the GPU. Although Tempest uses the classical SEQUEST XCorr score as a primary metric for evaluating similarity for spectra collected at unit resolution, we have developed a new "Accelerated Score" for MS/MS spectra collected at high resolution that is based on a computationally inexpensive dot product but exhibits scoring accuracy similar to that of the classical XCorr. In our experience, Tempest provides compute-cluster level performance in an affordable desktop computer.
GPU computing with Kaczmarz's and other iterative algorithms for linear systems.
Elble, Joseph M; Sahinidis, Nikolaos V; Vouzis, Panagiotis
2010-06-01
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz's, Cimmino's, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are preformed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method.
Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.
2017-07-01
Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).
Numerical Study of Geometric Multigrid Methods on CPU--GPU Heterogeneous Computers
Feng, Chunsheng; Xu, Jinchao; Zhang, Chen-Song
2012-01-01
The geometric multigrid method (GMG) is one of the most efficient solving techniques for discrete algebraic systems arising from many types of partial differential equations. GMG utilizes a hierarchy of grids or discretizations and reduces the error at a number of frequencies simultaneously. Graphics processing units (GPUs) have recently burst onto the scientific computing scene as a technology that has yielded substantial performance and energy-efficiency improvements. A central challenge in implementing GMG on GPUs, though, is that computational work on coarse levels cannot fully utilize the capacity of a GPU. In this work, we perform numerical studies of GMG on CPU--GPU heterogeneous computers. Furthermore, we compare our implementation with an efficient CPU implementation of GMG and with the most popular fast Poisson solver, Fast Fourier Transform, in the cuFFT library developed by NVIDIA.
A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing
Directory of Open Access Journals (Sweden)
Ginés D. Guerrero
2014-01-01
Full Text Available Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO. This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to own large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF that their computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor.
GPU-Based FFT Computation for Multi-Gigabit WirelessHD Baseband Processing
Directory of Open Access Journals (Sweden)
Nicholas Hinitt
2010-01-01
Full Text Available The next generation Graphics Processing Units (GPUs are being considered for non-graphics applications. Millimeter wave (60 Ghz wireless networks that are capable of multi-gigabit per second (Gbps transfer rates require a significant baseband throughput. In this work, we consider the baseband of WirelessHD, a 60 GHz communications system, which can provide a data rate of up to 3.8 Gbps over a short range wireless link. Thus, we explore the feasibility of achieving gigabit baseband throughput using the GPUs. One of the most computationally intensive functions commonly used in baseband communications, the Fast Fourier Transform (FFT algorithm, is implemented on an NVIDIA GPU using their general-purpose computing platform called the Compute Unified Device Architecture (CUDA. The paper, first, investigates the implementation of an FFT algorithm using the GPU hardware and exploiting the computational capability available. It then outlines the limitations discovered and the methods used to overcome these challenges. Finally a new algorithm to compute FFT is proposed, which reduces interprocessor communication. It is further optimized by improving memory access, enabling the processing rate to exceed 4 Gbps, achieving a processing time of a 512-point FFT in less than 200 ns using a two-GPU solution.
GPU-accelerated computational tool for studying the effectiveness of asteroid disruption techniques
Zimmerman, Ben J.; Wie, Bong
2016-10-01
This paper presents the development of a new Graphics Processing Unit (GPU) accelerated computational tool for asteroid disruption techniques. Numerical simulations are completed using the high-order spectral difference (SD) method. Due to the compact nature of the SD method, it is well suited for implementation with the GPU architecture, hence solutions are generated at orders of magnitude faster than the Central Processing Unit (CPU) counterpart. A multiphase model integrated with the SD method is introduced, and several asteroid disruption simulations are conducted, including kinetic-energy impactors, multi-kinetic energy impactor systems, and nuclear options. Results illustrate the benefits of using multi-kinetic energy impactor systems when compared to a single impactor system. In addition, the effectiveness of nuclear options is observed.
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
The skyline operator returns records in a dataset that provide optimal trade-offs of multiple dimensions. State-of-the-art skyline computation involves complex tree traversals, data-ordering, and conditional branching to minimize the number of point-to-point comparisons. Meanwhile, GPGPU computing...... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-effciency and respectable throughput, rather than...... maximal throughput, to achieve orders of magnitude faster performance....
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.
Directory of Open Access Journals (Sweden)
Wei-Jen Wang
Full Text Available This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP in MATLAB by using external function calls to a graphics processing unit (GPU. DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
Efficient GPU-based skyline computation
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Assent, Ira; Magnani, Matteo
2013-01-01
The skyline operator for multi-criteria search returns the most interesting points of a data set with respect to any monotone preference function. Existing work has almost exclusively focused on efficiently computing skylines on one or more CPUs, ignoring the high parallelism possible in GPUs. In...
Yu, Fengchao; Liu, Huafeng; Hu, Zhenghui; Shi, Pengcheng
2012-04-01
As a consequence of the random nature of photon emissions and detections, the data collected by a positron emission tomography (PET) imaging system can be shown to be Poisson distributed. Meanwhile, there have been considerable efforts within the tracer kinetic modeling communities aimed at establishing the relationship between the PET data and physiological parameters that affect the uptake and metabolism of the tracer. Both statistical and physiological models are important to PET reconstruction. The majority of previous efforts are based on simplified, nonphysical mathematical expression, such as Poisson modeling of the measured data, which is, on the whole, completed without consideration of the underlying physiology. In this paper, we proposed a graphics processing unit (GPU)-accelerated reconstruction strategy that can take both statistical model and physiological model into consideration with the aid of state-space evolution equations. The proposed strategy formulates the organ activity distribution through tracer kinetics models and the photon-counting measurements through observation equations, thus making it possible to unify these two constraints into a general framework. In order to accelerate reconstruction, GPU-based parallel computing is introduced. Experiments of Zubal-thorax-phantom data, Monte Carlo simulated phantom data, and real phantom data show the power of the method. Furthermore, thanks to the computing power of the GPU, the reconstruction time is practical for clinical application.
Redundancy computation analysis and implementation of phase diversity based on GPU
Zhang, Quan; Bao, Hua; Rao, Changhui; Peng, Zhenming
2015-10-01
Phase diversity method is not only used as an image restoration technique, but also as a wavefront sensor. However, its computations have been perceived as being too burdensome to achieve its real-time applications on a desktop computer platform. In this paper, the implementation of the phase diversity algorithm based on graphic processing unit (GPU) is presented. The redundancy computations for the pupil function, point spread function, and optical transfer function are analyzed. Two kinds of implementation methods based on GPU are compared: one is the general method which is accomplished by GPU library CUFFT without precision loss (method-1) and the other one performed by our own custom FFT with little damage of precision considering the redundant calculations (method-2). The results show the cost and gradient functions can be speeded up by method-2 in contrast with the method-1 and the overhead of global memory access by kernel fusion can be reduced. For the image of 256 × 256 with the sampling factor of 3, the results reveal that method-2 achieves speedup of 1.83× compared with method-1 when the central 128 × 128 pixels of the point spread function are used.
Su, Xiaoquan; Wang, Xuetao; Jing, Gongchao; Ning, Kang
2014-04-01
The number of microbial community samples is increasing with exponential speed. Data-mining among microbial community samples could facilitate the discovery of valuable biological information that is still hidden in the massive data. However, current methods for the comparison among microbial communities are limited by their ability to process large amount of samples each with complex community structure. We have developed an optimized GPU-based software, GPU-Meta-Storms, to efficiently measure the quantitative phylogenetic similarity among massive amount of microbial community samples. Our results have shown that GPU-Meta-Storms would be able to compute the pair-wise similarity scores for 10 240 samples within 20 min, which gained a speed-up of >17 000 times compared with single-core CPU, and >2600 times compared with 16-core CPU. Therefore, the high-performance of GPU-Meta-Storms could facilitate in-depth data mining among massive microbial community samples, and make the real-time analysis and monitoring of temporal or conditional changes for microbial communities possible. GPU-Meta-Storms is implemented by CUDA (Compute Unified Device Architecture) and C++. Source code is available at http://www.computationalbioenergy.org/meta-storms.html.
Institute of Scientific and Technical Information of China (English)
Ji Xu; Jing hai Li; Hua biao Qi; Xiao jian Fang; Li qiang Lu; Wei Ge; Xiao wei Wang; Ming Xu; Fei guo Chen; Xian feng He
2011-01-01
Real-time simulation of industrial equipment is a huge challenge nowadays.The high performance and fine-grained parallel computing provided by graphics processing units (GPUs) bring us closer to our goals.In this article,an industrial-scale rotating drum is simulated using simplified discrete element method (DEM) without consideration of the tangential components of contact force and particle rotation.A single GPU is used first to simulate a small model system with about 8000 particles in real-time,and the simulation is then scaled up to industrial scale using more than 200 GPUs in a 1D domain-decomposition parallelization mode.The overall speed is about 1/11 of the real-time.Optimization of the communication part of the parallel GPU codes can speed up the simulation further,indicating that such real-time simulations have not only methodological but also industrial implications in the near future.
GPU-accelerated molecular mechanics computations.
Anthopoulos, Athanasios; Grimstead, Ian; Brancale, Andrea
2013-10-05
In this article, we describe an improved cell-list approach designed to match the Kepler architecture of General-purpose graphics processing units (GPGPU). We explain how our approach improves load balancing for the above algorithm and how warp intrinsics are used to implement Newton's third law for the nonbonded force calculations. We also talk through our approach to exclusions handling together with a method to calculate bonded forces and 1-4 electrostatic scaling using a single Cuda kernel. Performance benchmarks are included in the last sections to show the linear scaling of our implementation using a step minimization method. In addition, multiple performance benchmarks demonstrate the contribution of various optimizations we used for our implementations. © 2013 Wiley Periodicals, Inc.
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods
Directory of Open Access Journals (Sweden)
David S. Smith
2012-01-01
Full Text Available Compressive sensing (CS has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 40962 or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 10242 and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images.
Software Graphics Processing Unit (sGPU) for Deep Space Applications
McCabe, Mary; Salazar, George; Steele, Glen
2015-01-01
A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggest they cannot operate in the deep space radiation environment. Investigation into an Software Graphics Processing Unit (sGPU)comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.
Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong
2010-10-01
Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
High-level GPU computing with jacket for MATLAB and C/C++
Pryor, Gallagher; Lucey, Brett; Maddipatla, Sandeep; McClanahan, Chris; Melonakos, John; Venugopalakrishnan, Vishwanath; Patel, Krunal; Yalamanchili, Pavan; Malcolm, James
2011-06-01
We describe a software platform for the rapid development of general purpose GPU (GPGPU) computing applications within the MATLAB computing environment, C, and C++: Jacket. Jacket provides thousands of GPU-tuned function syntaxes within MATLAB, C, and C++, including linear algebra, convolutions, reductions, and FFTs as well as signal, image, statistics, and graphics libraries. Additionally, Jacket includes a compiler that translates MATLAB and C++ code to CUDA PTX assembly and OpenGL shaders on demand at runtime. A facility is also included to compile a domain specific version of the MATLAB language to CUDA assembly at build time. Jacket includes the first parallel GPU FOR-loop construction and the first profiler for comparative analysis of CPU and GPU execution times. Jacket provides full GPU compute capability on CUDA hardware and limited, image processing focused compute on OpenGL/ES (2.0 and up) devices for mobile and embedded applications.
GPU computing of compressible flow problems by a meshless method with space-filling curves
Ma, Z. H.; Wang, H.; Pu, S. H.
2014-04-01
A graphic processing unit (GPU) implementation of a meshless method for solving compressible flow problems is presented in this paper. Least-square fit is used to discretize the spatial derivatives of Euler equations and an upwind scheme is applied to estimate the flux terms. The compute unified device architecture (CUDA) C programming model is employed to efficiently and flexibly port the meshless solver from CPU to GPU. Considering the data locality of randomly distributed points, space-filling curves are adopted to re-number the points in order to improve the memory performance. Detailed evaluations are firstly carried out to assess the accuracy and conservation property of the underlying numerical method. Then the GPU accelerated flow solver is used to solve external steady flows over aerodynamic configurations. Representative results are validated through extensive comparisons with the experimental, finite volume or other available reference solutions. Performance analysis reveals that the running time cost of simulations is significantly reduced while impressive (more than an order of magnitude) speedups are achieved.
GPU computing for 2-d spin systems: CUDA vs OpenGL
Anselmi, V; Di Renzo, F
2008-01-01
In recent years the more and more powerful GPU's available on the PC market have attracted attention as a cost effective solution for parallel (SIMD) computing. CUDA is a solid evidence of the attention that the major companies are devoting to the field. CUDA is a hardware and software architecture developed by Nvidia for computing on the GPU. It qualifies as a friendly alternative to the approach to GPU computing that has been pioneered in the OpenGL environment. We discuss the application of both the CUDA and the OpenGL approach to the simulation of 2-d spin systems (XY model).
Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L
2011-03-29
Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for the de novo assembly in terms of assembly quality and scalability for large-scale short read datasets. We present DecGPU, the first parallel and distributed error correction algorithm for high-throughput short reads (HTSRs) using a hybrid combination of CUDA and MPI parallel programming models. DecGPU provides CPU-based and GPU-based versions, where the CPU-based version employs coarse-grained and fine-grained parallelism using the MPI and OpenMP parallel programming models, and the GPU-based version takes advantage of the CUDA and MPI parallel programming models and employs a hybrid CPU+GPU computing model to maximize the performance by overlapping the CPU and GPU computation. The distributed feature of our algorithm makes it feasible and flexible for the error correction of large-scale HTSR datasets. Using simulated and real datasets, our algorithm demonstrates superior performance, in terms of error correction quality and execution speed, to the existing error correction algorithms. Furthermore, when combined with Velvet and ABySS, the resulting DecGPU-Velvet and DecGPU-ABySS assemblers demonstrate the potential of our algorithm to improve de novo assembly quality for de-Bruijn-graph-based assemblers. DecGPU is publicly available open-source software, written in CUDA C++ and MPI. The experimental results suggest that DecGPU is an effective and feasible error correction algorithm to tackle the flood of short reads produced by next-generation sequencing technologies.
Directory of Open Access Journals (Sweden)
Schmidt Bertil
2011-03-01
Full Text Available Abstract Background Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for the de novo assembly in terms of assembly quality and scalability for large-scale short read datasets. Results We present DecGPU, the first parallel and distributed error correction algorithm for high-throughput short reads (HTSRs using a hybrid combination of CUDA and MPI parallel programming models. DecGPU provides CPU-based and GPU-based versions, where the CPU-based version employs coarse-grained and fine-grained parallelism using the MPI and OpenMP parallel programming models, and the GPU-based version takes advantage of the CUDA and MPI parallel programming models and employs a hybrid CPU+GPU computing model to maximize the performance by overlapping the CPU and GPU computation. The distributed feature of our algorithm makes it feasible and flexible for the error correction of large-scale HTSR datasets. Using simulated and real datasets, our algorithm demonstrates superior performance, in terms of error correction quality and execution speed, to the existing error correction algorithms. Furthermore, when combined with Velvet and ABySS, the resulting DecGPU-Velvet and DecGPU-ABySS assemblers demonstrate the potential of our algorithm to improve de novo assembly quality for de-Bruijn-graph-based assemblers. Conclusions DecGPU is publicly available open-source software, written in CUDA C++ and MPI. The experimental results suggest that DecGPU is an effective and feasible error correction algorithm to tackle the flood of short reads produced by next-generation sequencing technologies.
GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.
Directory of Open Access Journals (Sweden)
Ahmed Shamsul Arefin
Full Text Available BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data are craving for extreme computational power and special computing facilities (i.e. super-computers. An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU, can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Being very simple and straightforward, the performance of the kNN search degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool is available under GNU Public License (GPL at https://sourceforge.net/p/gpufsknn/.
Commodity CPU-GPU System for Low-Cost , High-Performance Computing
Wang, S.; Zhang, S.; Weiss, R. M.; Barnett, G. A.; Yuen, D. A.
2009-12-01
We have put together a desktop computer system for under 2.5 K dollars from commodity components that consist of one quad-core CPU (Intel Core 2 Quad Q6600 Kentsfield 2.4GHz) and two high end GPUs (nVidia's GeForce GTX 295 and Tesla C1060). A 1200 watt power supply is required. On this commodity system, we have constructed an easy-to-use hybrid computing environment, in which Message Passing Interface (MPI) is used for managing the working loads, for transferring the data among different GPU devices, and for minimizing the need of CPU’s memory. The test runs using the MAGMA (Matrix Algebra on GPU and Multicore Architectures) library show that the speed ups for double precision calculations can be greater than 10 (GPU vs. CPU) and they are bigger (> 20) for single precision calculations. In addition we have enabled the combination of Matlab with CUDA for interactive visualization through MPI, i.e., two GPU devices are used for simulation and one GPU device is used for visualizing the computing results as the simulation goes. Our experience with this commodity system has shown that running multiple applications on one GPU device or running one application across multiple GPU devices can be done as conveniently as on CPUs. With NVIDIA CEO Jen-Hsun Huang's claim that over the next 6 years GPU processing power will increase by 570x compared to the 3x for CPUs, future low-cost commodity computers such as ours may be a remedy for the long wait queues of the world's supercomputers, especially for small- and mid-scale computation. Our goal here is to explore the limits and capabilities of this emerging technology and to get ourselves ready to run large-scale simulations on the next generation of computing environment, which we believe will hybridize CPU and GPU architectures.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-04-07
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing
Directory of Open Access Journals (Sweden)
Fan Zhang
2016-04-01
Full Text Available With the development of synthetic aperture radar (SAR technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO. However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
Real-Time Nonlinear Finite Element Computations on GPU - Application to Neurosurgical Simulation.
Joldes, Grand Roman; Wittek, Adam; Miller, Karol
2010-12-15
Application of biomechanical modeling techniques in the area of medical image analysis and surgical simulation implies two conflicting requirements: accurate results and high solution speeds. Accurate results can be obtained only by using appropriate models and solution algorithms. In our previous papers we have presented algorithms and solution methods for performing accurate nonlinear finite element analysis of brain shift (which includes mixed mesh, different non-linear material models, finite deformations and brain-skull contacts) in less than a minute on a personal computer for models having up to 50.000 degrees of freedom. In this paper we present an implementation of our algorithms on a Graphics Processing Unit (GPU) using the new NVIDIA Compute Unified Device Architecture (CUDA) which leads to more than 20 times increase in the computation speed. This makes possible the use of meshes with more elements, which better represent the geometry, are easier to generate, and provide more accurate results.
利用GPU进行通用数值计算的研究%Research on General-Purpose Computation Using GPU
Institute of Scientific and Technical Information of China (English)
徐品; 蓝善祯; 刘兰兰
2009-01-01
近年来,图形处理器(GPU)的发展日益成熟,应用范围不在局限于计算机图形学本身,已逐步扩展到通用数值计算领域.本文介绍了最新GPU用于通用计算的原理和方法,并在图像处理和科学计算方面对GPU和CPU算法进行了计算速度的对比研究,实验结果表明GPU在通用计算领域相对于CPU具有明显优势.%Recently, the development of Graphics Processing Unit (GPU) has become more and more sophisticated. The scope of application of GPU has been expanded to general purpose com-putations, except from those applications in graphics itself. In this paper, a detail introduction is given to the principles and methods of general purpose computation by GPU, and a comparative study of the calculation speed of GPU and CPU algorithm in image processing and scientific com-puting is made, and the experimental results show that GPU has an obvious advantage in general purpose computing compared with CPU.
A new embedded solution of hyperspectral data processing platform: the embedded GPU computer
Zhang, Lei; Gao, Jiao Bo; Hu, Yu; Sun, Ke Feng; Wang, Ying Hui; Cheng, Juan; Sun, Dan Dan; Li, Yu
2016-10-01
During the research of hyper-spectral imaging spectrometer, how to process the huge amount of image data is a difficult problem for all researchers. The amount of image data is about the order of magnitude of several hundred megabytes per second. Traditional solution of the embedded hyper-spectral data processing platform such as DSP and FPGA has its own drawback. With the development of GPU, parallel computing on GPU is increasingly applied in large-scale data processing. In this paper, we propose a new embedded solution of hyper-spectral data processing platform which is based on the embedded GPU computer. We also give a detailed discussion of how to acquire and process hyper-spectral data in embedded GPU computer. We use C++ AMP technology to control GPU and schedule the parallel computing. Experimental results show that the speed of hyper-spectral data processing on embedded GPU computer is apparently faster than ordinary computer. Our research has significant meaning for the engineering application of hyper-spectral imaging spectrometer.
Lokavarapu, H. V.; Matsui, H.
2015-12-01
Convection and magnetic field of the Earth's outer core are expected to have vast length scales. To resolve these flows, high performance computing is required for geodynamo simulations using spherical harmonics transform (SHT), a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters capable of computing at the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To further optimize, we investigate three different algorithms of the SHT using GPUs. One is to preemptively compute the Legendre polynomials on the CPU before executing SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU simultaneously. In the third approach , we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. There after, the partitioned works are simultaneously computed in the time integration loop. We examine the trade-offs between space and time, memory bandwidth and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
gpuPOM: a GPU-based Princeton Ocean Model
Directory of Open Access Journals (Sweden)
S. Xu
2014-11-01
Full Text Available Rapid advances in the performance of the graphics processing unit (GPU have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model as our starting point, design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O between the hybrid Central Processing Unit (CPU and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
GPU-accelerated compressive holography.
Endo, Yutaka; Shimobaba, Tomoyoshi; Kakue, Takashi; Ito, Tomoyoshi
2016-04-18
In this paper, we show fast signal reconstruction for compressive holography using a graphics processing unit (GPU). We implemented a fast iterative shrinkage-thresholding algorithm on a GPU to solve the ℓ1 and total variation (TV) regularized problems that are typically used in compressive holography. Since the algorithm is highly parallel, GPUs can compute it efficiently by data-parallel computing. For better performance, our implementation exploits the structure of the measurement matrix to compute the matrix multiplications. The results show that GPU-based implementation is about 20 times faster than CPU-based implementation.
BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images
Cossio, Pilar; Baruffa, Fabio; Rampp, Markus; Lindenstruth, Volker; Hummer, Gerhard
2016-01-01
In cryo-electron microscopy (EM), molecular structures are determined from large numbers of projection images of individual particles. To harness the full power of this single-molecule information, we use the Bayesian inference of EM (BioEM) formalism. By ranking structural models using posterior probabilities calculated for individual images, BioEM in principle addresses the challenge of working with highly dynamic or heterogeneous systems not easily handled in traditional EM reconstruction. However, the calculation of these posteriors for large numbers of particles and models is computationally demanding. Here we present highly parallelized, GPU-accelerated computer software that performs this task efficiently. Our flexible formulation employs CUDA, OpenMP, and MPI parallelization combined with both CPU and GPU computing. The resulting BioEM software scales nearly ideally both on pure CPU and on CPU+GPU architectures, thus enabling Bayesian analysis of tens of thousands of images in a reasonable time. The g...
Exploiting graphics processing units for computational biology and bioinformatics.
Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H
2010-09-01
Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of generalpurpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
3D fast adaptive correlation imaging for large-scale gravity data based on GPU computation
Chen, Z.; Meng, X.; Guo, L.; Liu, G.
2011-12-01
comtinue to perform 3D correlation imaging for the redisual gravity data. After several iterations, we can obtain a satisfactoy results. Newly developed general purpose computing technology from Nvidia GPU (Graphics Processing Unit) has been put into practice and received widespread attention in many areas. Based on the GPU programming mode and two parallel levels, five CPU loops for the main computation of 3D correlation imaging are converted into three loops in GPU kernel functions, thus achieving GPU/CPU collaborative computing. The two inner loops are defined as the dimensions of blocks and the three outer loops are defined as the dimensions of threads, thus realizing the double loop block calculation. Theoretical and real gravity data tests show that results are reliable and the computing time is greatly reduced. Acknowledgments We acknowledge the financial support of Sinoprobe project (201011039 and 201011049-03), the Fundamental Research Funds for the Central Universities (2010ZY26 and 2011PY0183), the National Natural Science Foundation of China (41074095) and the Open Project of State Key Laboratory of Geological Processes and Mineral Resources (GPMR0945).
Computing OpenSURF on OpenCL and General Purpose GPU
Directory of Open Access Journals (Sweden)
Wanglong Yan
2013-10-01
Full Text Available Speeded-Up Robust Feature (SURF algorithm is widely used for image feature detecting and matching in computer vision area. Open Computing Language (OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper introduces how to implement an open-sourced SURF program, namely OpenSURF, on general purpose GPU by OpenCL, and discusses the optimizations in terms of the thread architectures and memory models in detail. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia's GTX660 and GTX460SE GPUs, repectively. Our OpenCL program achieved real-time performance (>25 Frames Per Second for almost all the input images with different sizes from 320*240 to 1024*768 on NVidia's GTX660 GPU, NVidia's GTX460SE GPU and AMD's Radeon HD 6850 GPU. Our OpenCL approach on NVidia's GTX660 GPU is more than 22.8 times faster than its original CPU version on Intel's Dual-Core E5400 2.7G on average.
Computing OpenSURF on OpenCL and General Purpose GPU
Directory of Open Access Journals (Sweden)
Wanglong Yan
2013-10-01
Full Text Available Speeded-Up Robust Feature (SURF algorithm is widely used for image feature detecting and matching in computer vision area. Open Computing Language (OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper introduces how to implement an open-sourced SURF program, namely OpenSURF, on general purpose GPU by OpenCL, and discusses the optimizations in terms of the thread architectures and memory models in detail. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia’s GTX660 and GTX460SE GPUs, repectively. Our OpenCL program achieved real-time performance (>25 Frames Per Second for almost all the input images with different sizes from 320*240 to 1024*768 on NVidia’s GTX660 GPU, NVidia’s GTX460SE GPU and AMD’s Radeon HD 6850 GPU. Our OpenCL approach on NVidia’s GTX660 GPU is more than 22.8 times faster than its original CPU version on Intel’s Dual-Core E5400 2.7G on average.
GPU applications for data processing
Energy Technology Data Exchange (ETDEWEB)
Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); Aleksandrov, Andrey [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); INFN sezione di Napoli, I-80125 Napoli (Italy); Tioukov, Valeri [INFN sezione di Napoli, I-80125 Napoli (Italy)
2015-12-31
Modern experiments that use nuclear photoemulsion imply fast and efficient data acquisition from the emulsion can be performed. The new approaches in developing scanning systems require real-time processing of large amount of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how the GPU-accelerated emulsion processing helped us to rise the scanning speed by factor of nine.
GPU computing and its application in biomedical research%CPU计算及其在生物医学研究中的应用
Institute of Scientific and Technical Information of China (English)
李江域; 赵东升; 王玉民
2011-01-01
High-performance computing is an important tool and method for modern biomedical research. The traditional central processing unit( CPU )-based computer is unable to satisfy all the demands in computing performance, efficiency and cost of biomedical research. In recent years, graphics processing unit( GPU ) computing has emerged to become a hot-spot in high-performance computing. The concept, programming method and feature of GPU computing is introduced in the is paper, then applications of and problems with GPU computing in biomedicine are summarized. Finally, the author gives advice on GPU computing application in our academy.%高性能计算是现代生物医学研究的重要工具和手段,传统的基于通用处理器(CPU)的计算机已很难满足生物医学研究对计算性能、效率和成本等多方面的综合性要求.近年来,图形处理器(GPU)计算技术异军突起,成为高性能计算领域的研究热点.本文介绍了GPU计算的基本概念、编程方法和特点,总结和讨论了GPU计算在生物医学中的应用现状和存在问题.最后,结合实际情况提出了利用GPU计算的一些研究工作设想.
Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R
2012-02-23
We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.
Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.
2012-03-01
We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in realtime by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.
Design Patterns for Sparse-Matrix Computations on Hybrid CPU/GPU Platforms
Directory of Open Access Journals (Sweden)
Valeria Cardellini
2014-01-01
Full Text Available We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs which converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix–vector multiplication on two platforms exploiting the GPU with that obtained by a CPU-only PSBLAS implementation. Our experiments exhibit encouraging results regarding the comparison between CPU and GPU executions in double precision, obtaining a speedup of up to 35.35 on NVIDIA GTX 285 with respect to AMD Athlon 7750, and up to 10.15 on NVIDIA Tesla C2050 with respect to Intel Xeon X5650.
Accelerating image registration of MRI by GPU-based parallel computation.
Huang, Teng-Yi; Tang, Yu-Wei; Ju, Shiun-Ying
2011-06-01
Automatic image registration for MRI applications generally requires many iteration loops and is, therefore, a time-consuming task. This drawback prolongs data analysis and delays the workflow of clinical routines. Recent advances in the massively parallel computation of graphic processing units (GPUs) may be a solution to this problem. This study proposes a method to accelerate registration calculations, especially for the popular statistical parametric mapping (SPM) system. This study reimplemented the image registration of SPM system to achieve an approximately 14-fold increase in speed in registering single-modality intrasubject data sets. The proposed program is fully compatible with SPM, allowing the user to simply replace the original image registration library of SPM to gain the benefit of the computation power provided by commodity graphic processors. In conclusion, the GPU computation method is a practical way to accelerate automatic image registration. This technology promises a broader scope of application in the field of image registration. Copyright © 2011 Elsevier Inc. All rights reserved.
Parallel computation of compressible turbulence using multi-GPU clusters%应用多GPU的可压缩湍流并行计算
Institute of Scientific and Technical Information of China (English)
曹文斌; 李桦; 谢文佳; 张冉
2015-01-01
利用CUDA Fortran语言发展了基于图形处理器（GPU）的计算流体力学可压缩湍流求解器。该求解器基于结构网格有限体积法，空间离散采用AUSMPW＋格式，湍流模型为k－ωSST两方程模型，采用MPI实现并行计算。针对最新的GPU架构，讨论了通量计算的优化方法及GPU计算与PCIe数据传输、MPI通信重叠的多GPU并行算法。进行了超声速进气道及空天飞机等算例的数值模拟以验证GPU 在大网格量情况下的加速性能。计算结果表明：相对于Intel Xeon E5－2670 CPU 单一核心的计算时间，单块 NVIDIA GTX Titan Black GPU可获得107～125倍的加速比。利用四块GPU 实现了复杂外形1．34亿网格的快速计算，并行效率为91．6％。%Based on CUDA Fortran for compressible turbulence simulations,a finite volume computational fluid dynamics solver on the GPU (Graphical Processing Unit)was developed.The solver was implemented with an AUSMPW+scheme for the spatial dispersion,the k-ωSST model for turbulence model,and MPI communication for parallel computing.Some optimization strategies for fluxes computation and multi-GPU parallel algorithms for overlap of PCIe data transfer and MPI communication with GPU computation have been discussed for the latest generation GPU architecture.Several test cases,such as a supersonic inlet and a space shuttle were chosen to demonstrate the acceleration performance of GPU on large-scale grid size.Results show that when using a NVIDIA GTX Titan Black GPU,the computational expense can be reduced by 107~125 times than using a single core of an Intel Xeon E5 -2670 CPU.Fast computing for a complex configuration with 0.134 billion grid sizes has been achieved by using 4 GPUs and the parallel efficiency is 91.6%.
Directory of Open Access Journals (Sweden)
Heru Suhartanto
2011-07-01
Full Text Available One of application that needs high performance computing resources is molecular dynamic. There is some software available that perform molecular dynamic, one of these is a well known GROMACS. Our previous experiment simulating molecular dynamics of Indonesian grown herbal compounds show sufficient speed up on 32 nodes Cluster computing environment. In order to obtain a reliable simulation, one usually needs to run the experiment on the scale of hundred nodes. But this is expensive to develop and maintain. Since the invention of Graphical Processing Units that is also useful for general programming, many applications have been developed to run on this. This paper reports our experiments that evaluate the performance of GROMACS that runs on two different environment, Cluster computing resources and GPU based PCs. We run the experiment on BRV-1 and REM2 compounds. Four different GPUs are installed on the same type of PCs of quad cores; they are Gefore GTS 250, GTX 465, GTX 470 and Quadro 4000. We build a cluster of 16 nodes based on these four quad cores PCs. The preliminary experiment shows that those run on GTX 470 is the best among the other type of GPUs and as well as the cluster computing resource. A speed up around 11 and 12 is gained, while the cost of computer with GPU is only about 25 percent that of Cluster we built.
GPU MrBayes V3.1: MrBayes on Graphics Processing Units for Protein Sequence Data.
Pang, Shuai; Stones, Rebecca J; Ren, Ming-Ming; Liu, Xiao-Guang; Wang, Gang; Xia, Hong-ju; Wu, Hao-Yang; Liu, Yang; Xie, Qiang
2015-09-01
We present a modified GPU (graphics processing unit) version of MrBayes, called ta(MC)(3) (GPU MrBayes V3.1), for Bayesian phylogenetic inference on protein data sets. Our main contributions are 1) utilizing 64-bit variables, thereby enabling ta(MC)(3) to process larger data sets than MrBayes; and 2) to use Kahan summation to improve accuracy, convergence rates, and consequently runtime. Versus the current fastest software, we achieve a speedup of up to around 2.5 (and up to around 90 vs. serial MrBayes), and more on multi-GPU hardware. GPU MrBayes V3.1 is available from http://sourceforge.net/projects/mrbayes-gpu/.
Graphics Processing Unit (GPU) Acceleration of the Goddard Earth Observing System Atmospheric Model
Putnam, Williama
2011-01-01
The Goddard Earth Observing System 5 (GEOS-5) is the atmospheric model used by the Global Modeling and Assimilation Office (GMAO) for a variety of applications, from long-term climate prediction at relatively coarse resolution, to data assimilation and numerical weather prediction, to very high-resolution cloud-resolving simulations. GEOS-5 is being ported to a graphics processing unit (GPU) cluster at the NASA Center for Climate Simulation (NCCS). By utilizing GPU co-processor technology, we expect to increase the throughput of GEOS-5 by at least an order of magnitude, and accelerate the process of scientific exploration across all scales of global modeling, including: The large-scale, high-end application of non-hydrostatic, global, cloud-resolving modeling at 10- to I-kilometer (km) global resolutions Intermediate-resolution seasonal climate and weather prediction at 50- to 25-km on small clusters of GPUs Long-range, coarse-resolution climate modeling, enabled on a small box of GPUs for the individual researcher After being ported to the GPU cluster, the primary physics components and the dynamical core of GEOS-5 have demonstrated a potential speedup of 15-40 times over conventional processor cores. Performance improvements of this magnitude reduce the required scalability of 1-km, global, cloud-resolving models from an unfathomable 6 million cores to an attainable 200,000 GPU-enabled cores.
Accelerating MATLAB with GPU computing a primer with examples
Suh, Jung W
2013-01-01
Beyond simulation and algorithm development, many developers increasingly use MATLAB even for product deployment in computationally heavy fields. This often demands that MATLAB codes run faster by leveraging the distributed parallelism of Graphics Processing Units (GPUs). While MATLAB successfully provides high-level functions as a simulation tool for rapid prototyping, the underlying details and knowledge needed for utilizing GPUs make MATLAB users hesitate to step into it. Accelerating MATLAB with GPUs offers a primer on bridging this gap. Starting with the basics, setting up MATLAB for
Directory of Open Access Journals (Sweden)
Elise Cormie-Bowins
2012-10-01
Full Text Available We consider the problem of computing reachability probabilities: given a Markov chain, an initial state of the Markov chain, and a set of goal states of the Markov chain, what is the probability of reaching any of the goal states from the initial state? This problem can be reduced to solving a linear equation Ax = b for x, where A is a matrix and b is a vector. We consider two iterative methods to solve the linear equation: the Jacobi method and the biconjugate gradient stabilized (BiCGStab method. For both methods, a sequential and a parallel version have been implemented. The parallel versions have been implemented on the compute unified device architecture (CUDA so that they can be run on a NVIDIA graphics processing unit (GPU. From our experiments we conclude that as the size of the matrix increases, the CUDA implementations outperform the sequential implementations. Furthermore, the BiCGStab method performs better than the Jacobi method for dense matrices, whereas the Jacobi method does better for sparse ones. Since the reachability probabilities problem plays a key role in probabilistic model checking, we also compared the implementations for matrices obtained from a probabilistic model checker. Our experiments support the conjecture by Bosnacki et al. that the Jacobi method is superior to Krylov subspace methods, a class to which the BiCGStab method belongs, for probabilistic model checking.
Supporting Real-Time Computer Vision Workloads using OpenVX on Multicore+GPU Platforms
2015-05-01
workloads specified using OpenVX to be supported in a predictable way. I. INTRODUCTION In the automotive industry today, vision-based sensing through cameras...Supporting Real-Time Computer Vision Workloads using OpenVX on Multicore+GPU Platforms Glenn A. Elliott, Kecheng Yang, and James H. Anderson...Department of Computer Science, University of North Carolina at Chapel Hill Abstract—In the automotive industry, there is currently great interest in
MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU
Energy Technology Data Exchange (ETDEWEB)
BAKER, ZACHARY K. [Los Alamos National Laboratory; GOKHALE, MAYA B. [Los Alamos National Laboratory; TRIPP, JUSTIN L. [Los Alamos National Laboratory
2007-01-08
The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problems of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that provide the optimal solution for a variety of application situations.
APES-based procedure for super-resolution SAR imagery with GPU parallel computing
Jia, Weiwei; Xu, Xiaojian; Xu, Guangyao
2015-10-01
The amplitude and phase estimation (APES) algorithm is widely used in modern spectral analysis. Compared with conventional Fourier transform (FFT), APES results in lower sidelobes and narrower spectral peaks. However, in synthetic aperture radar (SAR) imaging with large scene, without parallel computation, it is difficult to apply APES directly to super-resolution radar image processing due to its great amount of calculation. In this paper, a procedure is proposed to achieve target extraction and parallel computing of APES for super-resolution SAR imaging. Numerical experimental are carried out on Tesla K40C with 745 MHz GPU clock rate and 2880 CUDA cores. Results of SAR image with GPU parallel computing show that the parallel APES is remarkably more efficient than that of CPU-based with the same super-resolution.
CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.
Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan
2017-06-24
The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time in an acceptable level. Although there are a lot of work on MSA problems, their approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information of users' sequences, including the sizes of datasets and the lengths of sequences, can be of arbitrary values and are generally unknown before submitted, which are unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences. But its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider the MSA parallelization on GPU devices only, making the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn (2)) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPU is a promising approach to
Development of magnetron sputtering simulator with GPU parallel computing
Sohn, Ilyoup; Kim, Jihun; Bae, Junkyeong; Lee, Jinpil
2014-12-01
Sputtering devices are widely used in the semiconductor and display panel manufacturing process. Currently, a number of surface treatment applications using magnetron sputtering techniques are being used to improve the efficiency of the sputtering process, through the installation of magnets outside the vacuum chamber. Within the internal space of the low pressure chamber, plasma generated from the combination of a rarefied gas and an electric field is influenced interactively. Since the quality of the sputtering and deposition rate on the substrate is strongly dependent on the multi-physical phenomena of the plasma regime, numerical simulations using PIC-MCC (Particle In Cell, Monte Carlo Collision) should be employed to develop an efficient sputtering device. In this paper, the development of a magnetron sputtering simulator based on the PIC-MCC method and the associated numerical techniques are discussed. To solve the electric field equations in the 2-D Cartesian domain, a Poisson equation solver based on the FDM (Finite Differencing Method) is developed and coupled with the Monte Carlo Collision method to simulate the motion of gas particles influenced by an electric field. The magnetic field created from the permanent magnet installed outside the vacuum chamber is also numerically calculated using Biot-Savart's Law. All numerical methods employed in the present PIC code are validated by comparison with analytical and well-known commercial engineering software results, with all of the results showing good agreement. Finally, the developed PIC-MCC code is parallelized to be suitable for general purpose computing on graphics processing unit (GPGPU) acceleration, so as to reduce the large computation time which is generally required for particle simulations. The efficiency and accuracy of the GPGPU parallelized magnetron sputtering simulator are examined by comparison with the calculated results and computation times from the original serial code. It is found that
Toward Optimal Computation of Ultrasound Image Reconstruction Using CPU and GPU.
Techavipoo, Udomchai; Worasawate, Denchai; Boonleelakul, Wittawat; Keinprasit, Rachaporn; Sunpetchniyom, Treepop; Sugino, Nobuhiko; Thajchayapong, Pairash
2016-11-24
An ultrasound image is reconstructed from echo signals received by array elements of a transducer. The time of flight of the echo depends on the distance between the focus to the array elements. The received echo signals have to be delayed to make their wave fronts and phase coherent before summing the signals. In digital beamforming, the delays are not always located at the sampled points. Generally, the values of the delayed signals are estimated by the values of the nearest samples. This method is fast and easy, however inaccurate. There are other methods available for increasing the accuracy of the delayed signals and, consequently, the quality of the beamformed signals; for example, the in-phase (I)/quadrature (Q) interpolation, which is more time consuming but provides more accurate values than the nearest samples. This paper compares the signals after dynamic receive beamforming, in which the echo signals are delayed using two methods, the nearest sample method and the I/Q interpolation method. The comparisons of the visual qualities of the reconstructed images and the qualities of the beamformed signals are reported. Moreover, the computational speeds of these methods are also optimized by reorganizing the data processing flow and by applying the graphics processing unit (GPU). The use of single and double precision floating-point formats of the intermediate data is also considered. The speeds with and without these optimizations are also compared.
Toward Optimal Computation of Ultrasound Image Reconstruction Using CPU and GPU
Techavipoo, Udomchai; Worasawate, Denchai; Boonleelakul, Wittawat; Keinprasit, Rachaporn; Sunpetchniyom, Treepop; Sugino, Nobuhiko; Thajchayapong, Pairash
2016-01-01
An ultrasound image is reconstructed from echo signals received by array elements of a transducer. The time of flight of the echo depends on the distance between the focus to the array elements. The received echo signals have to be delayed to make their wave fronts and phase coherent before summing the signals. In digital beamforming, the delays are not always located at the sampled points. Generally, the values of the delayed signals are estimated by the values of the nearest samples. This method is fast and easy, however inaccurate. There are other methods available for increasing the accuracy of the delayed signals and, consequently, the quality of the beamformed signals; for example, the in-phase (I)/quadrature (Q) interpolation, which is more time consuming but provides more accurate values than the nearest samples. This paper compares the signals after dynamic receive beamforming, in which the echo signals are delayed using two methods, the nearest sample method and the I/Q interpolation method. The comparisons of the visual qualities of the reconstructed images and the qualities of the beamformed signals are reported. Moreover, the computational speeds of these methods are also optimized by reorganizing the data processing flow and by applying the graphics processing unit (GPU). The use of single and double precision floating-point formats of the intermediate data is also considered. The speeds with and without these optimizations are also compared. PMID:27886149
Toward Optimal Computation of Ultrasound Image Reconstruction Using CPU and GPU
Directory of Open Access Journals (Sweden)
Udomchai Techavipoo
2016-11-01
Full Text Available An ultrasound image is reconstructed from echo signals received by array elements of a transducer. The time of flight of the echo depends on the distance between the focus to the array elements. The received echo signals have to be delayed to make their wave fronts and phase coherent before summing the signals. In digital beamforming, the delays are not always located at the sampled points. Generally, the values of the delayed signals are estimated by the values of the nearest samples. This method is fast and easy, however inaccurate. There are other methods available for increasing the accuracy of the delayed signals and, consequently, the quality of the beamformed signals; for example, the in-phase (I/quadrature (Q interpolation, which is more time consuming but provides more accurate values than the nearest samples. This paper compares the signals after dynamic receive beamforming, in which the echo signals are delayed using two methods, the nearest sample method and the I/Q interpolation method. The comparisons of the visual qualities of the reconstructed images and the qualities of the beamformed signals are reported. Moreover, the computational speeds of these methods are also optimized by reorganizing the data processing flow and by applying the graphics processing unit (GPU. The use of single and double precision floating-point formats of the intermediate data is also considered. The speeds with and without these optimizations are also compared.
Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL
Shimobaba, Tomoyoshi; Masuda, Nobuyuki; Ichihashi, Yasuyuki; Takada, Naoki
2010-01-01
In this paper, we report fast calculation of a computer-generated-hologram using a new architecture of the HD5000 series GPU (RV870) made by AMD and its new software development environment, OpenCL. Using a RV870 GPU and OpenCL, we can calculate 1,920 * 1,024 resolution of a CGH from a 3D object consisting of 1,024 points in 30 milli-seconds. The calculation speed realizes a speed approximately two times faster than that of a GPU made by NVIDIA.
Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL.
Shimobaba, Tomoyoshi; Ito, Tomoyoshi; Masuda, Nobuyuki; Ichihashi, Yasuyuki; Takada, Naoki
2010-05-10
In this paper, we report fast calculation of a computer-generated-hologram using a new architecture of the HD5000 series GPU (RV870) made by AMD and its new software development environment, OpenCL. Using a RV870 GPU and OpenCL, we can calculate 1,920 x 1,024 resolution of a CGH from a 3D object consisting of 1,024 points in 30 milli-seconds. The calculation speed realizes a speed approximately two times faster than that of a GPU made by NVIDIA. (c) 2010 Optical Society of America.
Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction
Energy Technology Data Exchange (ETDEWEB)
Agulleiro, J.I.; Vazquez, F.; Garzon, E.M. [Supercomputing and Algorithms Group, Associated Unit CSIC-UAL, University of Almeria, 04120 Almeria (Spain); Fernandez, J.J., E-mail: JJ.Fernandez@csic.es [National Centre for Biotechnology, National Research Council (CNB-CSIC), Campus UAM, C/Darwin 3, Cantoblanco, 28049 Madrid (Spain)
2012-04-15
Modern computers are equipped with powerful computing engines like multicore processors and GPUs. The 3DEM community has rapidly adapted to this scenario and many software packages now make use of high performance computing techniques to exploit these devices. However, the implementations thus far are purely focused on either GPUs or CPUs. This work presents a hybrid approach that collaboratively combines the GPUs and CPUs available in a computer and applies it to the problem of tomographic reconstruction. Proper orchestration of workload in such a heterogeneous system is an issue. Here we use an on-demand strategy whereby the computing devices request a new piece of work to do when idle. Our hybrid approach thus takes advantage of the whole computing power available in modern computers and further reduces the processing time. This CPU+GPU co-processing can be readily extended to other image processing tasks in 3DEM. -- Highlights: Black-Right-Pointing-Pointer Hybrid computing allows full exploitation of the power (CPU+GPU) in a computer. Black-Right-Pointing-Pointer Proper orchestration of workload is managed by an on-demand strategy. Black-Right-Pointing-Pointer Total number of threads running in the system should be limited to the number of CPUs.
SEMANTIC-RELAXED NON-BLOCKING CONCURRENT QUEUE FOR GPU COMPUTING%基于 GPU 的语义松弛非阻塞并行队列研究
Institute of Scientific and Technical Information of China (English)
张翔宇; 邓仰东
2015-01-01
近年来，基于图形处理器 GPU 的通用计算逐渐成为主流计算模式。为了降低 GPU 程序设计的难度，提出一种适合于GPU 体系结构的非阻塞并行队列数据结构。通过对并行队列进行语义松弛，该数据结构能够有效利用队列操作的并行性。同时，还提出了高速并行队列插入和删除算法。使用线性化准则对该并行队列的正确性进行验证。实验表明，所提出的并发队列能够达到远高于目前多核 CPU 和 GPU 并行队列的性能，分别超越现有最好结果20倍和200倍以上。%Recent years have witnessed a strong momentum of general purpose computing on graphics processing units (GPUs).To ease the difficulty of developing highly efficient massively parallel programs on GPU,this paper introduces a non-blocking concurrent queue data structure suitable for GPUs architecture.By applying semantic-relaxation on concurrent queue,the proposed data structure is able to effectively make use of the concurrency of queuing operations.Meanwhile this paper also presents efficient insert and delete algorithms of high-speed concurrent queues.Experiments indicate that our concurrent queue significantly outperforms the performances of existing multi-core CPU and GPU concurrent queue data structures by 20 and 200 fold respectively.The correctness of the proposed concurrent queue is validated by the linearisability criteria.
New Combustion CFD Algorithms Designed for Rapid GPU Computations Project
National Aeronautics and Space Administration — We propose development of new algorithms specifically designed to exploit the highly parallel structure of graphics processing units (GPUs) for performing the...
Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction.
Agulleiro, J I; Vázquez, F; Garzón, E M; Fernández, J J
2012-04-01
Modern computers are equipped with powerful computing engines like multicore processors and GPUs. The 3DEM community has rapidly adapted to this scenario and many software packages now make use of high performance computing techniques to exploit these devices. However, the implementations thus far are purely focused on either GPUs or CPUs. This work presents a hybrid approach that collaboratively combines the GPUs and CPUs available in a computer and applies it to the problem of tomographic reconstruction. Proper orchestration of workload in such a heterogeneous system is an issue. Here we use an on-demand strategy whereby the computing devices request a new piece of work to do when idle. Our hybrid approach thus takes advantage of the whole computing power available in modern computers and further reduces the processing time. This CPU+GPU co-processing can be readily extended to other image processing tasks in 3DEM. Copyright © 2012 Elsevier B.V. All rights reserved.
Castaño-Díez, Daniel
2017-06-01
Dynamo is a package for the processing of tomographic data. As a tool for subtomogram averaging, it includes different alignment and classification strategies. Furthermore, its data-management module allows experiments to be organized in groups of tomograms, while offering specialized three-dimensional tomographic browsers that facilitate visualization, location of regions of interest, modelling and particle extraction in complex geometries. Here, a technical description of the package is presented, focusing on its diverse strategies for optimizing computing performance. Dynamo is built upon mbtools (middle layer toolbox), a general-purpose MATLAB library for object-oriented scientific programming specifically developed to underpin Dynamo but usable as an independent tool. Its structure intertwines a flexible MATLAB codebase with precompiled C++ functions that carry the burden of numerically intensive operations. The package can be delivered as a precompiled standalone ready for execution without a MATLAB license. Multicore parallelization on a single node is directly inherited from the high-level parallelization engine provided for MATLAB, automatically imparting a balanced workload among the threads in computationally intense tasks such as alignment and classification, but also in logistic-oriented tasks such as tomogram binning and particle extraction. Dynamo supports the use of graphical processing units (GPUs), yielding considerable speedup factors both for native Dynamo procedures (such as the numerically intensive subtomogram alignment) and procedures defined by the user through its MATLAB-based GPU library for three-dimensional operations. Cloud-based virtual computing environments supplied with a pre-installed version of Dynamo can be publicly accessed through the Amazon Elastic Compute Cloud (EC2), enabling users to rent GPU computing time on a pay-as-you-go basis, thus avoiding upfront investments in hardware and longterm software maintenance.
Institute of Scientific and Technical Information of China (English)
刘国峰; 刘钦; 李博; 佟小龙; 刘洪
2009-01-01
随着图形处理器(Graphic Processing Unit:GPU)在通用计算领域的日趋成熟,使GPU/CPU协同并行计算应用到油气勘探地震资料处理中,对诸多大规模计算的关键性环节有重大提升.本文阐明协同并行计算机的思路、架构及编程环境,着重分析其计算效率得以大幅度提升的关键所在.文中以地震资料处理中的叠前时间偏移和Gazdag深度偏移为切入点,展示样机测试结果的图像显示.显而易见,生产实践中,时常面临对诸多算法进行算法精度和计算速度之间的折中选择.本文阐明GPU/CPU样机协同计算具有高并行度,进而可在算法精度与计算速度的优化配置协调上获得广阔空间.笔者认为,本文的台式协同并行机研制思路及架构,或可作为地球物理配置高性能计算机全新选择的一项依据.%With the development of Graphic Processing Unit (GPU) in general colculations, it makes the GPU/CPU co-processing parallel computing used in seismic data processing of oil and gas exploration and get the great improvement in some important steps. In this paper, we introduce the idea ,architecture and coding environment for GPU and CPU co-processing with CUDA. At the same time,we pay more attention to point out why this method can greatly improve the efficiency. We use the pre-stack time migration and Gazdag depth migration in seismic data processing as the start point of this paper, and the figures can tell us the result of these tests. Obviously ,we often face the situation that we must abandon some good method because of some problems about computing speed, but with the GPU we will have more choices. The result sticks out a mile that the samples in this paper have more advantages than the common PC-Cluster,the new architecture of making parallel computer on the desk expressed in this paper may be one of another choice for high performance computing in the field of geophysics.
GPU-accelerated computing for Lagrangian coherent structures of multi-body gravitational regimes
Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
2017-04-01
Based on a well-established theoretical foundation, Lagrangian Coherent Structures (LCSs) have elicited widespread research on the intrinsic structures of dynamical systems in many fields, including the field of astrodynamics. Although the application of LCSs in dynamical problems seems straightforward theoretically, its associated computational cost is prohibitive. We propose a block decomposition algorithm developed on Compute Unified Device Architecture (CUDA) platform for the computation of the LCSs of multi-body gravitational regimes. In order to take advantage of GPU's outstanding computing properties, such as Shared Memory, Constant Memory, and Zero-Copy, the algorithm utilizes a block decomposition strategy to facilitate computation of finite-time Lyapunov exponent (FTLE) fields of arbitrary size and timespan. Simulation results demonstrate that this GPU-based algorithm can satisfy double-precision accuracy requirements and greatly decrease the time needed to calculate final results, increasing speed by approximately 13 times. Additionally, this algorithm can be generalized to various large-scale computing problems, such as particle filters, constellation design, and Monte-Carlo simulation.
Zhang, Lei; Gao, Jiao Bo; Hu, Yu; Wang, Ying Hui; Sun, Ke Feng; Cheng, Juan; Sun, Dan Dan; Li, Yu
2017-02-01
During the research of hyper-spectral imaging spectrometer, how to process the huge amount of image data is a difficult problem for all researchers. The amount of image data is about the order of magnitude of several hundred megabytes per second. The only way to solve this problem is parallel computing technology. With the development of multi-core CPU and GPU parallel computing on multi-core CPU or GPU is increasingly applied in large-scale data processing. In this paper, we propose a new parallel computing solution of hyper-spectral data processing which is based on the multi-CPU and multi-GPU heterogeneous computing platform. We use OpenMP technology to control multi-core CPU, we also use CUDA to schedule the parallel computing on multi-GPU. Experimental results show that the speed of hyper-spectral data processing on the multi-CPU and multi-GPU heterogeneous computing platform is apparently faster than the traditional serial algorithm which is run on single core CPU. Our research has significant meaning for the engineering application of the windowing Fourier transform imaging spectrometer.
Computation of electron quantum transport in graphene nanoribbons using GPU
Ihnatsenka, S
2011-01-01
The performance potential for simulating quantum electron transport on graphical processing units (GPUs) is studied. Using graphene ribbons of realistic sizes as an example it is shown that GPUs provide significant speed-ups in comparison to central processing units as the transverse dimension of the ribbon grows. The recursive Green's function algorithm is employed and implementation details on GPUs are discussed. Calculated conductances were found to accumulate significant numerical error due to single-precision floating-point arithmetic at energies close to the charge neutrality point of the graphene.
Genetic Algorithm Modeling with GPU Parallel Computing Technology
Cavuoti, Stefano; Brescia, Massimo; Pescapé, Antonio; Longo, Giuseppe; Ventre, Giorgio
2012-01-01
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from a multi-core CPU serial implementation, named GAME, already scientifically successfully tested and validated on astrophysical massive data classification problems, through a web application resource (DAMEWARE), specialized in data mining based on Machine Learning paradigms. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm has provided an exploit of the internal training features of the model, permitting a strong optimization in terms of processing performances and scalability.
Directory of Open Access Journals (Sweden)
Ali Dashti
Full Text Available This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG construction for ultra-large high-dimensional data cloud. The proposed method uses Graphics Processing Units (GPUs and is scalable with multi-levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU. The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible [Formula: see text]-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.
Directory of Open Access Journals (Sweden)
Yingnian Wu
2014-01-01
Full Text Available Electromagnetic calculation plays an important role in both military and civic fields. Some methods and models proposed for calculation of electromagnetic wave propagation in a large range bring heavy burden in CPU computation and also require huge amount of memory. Using the GPU to accelerate computation and visualization can reduce the computational burden on the CPU. Based on forward ray-tracing method, a transmission particle model (TPM for calculating electromagnetic field is presented to combine the particle method. The movement of a particle obeys the principle of the propagation of electromagnetic wave, and then the particle distribution density in space reflects the electromagnetic distribution status. The algorithm with particle transmission, movement, reflection, and diffraction is described in detail. Since the particles in TPM are completely independent, it is very suitable for the parallel computing based on GPU. Deduction verification of TPM with the electric dipole antenna as the transmission source is conducted to prove that the particle movement itself represents the variation of electromagnetic field intensity caused by diffusion. Finally, the simulation comparisons are made against the forward and backward ray-tracing methods. The simulation results verified the effectiveness of the proposed method.
Interesting Spatio-Temporal Region Discovery Computations Over Gpu and Mapreduce Platforms
McDermott, M.; Prasad, S. K.; Shekhar, S.; Zhou, X.
2015-07-01
Discovery of interesting paths and regions in spatio-temporal data sets is important to many fields such as the earth and atmospheric sciences, GIS, public safety and public health both as a goal and as a preliminary step in a larger series of computations. This discovery is usually an exhaustive procedure that quickly becomes extremely time consuming to perform using traditional paradigms and hardware and given the rapidly growing sizes of today's data sets is quickly outpacing the speed at which computational capacity is growing. In our previous work (Prasad et al., 2013a) we achieved a 50 times speedup over sequential using a single GPU. We were able to achieve near linear speedup over this result on interesting path discovery by using Apache Hadoop to distribute the workload across multiple GPU nodes. Leveraging the parallel architecture of GPUs we were able to drastically reduce the computation time of a 3-dimensional spatio-temporal interest region search on a single tile of normalized difference vegetative index for Saudi Arabia. We were further able to see an almost linear speedup in compute performance by distributing this workload across several GPUs with a simple MapReduce model. This increases the speed of processing 10 fold over the comparable sequential while simultaneously increasing the amount of data being processed by 384 fold. This allowed us to process the entirety of the selected data set instead of a constrained window.
Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
2016-06-01
CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
Miki, T; Wang, X; Aoki, T; Imai, Y; Ishikawa, T; Takase, K; Yamaguchi, T
2012-01-01
In this paper, we propose a novel patient-specific method of modelling pulmonary airflow using graphics processing unit (GPU) computation that can be applied in medical practice. To overcome the barriers imposed by computation speed, installation price and footprint to the application of computational fluid dynamics, we focused on GPU computation and the lattice Boltzmann method (LBM). The GPU computation and LBM are compatible due to the characteristics of the GPU. As the optimisation of data access is essential for the performance of the GPU computation, we developed an adaptive meshing method, in which an airway model is covered by isotropic subdomains consisting of a uniform Cartesian mesh. We found that 4(3) size subdomains gave the best performance. The code was also tested on a small GPU cluster to confirm its performance and applicability, as the price and footprint are reasonable for medical applications.
2014-05-01
Parallelization and vectorization on the GPU is achieved with modifying the code syntax for compatibility with CUDA . We assess the speedup due to various...ExaScience Lab in Leuven, Belgium) and compare it with the performance of a GPU unit running CUDA . We implement a test case of a 1D two-stream instability...programming language syntax only in the GPU / CUDA version of the code and these changes do not have any significant impact on the final performance. 2
GPU-acceleration of parallel unconditionally stable group explicit finite difference method
Parand, K; Hossayni, Sayyed A
2013-01-01
Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this paper is to evaluate the impact of using GPU in solution of the transient diffusion type equation by parallel and stable group explicit finite difference method. To accomplish that, GPU and CPU-based (multi-core) approaches were developed. Moreover, we proposed an optimal synchronization arrangement for its implementation pseudo-code. Also, the interrelation of GPU parallel programming and initializing the algorithm variables was discussed, using numerical experiences. The GPU-approach results are faster than a much expensive parallel 8-thread CPU-based approach results. The GPU, used in this paper, is an ordinary laptop GPU (GT 335M) and is accessible for e...
Putnam, William M.
2011-01-01
Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breath-taking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions
Numerical Integration with Graphical Processing Unit for QKD Simulation
2014-03-27
existing and proposed Quantum Key Distribution (QKD) systems. This research investigates using graphical processing unit ( GPU ) technology to more...Time Pad GPU graphical processing unit API application programming interface CUDA Compute Unified Device Architecture SIMD single-instruction-stream...and can be passed by value or reference [2]. 2.3 Graphical Processing Units Programming with graphical processing unit ( GPU ) requires a different
A GPU-Based Visualization Method for Computing Dark Matter Annihilation Signal
Yang, L.; Szalay, A.
2013-10-01
We present a novel GPU-based visualization method for computing the dark matter annihilation signal for cosmological dark matter simulations. The technique increased the speed of rendering by more than 1,000 times. In a previous study, using a code running on regular CPUs, each particle's contribution was explicitly calculated pixel by pixel over a HEALPIX map, then remapped onto a Molleweide projection. Using Via Lactea II simulation (˜ 400M particles), it takes over 7 hours for a single thread CPU (˜3 GHz) to complete an all-sky map with NSIDE=512 resolution. Our novel method is based on a separate stereographic projection for each hemisphere, and a hardware accelerated rendering pipeline on a GPU (OpenGL). We project the particles instead of the celestial sphere to the tangent plane with a skewed flux profile appropriate for the STR projection. OpenGL's Point Sprite feature and shader language allow us to render those eccentric circular flux profiles at the rate of more than 10M particles per second. The new method can process a single snapshot of the Via Lactea II data in less than 1 minute with a single NVIDIA GTX 480 GPU, including I/O, with effective rendering time less than 24 seconds. Using an approximate normalization for the flux, accurate to 2.5% in total flux, the rendering can be done in less than 13 seconds. The stereographic images corresponding to the two hemispheres are then warped to an all-sky image in the Molleweide projection, and are in good agreement with the result from the regular CPU code, at similar resolution.
Lonardo, Alessandro; Ammendola, Roberto; Biagioni, Andrea; Cotta Ramusino, Angelo; Fiorini, Massimiliano; Frezza, Ottorino; Lamanna, Gianluca; Lo Cicero, Francesca; Martinelli, Michele; Neri, Ilaria; Paolucci, Pier Stanislao; Pastorelli, Elena; Pontisso, Luca; Rossetti, Davide; Simeone, Francesco; Simula, Francesco; Sozzi, Marco; Tosoratto, Laura; Vicini, Piero
2015-01-01
The capability of processing high bandwidth data streams in real-time is a computational requirement common to many High Energy Physics experiments. Keeping the latency of the data transport tasks under control is essential in order to meet this requirement. We present NaNet, a FPGA-based PCIe Network Interface Card design featuring Remote Direct Memory Access towards CPU and GPU memories plus a transport protocol offload module characterized by cycle-accurate upper-bound handling. The combination of these two features allows to relieve almost entirely the OS and the application from data tranfer management, minimizing the unavoidable jitter effects associated to OS process scheduling. The design currently supports one GbE (1000Base-T) and three custom 34 Gbps APElink I/O channels, but four-channels 10GbE (10Base-R) and 2.5 Gbps deterministic latency KM3link versions are being implemented. Two use cases of NaNet will be discussed: the GPU-based low level trigger for the RICH detector in the NA62 experiment an...
CSIR Research Space (South Africa)
Duvenhage, B
2010-06-01
Full Text Available . This paper presents the details of a real-time implementation of the Lucas- Kanade image registration algorithm on a Graphics Processing Unit (GPU) using the OpenGL Shading Language (GLSL). The implementation is driven by a real world requirement...
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.
Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto
2015-02-13
In the last few years, the Non-negative Matrix Factorization ( NMF ) technique has gained a great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessors cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units ( GPUs ). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA ( Compute Unified Device Architecture ), the NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI ( Message Passing Interface ). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the
Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan
2017-08-04
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
MIGS-GPU: Microarray Image Gridding and Segmentation on the GPU.
Katsigiannis, Stamos; Zacharia, Eleni; Maroulis, Dimitris
2016-03-03
cDNA microarray is a powerful tool for simultaneously studying the expression level of thousands of genes. Nevertheless, the analysis of microarray images remains an arduous and challenging task due to the poor quality of the images which often suffer from noise, artifacts, and uneven background. In this work, the MIGS-GPU (Microarray Image Gridding and Segmentation on GPU) software for gridding and segmenting microarray images is presented. MIGS-GPU's computations are performed on the graphics processing unit (GPU) by means of the CUDA architecture in order to achieve fast performance and increase the utilization of available system resources. Evaluation on both real and synthetic cDNA microarray images showed that MIGS-GPU provides better performance than state-of-the-art alternatives, while the proposed GPU implementation achieves significantly lower computational times compared to the respective CPU approaches. Consequently, MIGS-GPU can be an advantageous and useful tool for biomedical laboratories, offering a userfriendly interface that requires minimum input in order to run.
Tsuchimoto, Masashi; Tanimura, Yoshitaka
2015-08-11
A system with many energy states coupled to a harmonic oscillator bath is considered. To study quantum non-Markovian system-bath dynamics numerically rigorously and nonperturbatively, we developed a computer code for the reduced hierarchy equations of motion (HEOM) for a graphics processor unit (GPU) that can treat the system as large as 4096 energy states. The code employs a Padé spectrum decomposition (PSD) for a construction of HEOM and the exponential integrators. Dynamics of a quantum spin glass system are studied by calculating the free induction decay signal for the cases of 3 × 2 to 3 × 4 triangular lattices with antiferromagnetic interactions. We found that spins relax faster at lower temperature due to transitions through a quantum coherent state, as represented by the off-diagonal elements of the reduced density matrix, while it has been known that the spins relax slower due to suppression of thermal activation in a classical case. The decay of the spins are qualitatively similar regardless of the lattice sizes. The pathway of spin relaxation is analyzed under a sudden temperature drop condition. The Compute Unified Device Architecture (CUDA) based source code used in the present calculations is provided as Supporting Information .
Directory of Open Access Journals (Sweden)
Kgotlaetsile Mathews Modieginyane
2013-06-01
Full Text Available Recent advances in computing such as the massively parallel GPUs (Graphical Processing Units,coupledwith the need to store and deliver large quantities of digital data especially images, has brought a numberof challenges for Computer Scientists, the research community and other stakeholders. These challenges,such as prohibitively large costs to manipulate the digital data amongst others, have been the focus of theresearch community in recent years and has led to the investigation of image compression techniques thatcan achieve excellent results. One such technique is the Discrete Cosine Transform, which helps separatean image into parts of differing frequencies and has the advantage of excellent energy-compaction.This paper investigates the use of the Compute Unified Device Architecture (CUDA programming modelto implement the DCT based Cordic based Loeffler algorithm for efficient image compression. Thecomputational efficiency is analyzed and evaluated under both the CPU and GPU. The PSNR (Peak Signalto Noise Ratio is used to evaluate image reconstruction quality in this paper. The results are presentedand discussed.
Proceedings of the GPU computing in high-energy physics conference 2014 GPUHEP2014
Energy Technology Data Exchange (ETDEWEB)
Bonati, Claudio; D' Elia, Massimo; Lamanna, Gianluca; Sozzi, Marco (eds.)
2015-06-15
The International Conference on GPUs in High-Energy Physics was held from September 10 to 12, 2014 at the University of Pisa, Italy. It represented a larger scale follow-up to a set of workshops which indicated the rising interest of the HEP community, experimentalists and theorists alike, towards the use of inexpensive and massively parallel computing devices, for very diverse purposes. The conference was organized in plenary sessions of invited and contributed talks, and poster presentations on the following topics: - GPUs in triggering applications - Low-level trigger systems based on GPUs - Use of GPUs in high-level trigger systems - GPUs in tracking and vertexing - Challenges for triggers in future HEP experiments - Reconstruction and Monte Carlo software on GPUs - Software frameworks and tools for GPU code integration - Hard real-time use of GPUs - Lattice QCD simulation - GPUs in phenomenology - GPUs for medical imaging purposes - GPUs in neutron and photon science - Massively parallel computations in HEP - Code parallelization. ''GPU computing in High-Energy Physics'' attracted 78 registrants to Pisa. The 38 oral presentations included talks on specific topics in experimental and theoretical applications of GPUs, as well as review talks on applications and technology. 5 posters were also presented, and were introduced by a short plenary oral illustration. A company exhibition was hosted on site. The conference consisted of 12 plenary sessions, together with a social program which included a banquet and guided excursions around Pisa. It was overall an enjoyable experience, offering an opportunity to share ideas and opinions, and getting updated on other participants' work in this emerging field, as well as being a valuable introduction for newcomers interested to learn more about the use of GPUs as accelerators for scientific progress on the elementary constituents of matter and energy.
Importance of Explicit Vectorization for CPU and GPU Software Performance
Dickson, Neil G; Hamze, Firas
2010-01-01
Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to speedup from multi-threading. This is 2x faster than the fully-optimized GPU version.
Institute of Scientific and Technical Information of China (English)
李龙飞; 贺占庄; 徐丹妮
2014-01-01
以CUDA架构为例，对传统的CPU＋单GPU架构进行了分析，提出了一种CPU＋多GPU异构协同计算的系统方案，对关键的CPU对多GPU的管理及多GPU间数据通信等问题做了重点讨论，从理论上进行了可行性分析，并提出了相应的优化方法。%In this paper ,we analyzed the traditional CPU + single GPU architecture ,and put forward a CPU +multi-GPU heterogeneous collaborative computing system .We focused on the CPU management for multi-GPU and data communication between multi-GPU ,then theoretically analyzed the feasibility ,finally gave the corresponding optimization methods .
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second.
A Fast GPU-accelerated Mixed-precision Strategy for Fully NonlinearWater Wave Computations
DEFF Research Database (Denmark)
Glimberg, Stefan Lemvig; Engsig-Karup, Allan Peter; Madsen, Morten G.
2011-01-01
We present performance results of a mixed-precision strategy developed to improve a recently developed massively parallel GPU-accelerated tool for fast and scalable simulation of unsteady fully nonlinear free surface water waves over uneven depths (Engsig-Karup et.al. 2011). The underlying wave...... model is based on a potential flow formulation, which requires efficient solution of a Laplace problem at large-scales. We report recent results on a new mixed-precision strategy for efficient iterative high-order accurate and scalable solution of the Laplace problem using a multigrid......-preconditioned defect correction method. The improved strategy improves the performance by exploiting architectural features of modern GPUs for mixed precision computations and is tested in a recently developed generic library for fast prototyping of PDE solvers. The new wave tool is applicable to solve and analyze...
CULA: hybrid GPU accelerated linear algebra routines
Humphrey, John R.; Price, Daniel K.; Spagnoli, Kyle E.; Paolini, Aaron L.; Kelmelis, Eric J.
2010-04-01
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N3) operations and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via hybrid a processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
Energy Technology Data Exchange (ETDEWEB)
Shyles, Daniel [University of Tennessee (UT); Dongarra, Jack J. [University of Tennessee, Knoxville (UTK); Guidry, Mike W. [ORNL; Tomov, Stanimire Z. [ORNL; Billings, Jay Jay [ORNL; Brock, Benjamin A. [ORNL; Haidar Ahmad, Azzam A. [ORNL
2016-09-01
Abstract—We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms that solve efficiently N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that similarly to many other scientific applications, ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small-size. As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, that we call batched routine, in order to saturate the hardware with enough work.
3D- VISUALIZATION BY RAYTRACING IMAGE SYNTHESIS ON GPU
Directory of Open Access Journals (Sweden)
Al-Oraiqat Anas M.
2016-06-01
Full Text Available This paper presents a realization of the approach to spatial 3D stereo of visualization of 3D images with use parallel Graphics processing unit (GPU. The experiments of realization of synthesis of images of a 3D stage by a method of trace of beams on GPU with Compute Unified Device Architecture (CUDA have shown that 60 % of the time is spent for the decision of a computing problem approximately, the major part of time (40 % is spent for transfer of data between the central processing unit and GPU for calculations and the organization process of visualization. The study of the influence of increase in the size of the GPU network at the speed of calculations showed importance of the correct task of structure of formation of the parallel computer network and general mechanism of parallelization.
Rueda, Antonio J.; Noguera, José M.; Luque, Adrián
2016-02-01
In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. Virtually with this technology the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm in CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We advance that although OpenACC can not match the performance of a CUDA optimized implementation (×3.5 slower in average), it provides a significant performance improvement against a CPU implementation (×2-6) with by far a simpler code and less implementation effort.
Li, Zengguang; Xi, Xiaoqi; Han, Yu; Yan, Bin; Li, Lei
2016-10-01
The circle-plus-line trajectory satisfies the exact reconstruction data sufficiency condition, which can be applied in C-arm X-ray Computed Tomography (CT) system to increase reconstruction image quality in a large cone angle. The m-line reconstruction algorithm is adopted for this trajectory. The selection of the direction of m-lines is quite flexible and the m-line algorithm needs less data for accurate reconstruction compared with FDK-type algorithms. However, the computation complexity of the algorithm is very large to obtain efficient serial processing calculations. The reconstruction speed has become an important issue which limits its practical applications. Therefore, the acceleration of the algorithm has great meanings. Compared with other hardware accelerations, the graphics processing unit (GPU) has become the mainstream in the CT image reconstruction. GPU acceleration has achieved a better acceleration effect in FDK-type algorithms. But the implementation of the m-line algorithm's acceleration for the circle-plus-line trajectory is different from the FDK algorithm. The parallelism of the circular-plus-line algorithm needs to be analyzed to design the appropriate acceleration strategy. The implementation can be divided into the following steps. First, selecting m-lines to cover the entire object to be rebuilt; second, calculating differentiated back projection of the point on the m-lines; third, performing Hilbert filtering along the m-line direction; finally, the m-line reconstruction results need to be three-dimensional-resembled and then obtain the Cartesian coordinate reconstruction results. In this paper, we design the reasonable GPU acceleration strategies for each step to improve the reconstruction speed as much as possible. The main contribution is to design an appropriate acceleration strategy for the circle-plus-line trajectory m-line reconstruction algorithm. Sheep-Logan phantom is used to simulate the experiment on a single K20 GPU. The
Tanner, David E; Phillips, James C; Schulten, Klaus
2012-07-10
Molecular dynamics methodologies comprise a vital research tool for structural biology. Molecular dynamics has benefited from technological advances in computing, such as multi-core CPUs and graphics processing units (GPUs), but harnessing the full power of hybrid GPU/CPU computers remains difficult. The generalized Born/solvent-accessible surface area implicit solvent model (GB/SA) stands to benefit from hybrid GPU/CPU computers, employing the GPU for the GB calculation and the CPU for the SA calculation. Here, we explore the computational challenges facing GB/SA calculations on hybrid GPU/CPU computers and demonstrate how NAMD, a parallel molecular dynamics program, is able to efficiently utilize GPUs and CPUs simultaneously for fast GB/SA simulations. The hybrid computation principles demonstrated here are generally applicable to parallel applications employing hybrid GPU/CPU calculations.
Triggering events with GPU at ATLAS
Kama, Sami; The ATLAS collaboration
2015-01-01
The growing complexity of events produced in LHC collisions demands more and more computing power both for the on line selection and for the offline reconstruction of events. In recent years, the explosive performance growth of massively parallel processors like Graphical Processing Units both in computing power and in low energy consumption, make GPU extremely attractive for using them in a complex high energy experiment like ATLAS. Together with the optimization of reconstruction algorithms exploiting this new massively parallel paradigm, a small scale prototype of the full ATLAS High Level Trigger exploiting GPU has been implemented. We discuss the integration procedure of this prototype, the achieved performance and the prospects for the future.
GPU上计算流体力学的加速%Acceleration of Computational Fluid Dynamics Codes on GPU
Institute of Scientific and Technical Information of China (English)
董廷星; 李新亮; 李森; 迟学斌
2011-01-01
Computational Fluid Dynamic (CFD) codes based on incompressible Navier-Stokes, compressible Euler and compressible Navier-Stokes solvers are ported on NVIDIA GPU. As validation test, we have simulated a two-dimension cavity flow, Riemann problem and a transonic flow over a RAE2822 airfoil. Maximum 33.2x speedup is reported in our test. To maximum the GPU code performance, we also explore a number of GPU-specific optimization strategies. It demonstrates GPU code gives the expected results compared CPU code and experimental result and GPU computing has good compatibility and bright future.%本文将计算流体力学中的可压缩的纳维叶-斯托克斯(Navier-Stokes),不可压缩的Navier-Stokes和欧拉(Euler)方程移植到NVIDIA GPU上.模拟了3个测试例子,2维的黎曼问题,方腔流问题和RAE2822型的机翼绕流.相比于CPU,我们在GPU平台上最高得到了33.2倍的加速比.为了最大程度提高代码的性能,针对GPU平台上探索了几种优化策略.和CPU以及实验结果对比表明,利用计算流体力学在GPU平台上能够得到预想的结果,具有很好的应用前景.
Bustamam, Alhadi; Burrage, Kevin; Hamilton, Nicholas A
2012-01-01
Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However,with increasing vast amount of data on biological networks, performance and scalability issues are becoming a critical limiting factor in applications. Meanwhile, GPU computing, which uses CUDA tool for implementing a massively parallel computing environment in the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve substantial performance gains over CPU approaches. The use of on-chip memory on the GPU is efficiently lowering the latency time, thus, circumventing a major issue in other parallel computing environments, such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized ELLPACK-R sparse format to allow the effective and fine-grain massively parallel processing to cope with the sparse nature of interaction networks data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on CPU. Thus, large-scale parallel computation on off-the-shelf desktop-machines, that were previously only possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.
GPU-accelerated voxelwise hepatic perfusion quantification.
Wang, H; Cao, Y
2012-09-07
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using compute unified device architecture-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, nonlinear least-squares fitting the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronically. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using a NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626 400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameters differences less than 10(-6). The method will be useful for generating liver perfusion images in clinical settings.
GPU Accelerated Surgical Simulators for Complex Morhpology
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
a springmass system in order to simulate a complex organ such as the heart. Computations are accelerated by taking advantage of modern graphics processing units (GPUs). Two GPU implementations are presented. They vary in their generality of spring connections and in the speedup factor they achieve...
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.
2011-07-01
We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses summed absolute difference error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation, we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and noninteger search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a noninteger search grid. The additional speedup for a noninteger search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized nonfull grid search CPU-based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and simplified unsymmetrical multi-hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720 × 480 pixels in resolution commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
Energy Technology Data Exchange (ETDEWEB)
Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Arcas, G. de; Ruiz, M. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Vega, J. [Asociacion EURATOM/CIEMAT para Fusion, Madrid (Spain); Lopez, J.M.; Barrera, E. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Castro, R. [Asociacion EURATOM/CIEMAT para Fusion, Madrid (Spain); Sanz, D. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Utzel, N.; Makijarvi, P.; Zabeo, L. [ITER Organization, CS 90 046, 13067 St. Paul lez Durance Cedex (France)
2012-12-15
Highlights: Black-Right-Pointing-Pointer Implementation of fast plant system controller (FPSC) for ITER CODAC. Black-Right-Pointing-Pointer GPU-based real time high performance computing service. Black-Right-Pointing-Pointer Performance evaluation with respect to other solutions based in multi-core processors. - Abstract: EURATOM/CIEMAT and the Technical University of Madrid UPM are involved in the development of a FPSC (fast plant system control) prototype for ITER based on PXIe form factor. The FPSC architecture includes a GPU-based real time high performance computing service which has been integrated under EPICS (experimental physics and industrial control system). In this work we present the design of this service and its performance evaluation with respect to other solutions based in multi-core processors. Plasma pre-processing algorithms, illustrative of the type of tasks that could be required for both control and diagnostics, are used during the performance evaluation.
Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units.
Cawkwell, M J; Sanville, E J; Mniszewski, S M; Niklasson, Anders M N
2012-11-13
The self-consistent solution of a Schrödinger-like equation for the density matrix is a critical and computationally demanding step in quantum-based models of interatomic bonding. This step was tackled historically via the diagonalization of the Hamiltonian. We have investigated the performance and accuracy of the second-order spectral projection (SP2) algorithm for the computation of the density matrix via a recursive expansion of the Fermi operator in a series of generalized matrix-matrix multiplications. We demonstrate that owing to its simplicity, the SP2 algorithm [Niklasson, A. M. N. Phys. Rev. B2002, 66, 155115] is exceptionally well suited to implementation on graphics processing units (GPUs). The performance in double and single precision arithmetic of a hybrid GPU/central processing unit (CPU) and full GPU implementation of the SP2 algorithm exceed those of a CPU-only implementation of the SP2 algorithm and traditional matrix diagonalization when the dimensions of the matrices exceed about 2000 × 2000. Padding schemes for arrays allocated in the GPU memory that optimize the performance of the CUBLAS implementations of the level 3 BLAS DGEMM and SGEMM subroutines for generalized matrix-matrix multiplications are described in detail. The analysis of the relative performance of the hybrid CPU/GPU and full GPU implementations indicate that the transfer of arrays between the GPU and CPU constitutes only a small fraction of the total computation time. The errors measured in the self-consistent density matrices computed using the SP2 algorithm are generally smaller than those measured in matrices computed via diagonalization. Furthermore, the errors in the density matrices computed using the SP2 algorithm do not exhibit any dependence of system size, whereas the errors increase linearly with the number of orbitals when diagonalization is employed.
Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures
Energy Technology Data Exchange (ETDEWEB)
Dong, Fengguang [Univ. of Tennessee, Knoxville, TN (United States); Tomov, Stanimire [Univ. of Tennessee, Knoxville, TN (United States); Dongarra, Jack [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
2011-06-01
We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations e ciently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two di erent tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and e ciency of our approach.
SecureMed: Secure Medical Computation using GPU-Accelerated Homomorphic Encryption Scheme.
Khedr, Alhassan; Gulak, Glenn
2017-01-23
Sharing the medical records of individuals among healthcare providers and researchers around the world can accelerate advances in medical research. While the idea seems increasingly practical due to cloud data services, maintaining patient privacy is of paramount importance. Standard encryption algorithms help protect sensitive data from outside attackers but they cannot be used to compute on this sensitive data while being encrypted. Homomorphic Encryption (HE) presents a very useful tool that can compute on encrypted data without the need to decrypt it. In this work, we describe an optimized NTRUbased implementation of the GSW homomorphic encryption scheme. Our results show a factor of 58 × improvement in CPU performance compared to other recent work on encrypted medical data under the same security settings. Our system is built to be easily portable to GPUs resulting in an additional speedup of up to a factor of 104 × (and 410 × ) to offer an overall speedup of 6085 × (and 24011 × ) using a single GPU (or four GPUs), respectively.
Bayesian Lasso and multinomial logistic regression on GPU.
Češnovar, Rok; Štrumbelj, Erik
2017-01-01
We describe an efficient Bayesian parallel GPU implementation of two classic statistical models-the Lasso and multinomial logistic regression. We focus on parallelizing the key components: matrix multiplication, matrix inversion, and sampling from the full conditionals. Our GPU implementations of Bayesian Lasso and multinomial logistic regression achieve 100-fold speedups on mid-level and high-end GPUs. Substantial speedups of 25 fold can also be achieved on older and lower end GPUs. Samplers are implemented in OpenCL and can be used on any type of GPU and other types of computational units, thereby being convenient and advantageous in practice compared to related work.
GPU real-time processing in NA62 trigger system
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-01-01
A commercial Graphics Processing Unit (GPU) is used to build a fast Level 0 (L0) trigger system tested parasitically with the TDAQ (Trigger and Data Acquisition systems) of the NA62 experiment at CERN. In particular, the parallel computing power of the GPU is exploited to perform real-time fitting in the Ring Imaging CHerenkov (RICH) detector. Direct GPU communication using a FPGA-based board has been used to reduce the data transmission latency. The performance of the system for multi-ring reconstrunction obtained during the NA62 physics run will be presented.
Energy Technology Data Exchange (ETDEWEB)
Almeida, Adino Americo Heimlich
2009-07-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in two typical problems of Nuclear area. The neutron transport simulation using Monte Carlo method and solve heat equation in a bi-dimensional domain by finite differences method. To achieve this, we develop parallel algorithms for GPU and CPU in the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad core processors, without precision loss. (author)
Directory of Open Access Journals (Sweden)
Hee-Min Choi
2015-06-01
Full Text Available An accelerated spatial redundancy-based novel-look-up-table (A-SR-NLUT method based on a new concept of the N-point one-dimensional sub-principal fringe pattern (N-point1-D sub-PFP is implemented on a graphics processing unit (GPU for fast calculation of computer-generated holograms (CGHs of three-dimensional (3-Dobjects. Since the proposed method can generate the N-point two-dimensional (2-D PFPs for CGH calculation from the pre-stored N-point 1-D PFPs, the loading time of the N-point PFPs on the GPU can be dramatically reduced, which results in a great increase of the computational speed of the proposed method. Experimental results confirm that the average calculation time for one-object point has been reduced by 49.6% and 55.4% compared to those of the conventional 2-D SR-NLUT methods for each case of the 2-point and 3-point SR maps, respectively.
GPU Accelerated Vector Median Filter
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three channel color images, a widely used technique is vector median filter in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n(sup 2) vectors has to be compared with other n(sup 2) - 1 vectors in distances. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work. NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering. which has to the best of our knowledge never been done before. The performance of GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes, Initial findings of the study showed 100x improvement of performance of vector median filter implementation on GPUs over CPU implementations and further speed-up is expected after more extensive optimizations of the GPU algorithm .
GPU Pro 5 advanced rendering techniques
Engel, Wolfgang
2014-01-01
In GPU Pro5: Advanced Rendering Techniques, section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Marius Bjorge have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book covers rendering, lighting, effects in image space, mobile devices, 3D engine design, and compute. It explores rasterization of liquids, ray tracing of art assets that would otherwise be used in a rasterized engine, physically based area lights, volumetric light
Colloquium: Large scale simulations on GPU clusters
Bernaschi, Massimo; Bisson, Mauro; Fatica, Massimiliano
2015-06-01
Graphics processing units (GPU) are currently used as a cost-effective platform for computer simulations and big-data processing. Large scale applications require that multiple GPUs work together but the efficiency obtained with cluster of GPUs is, at times, sub-optimal because the GPU features are not exploited at their best. We describe how it is possible to achieve an excellent efficiency for applications in statistical mechanics, particle dynamics and networks analysis by using suitable memory access patterns and mechanisms like CUDA streams, profiling tools, etc. Similar concepts and techniques may be applied also to other problems like the solution of Partial Differential Equations.
Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.
2013-03-01
The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and time processing. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the size simulation. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
Wong, Un-Hong; Aoki, Takayuki; Wong, Hon-Cheng
2014-07-01
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems are needed to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speedup the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speedup the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to its low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experiment results show twice the FLOPS comparing to common 2D decomposition MPI-only implementation method. The proposed approach has been developed in an efficient implementation for MHD simulations on distributed multi-GPU systems, called MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 12003 grid points using 216 GPUs.
Performance evaluation of image processing algorithms on the GPU.
Castaño-Díez, Daniel; Moser, Dominik; Schoenegger, Andreas; Pruggnaller, Sabine; Frangakis, Achilleas S
2008-10-01
The graphics processing unit (GPU), which originally was used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. In the meanwhile, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) are dependent on the particular application, as the GPU architecture gives the best performance for algorithms that exhibit high data parallelism and high arithmetic intensity. Here, we evaluate the performance of the GPU on a number of common algorithms used for three-dimensional image processing. The algorithms were developed on a new software platform called "CUDA", which allows a direct translation from C code to the GPU. The implemented algorithms include spatial transformations, real-space and Fourier operations, as well as pattern recognition procedures, reconstruction algorithms and classification procedures. In our implementation, the direct porting of C code in the GPU achieves typical acceleration values in the order of 10-20 times compared to a state-of-the-art conventional processor, but they vary depending on the type of the algorithm. The gained speed-up comes with no additional costs, since the software runs on the GPU of the graphics card of common workstations.
GPU Accelerated Semiclassical Initial Value Representation Molecular Dynamics
Tamascelli, Dario; Conte, Riccardo; Ceotto, Michele
2013-01-01
This paper presents a graphics processing units (GPUs) implementation of the semiclassical initial value representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the CUDA implementation of the semiclassical code are provided. 4 molecules with an increasing number of atoms are considered and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (C2075 and K20) versus two CPUs (intel core i5 and Intel Xeon E5-2687W) shows that the CPU code scales linearly, whereas the GPU CUDA code roughly constantly for most of the trajectory range considered. Critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant and semiclassical GPU calculations are shown to be environment friendly.
Graphics processing units in bioinformatics, computational biology and systems biology.
Nobile, Marco S; Cazzaniga, Paolo; Tangherloni, Andrea; Besozzi, Daniela
2016-07-08
Several studies in Bioinformatics, Computational Biology and Systems Biology rely on the definition of physico-chemical or mathematical models of biological systems at different scales and levels of complexity, ranging from the interaction of atoms in single molecules up to genome-wide interaction networks. Traditional computational methods and software tools developed in these research fields share a common trait: they can be computationally demanding on Central Processing Units (CPUs), therefore limiting their applicability in many circumstances. To overcome this issue, general-purpose Graphics Processing Units (GPUs) are gaining an increasing attention by the scientific community, as they can considerably reduce the running time required by standard CPU-based software, and allow more intensive investigations of biological systems. In this review, we present a collection of GPU tools recently developed to perform computational analyses in life science disciplines, emphasizing the advantages and the drawbacks in the use of these parallel architectures. The complete list of GPU-powered tools here reviewed is available at http://bit.ly/gputools. © The Author 2016. Published by Oxford University Press.
Numerical simulation of lava flow using a GPU SPH model
Directory of Open Access Journals (Sweden)
Eugenio Rustico
2011-12-01
Full Text Available A smoothed particle hydrodynamics (SPH method for lava-flow modeling was implemented on a graphical processing unit (GPU using the compute unified device architecture (CUDA developed by NVIDIA. This resulted in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with free-surface, non-Newtonian fluids, and with phase change. The entire SPH code has three main components, neighbor list construction, force computation, and integration of the equation of motion, and it is computed on the GPU, fully exploiting the computational power. The simulation speed achieved is one to two orders of magnitude faster than the equivalent central processing unit (CPU code. This GPU implementation of SPH allows high resolution SPH modeling in hours and days, rather than in weeks and months, on inexpensive and readily available hardware.
Architecting the Finite Element Method Pipeline for the GPU.
Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
2014-02-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.
Efficient implementation of MrBayes on multi-GPU.
Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-06-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)(3)), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)(3) Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation a(MC)(3) (aMCMCMC) for MrBayes (MC)(3) on compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)(3) achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)(3) is dramatically faster than all the previous (MC)(3) algorithms and scales well to large GPU clusters.
GPU-based high performance Monte Carlo simulation in neutron transport
Energy Technology Data Exchange (ETDEWEB)
Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil). Lab. de Inteligencia Artificial Aplicada], e-mail: cmnap@ien.gov.br
2009-07-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
Energy Technology Data Exchange (ETDEWEB)
Yoon, Jong Seon; Choi, Hyoung Gwon [Seoul Nat’l Univ. of Science and Technology, Seoul (Korea, Republic of); Jeon, Byoung Jin [Yonsei Univ., Seoul (Korea, Republic of); Jung, Hye Dong [Korea Electronics Technology Institute, Seongnam (Korea, Republic of)
2016-09-15
A parallel algorithm of bi-conjugate gradient method was developed based on CUDA for parallel computation of the incompressible Navier-Stokes equations. The governing equations were discretized using splitting P2P1 finite element method. Asymmetric stenotic flow problem was solved to validate the proposed algorithm, and then the parallel performance of the GPU was examined by measuring the elapsed times. Further, the GPU performance for sparse matrix-vector multiplication was also investigated with a matrix of fluid-structure interaction problem. A kernel was generated to simultaneously compute the inner product of each row of sparse matrix and a vector. In addition, the kernel was optimized to improve the performance by using both parallel reduction and memory coalescing. In the kernel construction, the effect of warp on the parallel performance of the present CUDA was also examined. The present GPU computation was more than 7 times faster than the single CPU by double precision.
Implementation of Membrane Algorithms on GPU
Directory of Open Access Journals (Sweden)
Xingyi Zhang
2014-01-01
Full Text Available Membrane algorithms are a new class of parallel algorithms, which attempt to incorporate some components of membrane computing models for designing efficient optimization algorithms, such as the structure of the models and the way of communication between cells. Although the importance of the parallelism of such algorithms has been well recognized, membrane algorithms were usually implemented on the serial computing device central processing unit (CPU, which makes the algorithms unable to work in an efficient way. In this work, we consider the implementation of membrane algorithms on the parallel computing device graphics processing unit (GPU. In such implementation, all cells of membrane algorithms can work simultaneously. Experimental results on two classical intractable problems, the point set matching problem and TSP, show that the GPU implementation of membrane algorithms is much more efficient than CPU implementation in terms of runtime, especially for solving problems with a high complexity.
Research on Programming Based on CPU/GPU Heterogeneous Computing Cluster%基于CPU/GPU集群的编程的研究
Institute of Scientific and Technical Information of China (English)
刘钢锋
2013-01-01
随着微处理器技术的发展,GPU/CPU的混合计算已经成为是科学计算的主流趋势.本文从编程的层面,介绍了如何利用已有的并行编程语言来,调度GPU的计算功能,主要以MPI(一种消息传递编程模型)与基于GPU的CUDA(统一计算设备架构)编程模型相结合的方式进行GPU集群程序的测试,并分析了CPU/GPU集群并行环境下的运行特点.从分析的特点中总结出GPU集群较优策略,从而为提高CPU/GPU并行程序性能提供科学依据.%With the fast development in computer and microprocessor, Scientific Computing using CPU/GPU hybrid computing cluster has become a tendency. In this paper, from programming point of view, we propose the method of GPU scheduling to improve calculation efficiency. The main methods are through the combination of MPI (Message Passing Interface) and CUDA (Compute Unified Device Architecture) based on GPU to program. According to running condition of the parallel program, the characteristic of CPU/GPU hybrid computing cluster is analyzed. From the characteristic, the optimization strategy of parallel programs is found. So, the strategy will provide basis for improving the CPU/GPU parallel program.
GPU-accelerated Monte Carlo simulation of particle coagulation based on the inverse method
Wei, J.; Kruis, F. E.
2013-09-01
Simulating particle coagulation using Monte Carlo methods is in general a challenging computational task due to its numerical complexity and the computing cost. Currently, the lowest computing costs are obtained when applying a graphic processing unit (GPU) originally developed for speeding up graphic processing in the consumer market. In this article we present an implementation of accelerating a Monte Carlo method based on the Inverse scheme for simulating particle coagulation on the GPU. The abundant data parallelism embedded within the Monte Carlo method is explained as it will allow an efficient parallelization of the MC code on the GPU. Furthermore, the computation accuracy of the MC on GPU was validated with a benchmark, a CPU-based discrete-sectional method. To evaluate the performance gains by using the GPU, the computing time on the GPU against its sequential counterpart on the CPU were compared. The measured speedups show that the GPU can accelerate the execution of the MC code by a factor 10-100, depending on the chosen particle number of simulation particles. The algorithm shows a linear dependence of computing time with the number of simulation particles, which is a remarkable result in view of the n2 dependence of the coagulation.
A SURVEY PAPER ON SOLVING TSP USING ANT COLONY OPTIMIZATION ON GPU
Directory of Open Access Journals (Sweden)
Khushbu khatri
2015-10-01
Full Text Available Ant Colony Optimization (ACO is meta-heuristic algorithm inspired from nature to solve many combinatorial optimization problem such as Travelling Salesman Problem (TSP. There are many versions of ACO used to solve TSP like, Ant System, Elitist Ant System, Max-Min Ant System, Rank based Ant System algorithm. For improved performance, these methods can be implemented in parallel architecture like GPU, CUDA architecture. Graphics Processing Unit (GPU provides highly parallel and fully programmable platform. GPUs which have many processing units with an off-chip global memory can be used for general purpose parallel computation. This paper presents a survey on different solving TSP using ACO on GPU.
Institute of Scientific and Technical Information of China (English)
Shenghan Ren; Xueli Chen; Xu Cao; Shouping Zhu; Jimin Liang
2016-01-01
Simplified spherical harmonics approximation (SPN) equations are widely used in modeling light propagation in biological tissues.However,with the increase of order N,its computational burden will severely aggravate.We propose a graphics processing unit (GPU) accelerated framework for SPN equations.Compared with the conventional central processing unit implementation,an increased performance of the GPU framework is obtained with an increase in mesh size,with the best speed-up ratio of 25 among the studied cases.The influence of thread distribution on the performance of the GPU framework is also investigated.
A Survey Paper on Solving TSP using Ant Colony Optimization on GPU
Directory of Open Access Journals (Sweden)
Khushbu Khatri
2014-12-01
Full Text Available Ant Colony Optimization (ACO is meta-heuristic algorithm inspired from nature to solve many combinatorial optimization problem such as Travelling Salesman Problem (TSP. There are many versions of ACO used to solve TSP like, Ant System, Elitist Ant System, Max-Min Ant System, Rank based Ant System algorithm. For improved performance, these methods can be implemented in parallel architecture like GPU, CUDA architecture. Graphics Processing Unit (GPU provides highly parallel and fully programmable platform. GPUs which have many processing units with an off-chip global memory can be used for general purpose parallel computation. This paper presents a survey on different solving TSP using ACO on GPU.
Gpu Implementation of a Viscous Flow Solver on Unstructured Grids
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularities in scientific computing over past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculations, therefore a latest GPU card is preferable of which the peak computing performance and memory bandwidth are much better than a contemporary high-end CPU. We herein focus on the detailed implementation of our GPU targeting Reynolds-averaged Navier-Stokes equations solver based on finite-volume method. The solver employs a vertex-centered scheme on unstructured grids for the sake of being capable of handling complex topologies. Multiple optimizations are carried out to improve the memory accessing performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using explicit Runge-Kutta scheme. The solver with GPU acceleration in this paper is demonstrated to have competitive advantages over the CPU targeting one.
GPU General Computing：A Revolution in Computing%GPU通用计算：计算领域的一次革命
Institute of Scientific and Technical Information of China (English)
张宏泰
2011-01-01
随着GPU功能越来越强大，特别是CUDA的推出，在全世界范围内掀起了一场对GPU通用计算的研究热潮，本文在研究国内最新文献的基础上，从GPU通用计算的发展历程、架构优势、发展方向等方面对其进行了深入解读，提出了GPU通用计算发展普及的一些有效建议。%With more powerful GPU, especially the introduction of CUDA, a worldwide launched a study on the purpose GPU computing boom,this recent literature in the study of the country, based on the GPU general computing from the development process,architectural a
Komura, Yukihiro
2014-01-01
We present sample CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm. We deal with the classical spin models; the Ising model, the $q$-state Potts model, and the classical XY model. As for the lattice, both the 2D (square) lattice and the 3D (simple cubic) lattice are treated. We already reported the idea of the GPU implementation for 2D models [Comput. Phys. Commun. 183 (2012) 1155-1161]. We here explain the details of sample programs, and discuss the performance of the present GPU implementation for the 3D Ising and XY models. We also show the calculated results of the moment ratio for these models, and discuss phase transitions.
Parallel Implementation of Similarity Measures on GPU Architecture using CUDA
Directory of Open Access Journals (Sweden)
Kuldeep Yadav
2012-02-01
Full Text Available Image processing and pattern recognition algorithms take more time for execution on a single core processor. Graphics Processing Unit (GPU is more popular now-a-days due to their speed, programmability,low cost and more inbuilt execution cores in it. Most of the researchers started work to use GPUs as a processing unit with a single core computer system to speedup execution of algorithms and in the field of Content based medical image retrieval (CBMIR, Euclidean distance and Mahalanobis plays an important role in retrieval of images. Distance formula is important because it plays an important role in matching the images. In this research work, we parallelized Euclidean distance algorithm on CUDA. CPU with Intel® Dual-CoreE5500 @ 2.80GHz and 2.0 GB of main memory which run on Windows XP (SP2. The next step was to convert this code in GPU format i.e. to run this program on GPU NVIDIA GeForce series 9500GT model having 1023MB of video memory of DDR2 type and bus width of 64bit. The graphic driver we used is of 270.81 series of NVIDIA. In this paper both the CPU and GPU version of algorithm is being implemented on the MATLABR2010. The CPU version of the algorithm is being analyzed in simple MATLAB but the GPU version is being implemented with the help of intermediate software Jacket-win-1.3.0. For using Jacket, we have to make some changes in our source code so to make the CPU and GPU to work simultaneously and thus reducing the overall computational acceleration . Our work employs extensive usage of highly multithreaded architecture of multicored GPU. An efficient use of shared memory is required to optimize parallel reduction in Compute Unified Device Architecture (CUDA, Graphic Processing Units (GPUs are emerging as powerful parallel systems at a cheap cost of a few thousand rupees.
2015-06-01
Particle-to- Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method by Richard H Haney and Dale Shires......ARL-TR-7315 ● JUNE 2015 US Army Research Laboratory Analysis and Implementation of Particle-to- Particle (P2P) Graphics Processor
Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.
2016-10-01
We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high power lasers in mind, which require large size gain media, we have developed the open source code HASEonGPU that is capable of utilizing multiple graphic processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium size GPU cluster of 64 NVIDIA Tesla K20m GPUs and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Y b 3 + : Y AG ceramics show perfect agreement.
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
Nagaoka, Tomoaki; Watanabe, Soichi
2010-01-01
Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
High Performance GPU-Based Fourier Volume Rendering.
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its (N (2)logN) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are (N (3)) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
High Performance GPU-Based Fourier Volume Rendering
Directory of Open Access Journals (Sweden)
Marwan Abdellah
2015-01-01
Full Text Available Fourier volume rendering (FVR is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N2logN time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are O(N3 computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
Demidov, A.; Eschlböck-Fuchs, S.; Kazakov, A. Ya.; Gornushkin, I. B.; Kolmhofer, P. J.; Pedarnig, J. D.; Huber, N.; Heitz, J.; Schmid, T.; Rössler, R.; Panne, U.
2016-11-01
The improved Monte-Carlo (MC) method for standard-less analysis in laser induced breakdown spectroscopy (LIBS) is presented. Concentrations in MC LIBS are found by fitting model-generated synthetic spectra to experimental spectra. The current version of MC LIBS is based on the graphic processing unit (GPU) computation and reduces the analysis time down to several seconds per spectrum/sample. The previous version of MC LIBS which was based on the central processing unit (CPU) computation requested unacceptably long analysis times of 10's minutes per spectrum/sample. The reduction of the computational time is achieved through the massively parallel computing on the GPU which embeds thousands of co-processors. It is shown that the number of iterations on the GPU exceeds that on the CPU by a factor > 1000 for the 5-dimentional parameter space and yet requires > 10-fold shorter computational time. The improved GPU-MC LIBS outperforms the CPU-MS LIBS in terms of accuracy, precision, and analysis time. The performance is tested on LIBS-spectra obtained from pelletized powders of metal oxides consisting of CaO, Fe2O3, MgO, and TiO2 that simulated by-products of steel industry, steel slags. It is demonstrated that GPU-based MC LIBS is capable of rapid multi-element analysis with relative error between 1 and 10's percent that is sufficient for industrial applications (e.g. steel slag analysis). The results of the improved GPU-based MC LIBS are positively compared to that of the CPU-based MC LIBS as well as to the results of the standard calibration-free (CF) LIBS based on the Boltzmann plot method.
Komura, Yukihiro; Okabe, Yutaka
2016-03-01
We present new versions of sample CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm. In this update, we add the method of GPU-based cluster-labeling algorithm without the use of conventional iteration (Komura, 2015) to those programs. For high-precision calculations, we also add a random-number generator in the cuRAND library. Moreover, we fix several bugs and remove the extra usage of shared memory in the kernel functions.
Unit 03 - Introduction to Computers
Unit 74, CC in GIS; National Center for Geographic Information and Analysis
1990-01-01
This unit provides a brief introduction to computer hardware and software. It discusses binary notation, the ASCII coding system and hardware components including the central processing unit (CPU), memory, peripherals and storage media. Software including operating systems, word processors database packages, spreadsheets and statistical packages are briefly described.
gPGA: GPU Accelerated Population Genetics Analyses.
Directory of Open Access Journals (Sweden)
Chunbao Zhou
Full Text Available The isolation with migration (IM model is important for studies in population genetics and phylogeography. IM program applies the IM model to genetic data drawn from a pair of closely related populations or species based on Markov chain Monte Carlo (MCMC simulations of gene genealogies. But computational burden of IM program has placed limits on its application.With strong computational power, Graphics Processing Unit (GPU has been widely used in many fields. In this article, we present an effective implementation of IM program on one GPU based on Compute Unified Device Architecture (CUDA, which we call gPGA.Compared with IM program, gPGA can achieve up to 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA.
Optimizing Performance of Scientific Visualization Software to Support Frontier-Class Computations
2015-08-01
assistance with accessing graphics processing unit ( GPU )- enabled nodes on the HPC utility server systems via the Portable Batch System (PBS) batch job... graphics processing unit ( GPU )-enabled and large memory compute nodes. The EnSight client will run on the first allocated node (which is the graphics ...Defense DR Clients distributed rendering clients GPU graphics processing unit HPC high-performance computing HPCMDC High-Performance Computing
Institute of Scientific and Technical Information of China (English)
王芳; 秦磊华
2016-01-01
While raytracing, the screen image pixel is decomposed into the combination of radiance and texture of the patches, created as scene objects intersect with the casting ray. The radiance of each patch is calculated at the linear combination of the bases of bi-directional reflectance distribution function (BRDF), and able to be accelerated by graphics processing unit (GPU) parallel rendering. This paper presents a global illumination rendering algorithm based on BRDF and GPU parallel computation. With GPU parallel acceleration, through improving the efficiency of rendering, the algorithm achieves global illumination real-time rendering of the scene including dynamic interactive material. The key research: object surface’s multiple reflection characteristic is represented by linear combination of the basis of BRDF, so transforming the nonlinear problem to a linear one, thus improve the rendering efficiency. With GPU parallel acceleration, the algorithm calculates the object surface’s radiation energy and texture mapping and their linear combination, further improving the efficiency of rendering to meet the requirement of real-time.%基于光线追踪，将屏幕图像像素分解为投射光线与场景对象交点面片辐射亮度和纹理贴图的合成，每个面片的辐射亮度计算基于双向反射分布函数(BRDF)基的线性组合，并通过图形处理器(GPU)处理核心并行绘制进行加速，最后与并行计算的纹理映射结果进行合成。提出了一种基于BRDF和GPU 并行计算的全局光照实时渲染算法，利用GPU并行加速，在提高绘制效率的前提下，实现动态交互材质的全局光照实时渲染。重点研究：对象表面对光线的多次反射用BRDF基的线性组合来表示，将非线性问题转换为线性问题，从而提高绘制效率；利用GPU并行加速，分别计算对象表面光辐射能量和纹理映射及其线性组合，进一步提高计算效率满足实时绘制需求。
Triana-Martinez, J.; Orjuela-Vargas, S. A.; Philips, W.
2013-03-01
This paper compares the speed performance of a set of classic image algorithms for evaluating texture in images by using CUDA programming. We include a summary of the general program mode of CUDA. We select a set of texture algorithms, based on statistical analysis, that allow the use of repetitive functions, such as the Coocurrence Matrix, Haralick features and local binary patterns techniques. The memory allocation time between the host and device memory is not taken into account. The results of this approach show a comparison of the texture algorithms in terms of speed when executed on CPU and GPU processors. The comparison shows that the algorithms can be accelerated more than 40 times when implemented using CUDA environment.
Accelerated GPU based SPECT Monte Carlo simulations
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99m Tc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency
Accelerated GPU based SPECT Monte Carlo simulations.
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-07
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: (99m) Tc, (111)In and (131)I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational
GPU-Boosted Camera-Only Indoor Localization
DEFF Research Database (Denmark)
Özkil, Ali Gürcan; Fan, Zhun; Kristensen, Jens Klæstrup
relies on local image features detection, description and matching; by parallelizing these computationally intensive tasks on the graphical processing unit (GPU), it is possible to do online localization using a Topometric Appearance Map. The method is developed as an integral part of a mobile service...
Molecular Dynamics Simulation of Macromolecules Using Graphics Processing Unit
Xu, Ji; Ge, Wei; Yu, Xiang; Yang, Xiaozhen; Li, Jinghai
2010-01-01
Molecular dynamics (MD) simulation is a powerful computational tool to study the behavior of macromolecular systems. But many simulations of this field are limited in spatial or temporal scale by the available computational resource. In recent years, graphics processing unit (GPU) provides unprecedented computational power for scientific applications. Many MD algorithms suit with the multithread nature of GPU. In this paper, MD algorithms for macromolecular systems that run entirely on GPU are presented. Compared to the MD simulation with free software GROMACS on a single CPU core, our codes achieve about 10 times speed-up on a single GPU. For validation, we have performed MD simulations of polymer crystallization on GPU, and the results observed perfectly agree with computations on CPU. Therefore, our single GPU codes have already provided an inexpensive alternative for macromolecular simulations on traditional CPU clusters and they can also be used as a basis to develop parallel GPU programs to further spee...
GPU computing with OpenCL to model 2D elastic wave propagation: exploring memory usage
Iturrarán-Viveros, Ursula; Molero-Armenta, Miguel
2015-01-01
Graphics processing units (GPUs) have become increasingly powerful in recent years. Programs exploring the advantages of this architecture could achieve large performance gains and this is the aim of new initiatives in high performance computing. The objective of this work is to develop an efficient tool to model 2D elastic wave propagation on parallel computing devices. To this end, we implement the elastodynamic finite integration technique, using the industry open standard open computing language (OpenCL) for cross-platform, parallel programming of modern processors, and an open-source toolkit called [Py]OpenCL. The code written with [Py]OpenCL can run on a wide variety of platforms; it can be used on AMD or NVIDIA GPUs as well as classical multicore CPUs, adapting to the underlying architecture. Our main contribution is its implementation with local and global memory and the performance analysis using five different computing devices (including Kepler, one of the fastest and most efficient high performance computing technologies) with various operating systems.
Setiani, Tia Dwi; Suprijadi, Haryanto, Freddy
2016-03-01
Monte Carlo (MC) is one of the powerful techniques for simulation in x-ray imaging. MC method can simulate the radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the codes based on MC algorithm that are widely used for radiographic images simulation is MC-GPU, a codes developed by Andrea Basal. This study was aimed to investigate the time computation of x-ray imaging simulation in GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters to the quality of radiographic images and the comparison of image quality resulted from simulation in the GPU and CPU are evaluated in this paper. The simulations were run in CPU which was simulated in serial condition, and in two GPU with 384 cores and 2304 cores. In simulation using GPU, each cores calculates one photon, so, a large number of photon were calculated simultaneously. Results show that the time simulations on GPU were significantly accelerated compared to CPU. The simulations on the 2304 core of GPU were performed about 64 -114 times faster than on CPU, while the simulation on the 384 core of GPU were performed about 20 - 31 times faster than in a single core of CPU. Another result shows that optimum quality of images from the simulation was gained at the history start from 108 and the energy from 60 Kev to 90 Kev. Analyzed by statistical approach, the quality of GPU and CPU images are relatively the same.
Image Volume Rendering based on GPU Computing%基于GPU通用计算的图像体绘制
Institute of Scientific and Technical Information of China (English)
吴井胜; 鲍旭东
2008-01-01
基于GPU(graphic processing unit)的体绘制是体视化技术研究的重要分支.应用GPU通用计算改进基于GPU的图像体绘制,在体积图像处理、代理几何面生成、代理几何面渲染等体绘制过程中使用GPU通用计算技术,以提高绘制效率,改善图像质量.实验证明,基于GPU通用计算的体绘制在交互性能和绘制效果方面均表现良好.
Parallel generation of architecture on the GPU
Steinberger, Markus
2014-05-01
In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.
Fully 3D GPU PET reconstruction
Energy Technology Data Exchange (ETDEWEB)
Herraiz, J.L., E-mail: joaquin@nuclear.fis.ucm.es [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Espana, S. [Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA (United States); Cal-Gonzalez, J. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Vaquero, J.J. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Desco, M. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Unidad de Medicina y Cirugia Experimental, Hospital General Universitario Gregorio Maranon, Madrid (Spain); Udias, J.M. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain)
2011-08-21
Fully 3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Unit (GPU) has been proposed for many years as potential accelerators in complex scientific problems, but it has not been used until the recent advances in the programmability of GPUs that the best available reconstruction codes have started to be implemented to be run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.
GPU-based prompt gamma ray imaging from boron neutron capture therapy
Energy Technology Data Exchange (ETDEWEB)
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae, E-mail: suhsanta@catholic.ac.kr [Department of Biomedical Engineering and Research Institute of Biomedical Engineering, College of Medicine, Catholic University of Korea, Seoul 505 137-701 (Korea, Republic of); Jo Hong, Key [Molecular Imaging Program at Stanford (MIPS), Department of Radiology, Stanford University, 300 Pasteur Drive, Stanford, California 94305 (United States); Sil Lee, Keum [Department of Radiation Oncology, Stanford University School of Medicine, 875 Blake Wilbur Drive, Stanford, California 94305-5847 (United States)
2015-01-15
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
Multi-GPU adaptation of a simulator of heart electric activity
Directory of Open Access Journals (Sweden)
Víctor M. García
2013-12-01
Full Text Available The simulation of the electrical activity of the heart is calculated by solving a large system of ordinary differential equations; this takes an enormous amount of computation time. In recent years graphics processing unit (GPU are being introduced in the field of high performance computing. These powerful computing devices have attracted research groups requiring simulate the electrical activity of the heart. The research group signing this paper has developed a simulator of cardiac electrical activity that runs on a single GPU. This article describes the adaptation and modification of the simulator to run on multiple GPU. The results confirm that the technique significantly reduces the execution time compared to those obtained with a single GPU, and allows the solution of larger problems.
GPU-Acceleration of Parallel Unconditionally Stable Group Explicit Finite Difference Method
Parand, K.; Zafarvahedian, Saeed; Hossayni, Sayyed A.
2013-01-01
Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Once, researchers and practitioners noticed the potential of using GPU for general purposes, GPUs applications have been extended from graphics applications to other fields. The main objective of this paper is to evaluate the impact of using GPU in solution of the transient diffusion type equation by parallel and stable group explicit finite...
Medical image processing on the GPU - past, present and future.
Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M
2013-12-01
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
Implementation of a Parallel Tree Method on a GPU
Nakasato, Naohito
2011-01-01
The kd-tree is a fundamental tool in computer science. Among other applications, the application of kd-tree search (by the tree method) to the fast evaluation of particle interactions and neighbor search is highly important, since the computational complexity of these problems is reduced from O(N^2) for a brute force method to O(N log N) for the tree method, where N is the number of particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We present a detailed description of how we have implemented the tree method on a Cypress GPU. An optimization that we found important is localized particle ordering to effectively utilize cache memory. We present a number of test results and performance measurements. Our results show that the execution of the tree traversal in a force calculation on a GPU is practical and efficient.
Real-Time Computation of Parameter Fitting and Image Reconstruction Using Graphical Processing Units
Locans, Uldis; Suter, Andreas; Fischer, Jannis; Lustermann, Werner; Dissertori, Gunther; Wang, Qiulin
2016-01-01
In recent years graphical processing units (GPUs) have become a powerful tool in scientific computing. Their potential to speed up highly parallel applications brings the power of high performance computing to a wider range of users. However, programming these devices and integrating their use in existing applications is still a challenging task. In this paper we examined the potential of GPUs for two different applications. The first application, created at Paul Scherrer Institut (PSI), is used for parameter fitting during data analysis of muSR (muon spin rotation, relaxation and resonance) experiments. The second application, developed at ETH, is used for PET (Positron Emission Tomography) image reconstruction and analysis. Applications currently in use were examined to identify parts of the algorithms in need of optimization. Efficient GPU kernels were created in order to allow applications to use a GPU, to speed up the previously identified parts. Benchmarking tests were performed in order to measure the ...
Institute of Scientific and Technical Information of China (English)
龚曙光; 刘奇良; 卢海山; 周志勇; 张佳
2015-01-01
针对无网格 Galerkin 法计算耗时的问题，采用逐节点对法来组装刚度矩阵、共轭梯度法求解基于 CSR 格式存储的稀疏线性方程组，提出了一种利用罚函数法施加本质边界条件的 EFG 法 GPU 加速并行算法，给出了刚度矩阵和惩罚刚度矩阵的统一格式，以及 GPU 加速并行算法的流程图。编写了基于 CUDA 构架平台的 GPU 程序，且在 NVIDIA GeForce GTX 660显卡上通过数值算例对所提算法进行了性能测试与分析比较，探讨了影响加速比的因素。算例结果验证了所提算法的可行性，并在满足计算精度的前提下，其加速比最大可达17倍；同时线性方程组的求解对加速比起决定性影响。%In order to reduce the computing time of Element-Free Galerkin(EFG)method,a GPU accele-ration parallel algorithm of EFG method that essential boundary condition is imposed by penalty function method is proposed,in which stiffness matrix is assembled by node pair-wise approach,and sparse linear equations based on CSR format is solved by conjugate gradient methods.The unified format of stiffness matrix and penalty stiffness matrix was derived,and the flow chart of the parallel algorithm was provided.The GPU codes were programmed on CUDA,and algorithm testing was finished on the device of NVIDIA GeForce GTX 660 by numerical examples.The factors of affecting speedup ratio were discussed.The example results verified the feasibility of the proposed algorithm.The maximum speedup ratio was up to 17 times on the premise that the calculating accuracy is met,and to solve linear equations is the major factor in the speedup.
Institute of Scientific and Technical Information of China (English)
朱佳丽; 黄茂海
2013-01-01
The Hi-GAL (Herschel infrared Galactic Plane Survey) images provide data with extraor-dinary spatial coverage and resolution for studying the FIR emission in the Galactic Plane. Graphics processing unit (GPU) parallel computing technologies are well suitable for accelerating processing and mining of this massive data. We illustrate the application of GPU parallel computing technolo-gies in two examples of Hi-GAL data processing. We spare the unnecessary physical details and focus on the method of using GPU in Herschel infrared data processing. In the first example, we demonstrate a simple and straightforward application of GPU parallel computing technologies by fitting the far-infrared spectral energy distribution of the dust continuum emission in the Hi-GAL l = 30◦field. There are over 3 × 105 pixels in image of the l = 30◦field. The fitting procedure for every pixel is performed in parallel by a GPU. Comparing the time-cost for fitting the entire image, the acceleration factor of the build-in GPU on a low performance laptop is 68, and a specialized GPU is 5513 times faster than a Xeon E5620 with one core. In the second example, we demonstrate a more sophisticated application of GPU parallel com-puting technologies. Based on the Hi-GAL l = 30◦field data, the distribution of molecular clouds derived from GRS (Galactic Ring Survey) data, and the properties of H II regions, we construct a 3D model of the interstellar medium to calculate the absorption of dust grains associated with molecular clouds. The resolution of the 3D model is 100 pc × 0.2◦× 0.2◦for a spherical grid. For this reso-lution, there are 4493 cells in total responsible for absorbing FUV photons. The absorption of these cells is calculated in parallel by a GPU. The resulted absorption is then compared with observations using Monte Carlo fitting method. In every iteration, CPU samples the free parameters and computes the goodness of fitting. The GPU part of the calculation is 95%of the
Accelerate micromagnetic simulations with GPU programming in MATLAB
Zhu, Ru
2015-01-01
A finite-difference Micromagnetic simulation code written in MATLAB is presented with Graphics Processing Unit (GPU) acceleration. The high performance of Graphics Processing Unit (GPU) is demonstrated compared to a typical Central Processing Unit (CPU) based code. The speed-up of GPU to CPU is shown to be greater than 30 for problems with larger sizes on a mid-end GPU in single precision. The code is less than 200 lines and suitable for new algorithm developing.
Accelerate micromagnetic simulations with GPU programming in MATLAB
Zhu, Ru
2015-01-01
A finite-difference Micromagnetic simulation code written in MATLAB is presented with Graphics Processing Unit (GPU) acceleration. The high performance of Graphics Processing Unit (GPU) is demonstrated compared to a typical Central Processing Unit (CPU) based code. The speed-up of GPU to CPU is shown to be greater than 30 for problems with larger sizes on a mid-end GPU in single precision. The code is less than 200 lines and suitable for new algorithm developing.
How General-Purpose can a GPU be?
Directory of Open Access Journals (Sweden)
Philip Machanick
2015-12-01
Full Text Available The use of graphics processing units (GPUs in general-purpose computation (GPGPU is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw from a range of single instruction multiple datastream (SIMD architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD instruction sets has been of application on a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. This paper proposes a reconceptualization of the GPU as a multicore design with minimal exotic modes of parallelism so as to make GPGPU truly general.
GPU-based computational adaptive optics for volumetric optical coherence microscopy
Tang, Han; Mulligan, Jeffrey A.; Untracht, Gavrielle R.; Zhang, Xihao; Adie, Steven G.
2016-03-01
Optical coherence tomography (OCT) is a non-invasive imaging technique that measures reflectance from within biological tissues. Current higher-NA optical coherence microscopy (OCM) technologies with near cellular resolution have limitations on volumetric imaging capabilities due to the trade-offs between resolution vs. depth-of-field and sensitivity to aberrations. Such trade-offs can be addressed using computational adaptive optics (CAO), which corrects aberration computationally for all depths based on the complex optical field measured by OCT. However, due to the large size of datasets plus the computational complexity of CAO and OCT algorithms, it is a challenge to achieve high-resolution 3D-OCM reconstructions at speeds suitable for clinical and research OCM imaging. In recent years, real-time OCT reconstruction incorporating both dispersion and defocus correction has been achieved through parallel computing on graphics processing units (GPUs). We add to these methods by implementing depth-dependent aberration correction for volumetric OCM using plane-by-plane phase deconvolution. Following both defocus and aberration correction, our reconstruction algorithm achieved depth-independent transverse resolution of 2.8 um, equal to the diffraction-limited focal plane resolution. We have translated the CAO algorithm to a CUDA code implementation and tested the speed of the software in real-time using two GPUs - NVIDIA Quadro K600 and Geforce TITAN Z. For a data volume containing 4096×256×256 voxels, our system's processing speed can keep up with the 60 kHz acquisition rate of the line-scan camera, and takes 1.09 seconds to simultaneously update the CAO correction for 3 en face planes at user-selectable depths.
GPU-S2S:面向GPU的源到源翻译转化%GPU-S2S: a source to source compiler for GPU
Institute of Scientific and Technical Information of China (English)
李丹; 曹海军; 董小社; 张保
2012-01-01
To address the problem of poor software portability and programmability of a graphic processing unit ( GPU ) , and to facilitate the development of parallel programs on GPU, this study proposed a novel directive based compiler guided approach, and then the GPU-S2S, a prototypic tool for automatic source-to-source translation, was implemented through combining automatic mapping with static compilation configuration, which is capable of translating a C sequential program with directives into a compute unified device architecture (CUDA) program. The experimental results show that CUDA codes generated by the GPU-S2S can achieve comparable performance to that of CUDA benchmarks provided by NVIDIA CUDA SDK, and have significant performance improvements compared to its original C sequential codes.%针对图形处理器(GPU)架构下的软件可移植性、可编程性差的问题,为了便于在GPU上开发并行程序,通过自动映射与静态编译相结合,提出了一种新的基于制导语句控制的编译优化方法,实现了一个源到源的自动转化工具GPU-S2S,它能够将插入了制导语句的串行C程序转化为统一计算架构(CUDA)程序.实验结果表明,经GPU-S2S转化生成的代码和英伟达(NVIDIA)提供的基准测试代码具有相当的性能；与原串行程序在CPU上执行相比,转换后的并行程序在GPU上能够获取显著的性能提升.
2013-01-01
CUDA * Optimal employment of GPU memory...the GPU using the stream construct within CUDA . Using this technique, a small amount of...input tile data is sent to the GPU initially. Then, while the CUDA kernels process
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.
Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan
2012-01-01
Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
Directory of Open Access Journals (Sweden)
G Boroni
2017-03-01
Full Text Available Lattice Boltzmann Method (LBM has shown great potential in fluid simulations, but performance issues and difficulties to manage complex boundary conditions have hindered a wider application. The upcoming of Graphic Processing Units (GPU Computing offered a possible solution for the performance issue, and methods like the Immersed Boundary (IB algorithm proved to be a flexible solution to boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation in GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data showing good agreement and improving the computational time. Substantial reductions of calculation rates were achieved, lowering down the required time to execute the same model in a CPU to about two magnitude orders.
Wilson, J Adam; Williams, Justin C
2009-01-01
The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
Lee, Sungyoung; Kwon, Min-Seok; Park, Taesung
2014-01-01
In genome-wide association studies (GWAS), regression analysis has been most commonly used to establish an association between a phenotype and genetic variants, such as single nucleotide polymorphism (SNP). However, most applications of regression analysis have been restricted to the investigation of single marker because of the large computational burden. Thus, there have been limited applications of regression analysis to multiple SNPs, including gene-gene interaction (GGI) in large-scale GWAS data. In order to overcome this limitation, we propose CARAT-GxG, a GPU computing system-oriented toolkit, for performing regression analysis with GGI using CUDA (compute unified device architecture). Compared to other methods, CARAT-GxG achieved almost 700-fold execution speed and delivered highly reliable results through our GPU-specific optimization techniques. In addition, it was possible to achieve almost-linear speed acceleration with the application of a GPU computing system, which is implemented by the TORQUE Resource Manager. We expect that CARAT-GxG will enable large-scale regression analysis with GGI for GWAS data.
Betatron tune measurement with the LHC damper using a GPU
Dubouchet, Frédéric; Höfle, Wolfgang
This thesis studies a possible futur implementation of a betatron tune measure- ment in the Large Hadron Collider (LHC) at European organization for nuclear research (CERN) using a General Purpose Graphic Processing Unit (GPGPU) to analyse data acquired with the LHC transverse transverse damper (ADT). The present hardware and future possible implementations using ADT acquisi- tions and Graphic Processing Unit (GPU) computing are described. The ADT data have to be processed to extract the betatron tune. To compute the tune, the signal is transformed from the time domain to the frequency domain using Fast Fourier Transform (FFT) on GPUs. We show that it is possible to achieve one order of magnitude faster FFTs on a Fermi generation GPU than what can be done using a i7 generation Central Processing Unit (CPU). This makes online per bunch FFT computation and betatron tune measurement possible.
Energy- and cost-efficient lattice-QCD computations using graphics processing units
Energy Technology Data Exchange (ETDEWEB)
Bach, Matthias
2014-07-01
Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large scale experiments like the Large Hadron Collider (LHC) at CERN and in the future at the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD for high energies. Studies from first principles are possible via a discretization onto an Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus - despite the complexity of LQCD applications - it led to the development of several specialized compute platforms and influenced the development of others. However, in recent years General-Purpose computation on Graphics Processing Units (GPGPU) came up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a massmarket product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D kernel for a single GPU, achieving 120GFLOPS. D - the most compute intensive kernel in LQCD simulations - is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system. LQCD
Mano, Omer; Clark, Damon A
2017-01-01
Sensory neuroscience seeks to understand and predict how sensory neurons respond to stimuli. Nonlinear components of neural responses are frequently characterized by the second-order Wiener kernel and the closely-related spike-triggered covariance (STC). Recent advances in data acquisition have made it increasingly common and computationally intensive to compute second-order Wiener kernels/STC matrices. In order to speed up this sort of analysis, we developed a graphics processing unit (GPU)-accelerated module that computes the second-order Wiener kernel of a system's response to a stimulus. The generated kernel can be easily transformed for use in standard STC analyses. Our code speeds up such analyses by factors of over 100 relative to current methods that utilize central processing units (CPUs). It works on any modern GPU and may be integrated into many data analysis workflows. This module accelerates data analysis so that more time can be spent exploring parameter space and interpreting data.
Identifying attributes of GPU programs for difficulty evaluation
Directory of Open Access Journals (Sweden)
Dale Tristram
2014-08-01
Full Text Available General-purpose computation on graphics processing units (GPGPU has great potential to accelerate many scientific models and algorithms. However, some problems are considerably more difficult to accelerate than others, and it may be challenging for those new to GPGPU to ascertain the difficulty of accelerating a particular problem. Through what was learned in the acceleration of three problems, problem attributes have been identified that can assist in the evaluation of the difficulty of accelerating a problem on a GPU. The identified attributes are a problem's available parallelism, inherent parallelism, synchronisation requirements, and data transfer requirements. We envisage that with further development, these attributes could form the foundation of a difficulty classification system that could be used to determine whether GPU acceleration is practical for a candidate GPU acceleration problem, aid in identifying appropriate techniques and optimisations, and outline the required GPGPU knowledge.
Directory of Open Access Journals (Sweden)
MEI, G.
2015-05-01
Full Text Available In the calculating of convex hulls for point sets, a preprocessing procedure that is to filter the input points by discarding non-extreme points is commonly used to improve the computational efficiency. We previously proposed a quite straightforward preprocessing approach for accelerating 2D convex hull computation on the GPU. In this paper, we extend that algorithm to being used in 3D cases. The basic ideas behind these two preprocessing algorithms are similar: first, several groups of extreme points are found according to the original set of input points and several rotated versions of the input set; then, a convex polyhedron is created using the found extreme points; and finally those interior points locating inside the formed convex polyhedron are discarded. Experimental results show that: when employing the proposed preprocessing algorithm, it achieves the speedups of about 4x on average and 5x to 6x in the best cases over the cases where the proposed approach is not used. In addition, more than 95 percent of the input points can be discarded in most experimental tests.
Fast GPU-based computation of the sensitivity matrix for a PET list-mode OSEM algorithm
Energy Technology Data Exchange (ETDEWEB)
Nassiri, Moulay Ali; Carrier, Jean-Francois [Montreal Univ., QC (Canada). Dept. de Radio-Oncologie; Hissoiny, Sami [Ecole Polytechnique de Montreal, QC (Canada). Dept. de Genie Informatique et Genie Logiciel; Despres, Philippe [Quebec Univ. (Canada). Dept. de Radio-Oncologie
2011-07-01
One of the obstacle in introducing a list-mode PET reconstruction algorithm for routine clinical use is the long computation time required for the sensitivity matrix calculation. This matrix must be computed for each study because it depends on the object attenuation map. During the last decade, studies have shown that 3D list-mode OSEM reconstruction algorithms could be effectively performed and considerably accelerated by GPU devices. However, most of that preliminary work (1) was done for pre-clinical PET systems in which the number of LORs is small compared to modern human PET systems and (2) supposed that the sensitivity matrix is pre-calculated. The time required to compute this matrix can however be longer than the reconstruction time itself. The objective of this work is to investigate the performance of sensitivity matrix calculations in terms of computation time with modern GPUs, for clinical fully 3D LM-OSEM for modern PET scanners. For this purpose, sensitivity matrix calculations and full list-mode OSEM reconstruction for human PET systems were implemented on GPUs using the CUDA framework. The system matrices were built on-the-fly by using the multi-ray Siddon algorithm. The time to compute the sensitivity matrix for 288 x 288 x 57 arrays using 3 tangential LORs was 29 seconds. The 3D LM-OSEM algorithm, including the sensitivity matrix calculation, was performed for the same LORs in 71 seconds for 62 millions events, 6 frames and 1 iterations. This work let envision fast reconstructions for advanced PET application such as dynamic studies and parametric image reconstruction. (orig.)
A graphics processing unit (GPU)-based real-time spherizing algorithm%基于GPU的实时球面化算法
Institute of Scientific and Technical Information of China (English)
黄建彪; 陈国华; 张爱军; 周厉颖
2013-01-01
分析了球面映射算法速度过慢的原因,针对传统插值计算中普遍存在的速度与精度相互制约的问题,改进了现有的基于立体投影的半球面纹理映射模型,提出了基于GPU的球面化算法,使用CUDA并行编程实现双线性插值的并行计算.球面化实验表明该算法在保证输出精度的前提下,可获得10倍左右的加速比,显著提高了计算速度,可用于实时性较高的应用场合.%The cause of the low speed of a sphere mapping algorithm has been analyzed. In order to reduce the interaction between speed and accuracy, which is common in traditional interpolation methods with existing sphere mapping algorithms, an improved hemisphere texture mapping model based on stereoscopic projection has been proposed , and a graphics processing unit ( GPU ) -based spherizing algorithm has been put forward, in which CUDA parallel programming was utilized to complete the parallel computing of bilinear interpolation. The experiments showed that the computing speed could be significantly increased with the new method, whilst ensuring output accuracy. The method gave a speedup factor of almost 10, and it could be employed in fast real-time applications.
Numerical simulation of lava flow using a GPU SPH model
Eugenio Rustico; Annamaria Vicari; Giuseppe Bilotta; Alexis Hérault; Ciro Del Negro
2011-01-01
A smoothed particle hydrodynamics (SPH) method for lava-flow modeling was implemented on a graphical processing unit (GPU) using the compute unified device architecture (CUDA) developed by NVIDIA. This resulted in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with free-surface, non- Newtonian fluids, and with phase change. The entire SPH code has three main components, neighbor list construction, force computation, an...
Survey of CPU/GPU Synergetic Parallel Computing%CPU/GPU协同并行计算研究综述
Institute of Scientific and Technical Information of China (English)
卢风顺; 宋君强; 银福康; 张理论
2011-01-01
CPU/GPU异构混合并行系统以其强劲计算能力、高性价比和低能耗等特点成为新型高性能计算平台,但其复杂体系结构为并行计算研究提出了巨大挑战.CPU/GPU协同并行计算属于新兴研究领域,是一个开放的课题.根据所用计算资源的规模将CPU/GPU协同并行计算研究划分为三类,尔后从立项依据、研究内容和研究方法等方面重点介绍了几个混合计算项目,并指出了可进一步研究的方向,以期为领域科学家进行协同并行计算研究提供一定参考.%With the features of tremendous capability, high performance/price ratio and low power, the heterogeneous hybrid CPU/GPU parallel systems have become the new high performance computing platforms. However, the architecture complexity of the hybrid system poses many challenges on the parallel algorithms design on the infrastructure. According to the scale of computational resources involved in the synergetic parallel computing, we classified the recent researches into three categories, detailed the motivations, methodologies and applications of several projects, and discussed some on-going research issues in this direction in the end. We hope the domain experts can gain useful information about synergetic parallel computing from this work.
Directory of Open Access Journals (Sweden)
Christley Scott
2010-08-01
Full Text Available Abstract Background Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a
Optimization strategies for parallel CPU and GPU implementations of a meshfree particle method
Domínguez, Jose M; Gómez-Gesteira, Moncho
2011-01-01
Much of the current focus in high performance computing (HPC) for computational fluid dynamics (CFD) deals with grid based methods. However, parallel implementations for new meshfree particle methods such as Smoothed Particle Hydrodynamics (SPH) are less studied. In this work, we present optimizations for both central processing unit (CPU) and graphics processing unit (GPU) of a SPH method. These optimization strategies can be further applied to many other meshfree methods. The obtained performance for each architecture and a comparison between the most efficient implementations for CPU and GPU are shown.
Institute of Scientific and Technical Information of China (English)
吴建松; 张辉; 杨锐
2013-01-01
This paper applies the meshfree Smoothed Particle Hydrodynamics (SPH) method with Graphical Processing Unit (GPU) parallel computing technique to investigate the highly complex 3-D dam-break flow in urban areas including underground spaces. Taking the advantage of GPUs parallel computing techniques, simulations involving more than 107 particles can be achieved. We use a virtual geometric plane boundary to handle the outermost solid wall in order to save considerable video card memory for the GPU computing. To evaluate the accuracy of the new GPU-based SPH model, qualitative and quantitative comparison to a real flooding experiment is performed and the results of a numerical model based on Shallow Water Equations (SWEs) is given with good accu- racy. With the new GPU-based SPH model, the effects of the building layouts and underground spaces on the propagation of dam- break flood through an intricate city layout are examined.
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T
2011-11-21
We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9-67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning.
Parallelization of MODFLOW using a GPU library.
Ji, Xiaohui; Li, Dandan; Cheng, Tangpei; Wang, Xu-Sheng; Wang, Qun
2014-01-01
A new method based on a graphics processing unit (GPU) library is proposed in the paper to parallelize MODFLOW. Two programs, GetAb_CG and CG_GPU, have been developed to reorganize the equations in MODFLOW and solve them with the GPU library. Experimental tests using the NVIDIA Tesla C1060 show that a 1.6- to 10.6-fold speedup can be achieved for models with more than 10(5) cells. The efficiency can be further improved by using up-to-date GPU devices.
GPU-BSM: a GPU-based tool to map bisulfite-treated reads.
Directory of Open Access Journals (Sweden)
Andrea Manconi
Full Text Available Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.
GPU-BSM: a GPU-based tool to map bisulfite-treated reads.
Manconi, Andrea; Orro, Alessandro; Manca, Emanuele; Armano, Giuliano; Milanesi, Luciano
2014-01-01
Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.
Hidalgo, R.C.; Kanzaki, T.; Alonso-Marroquin, F.; Luding, S.; Yu, A.; Dong, K.; Yang, R.; Luding, S.
2013-01-01
General-purpose computation on Graphics Processing Units (GPU) on personal computers has recently become an attractive alternative to parallel computing on clusters and supercomputers. We present the GPU-implementation of an accurate molecular dynamics algorithm for a system of spheres. The new hybr
图形处理器在通用计算中的应用%Application of graphics processing unit in general purpose computation
Institute of Scientific and Technical Information of China (English)
张健; 陈瑞
2009-01-01
基于图形处理器(GPU)的计算统一设备体系结构(compute unified device architecture,CUDA)构架,阐述了GPU用于通用计算的原理和方法.在Geforce8800GT下,完成了矩阵乘法运算实验.实验结果表明,随着矩阵阶数的递增,无论是GPU还是CPU处理,速度都在减慢.数据增加100倍后,GPU上的运算时间仅增加了3.95倍,而CPU的运算时间增加了216.66倍.%Based on the CUDA (compute unified device architecture) of GPU (graphics processing unit), the technical fundamentals and methods for general purpose computation on GPU are introduced. The algorithm of matrix multiplication is simulated on Geforce8800 GT. With the increasing of matrix order, algorithm speed is slowed either on CPU or on GPU. After the data quantity increases to 100 times, the operation time only increased in 3.95 times on GPU, and 216.66 times on CPU.
Fast and maliciously secure two-party computation using the GPU
DEFF Research Database (Denmark)
Frederiksen, Tore Kasper; Nielsen, Jesper Buus
2013-01-01
We describe, and implement, a maliciously secure protocol for two-party computation in a parallel computational model. Our protocol is based on Yao’s garbled circuit and an efficient OT extension. The implementation is done using CUDA and yields fast results for maliciously secure two-party compu......We describe, and implement, a maliciously secure protocol for two-party computation in a parallel computational model. Our protocol is based on Yao’s garbled circuit and an efficient OT extension. The implementation is done using CUDA and yields fast results for maliciously secure two...
Choi, Sunghwan; Kwon, Oh-Kyoung; Kim, Jaewook; Kim, Woo Youn
2016-09-15
We investigated the performance of heterogeneous computing with graphics processing units (GPUs) and many integrated core (MIC) with 20 CPU cores (20×CPU). As a practical example toward large scale electronic structure calculations using grid-based methods, we evaluated the Hartree potentials of silver nanoparticles with various sizes (3.1, 3.7, 4.9, 6.1, and 6.9 nm) via a direct integral method supported by the sinc basis set. The so-called work stealing scheduler was used for efficient heterogeneous computing via the balanced dynamic distribution of workloads between all processors on a given architecture without any prior information on their individual performances. 20×CPU + 1GPU was up to ∼1.5 and ∼3.1 times faster than 1GPU and 20×CPU, respectively. 20×CPU + 2GPU was ∼4.3 times faster than 20×CPU. The performance enhancement by CPU + MIC was considerably lower than expected because of the large initialization overhead of MIC, although its theoretical performance is similar with that of CPU + GPU. © 2016 Wiley Periodicals, Inc.
The experience of GPU calculations at Lunarc
Sjöström, Anders; Lindemann, Jonas; Church, Ross
2011-09-01
To meet the ever increasing demand for computational speed and use of ever larger datasets, multi GPU instal- lations look very tempting. Lunarc and the Theoretical Astrophysics group at Lund Observatory collaborate on a pilot project to evaluate and utilize multi-GPU architectures for scientific calculations. Starting with a small workshop in 2009, continued investigations eventually lead to the procurement of the GPU-resource Timaeus, which is a four-node eight-GPU cluster with two Nvidia m2050 GPU-cards per node. The resource is housed within the larger cluster Platon and share disk-, network- and system resources with that cluster. The inaugu- ration of Timaeus coincided with the meeting "Computational Physics with GPUs" in November 2010, hosted by the Theoretical Astrophysics group at Lund Observatory. The meeting comprised of a two-day workshop on GPU-computing and a two-day science meeting on using GPUs as a tool for computational physics research, with a particular focus on astrophysics and computational biology. Today Timaeus is used by research groups from Lund, Stockholm and Lule in fields ranging from Astrophysics to Molecular Chemistry. We are investigating the use of GPUs with commercial software packages and user supplied MPI-enabled codes. Looking ahead, Lunarc will be installing a new cluster during the summer of 2011 which will have a small number of GPU-enabled nodes that will enable us to continue working with the combination of parallel codes and GPU-computing. It is clear that the combination of GPUs/CPUs is becoming an important part of high performance computing and here we will describe what has been done at Lunarc regarding GPU-computations and how we will continue to investigate the new and coming multi-GPU servers and how they can be utilized in our environment.
Choi, Sunghoon; Lee, Seungwan; Lee, Haenghwa; Lee, Donghoon; Choi, Seungyeon; Shin, Jungwook; Seo, Chang-Woo; Kim, Hee-Joung
2017-03-01
Digital tomosynthesis offers the advantage of low radiation doses compared to conventional computed tomography (CT) by utilizing small numbers of projections ( 80) acquired over a limited angular range. It produces 3D volumetric data, although there are artifacts due to incomplete sampling. Based upon these characteristics, we developed a prototype digital tomosynthesis R/F system for applications in chest imaging. Our prototype chest digital tomosynthesis (CDT) R/F system contains an X-ray tube with high power R/F pulse generator, flat-panel detector, R/F table, electromechanical radiographic subsystems including a precise motor controller, and a reconstruction server. For image reconstruction, users select between analytic and iterative reconstruction methods. Our reconstructed images of Catphan700 and LUNGMAN phantoms clearly and rapidly described the internal structures of phantoms using graphics processing unit (GPU) programming. Contrast-to-noise ratio (CNR) values of the CTP682 module of Catphan700 were higher in images using a simultaneous algebraic reconstruction technique (SART) than in those using filtered back-projection (FBP) for all materials by factors of 2.60, 3.78, 5.50, 2.30, 3.70, and 2.52 for air, lung foam, low density polyethylene (LDPE), Delrin® (acetal homopolymer resin), bone 50% (hydroxyapatite), and Teflon, respectively. Total elapsed times for producing 3D volume were 2.92 s and 86.29 s on average for FBP and SART (20 iterations), respectively. The times required for reconstruction were clinically feasible. Moreover, the total radiation dose from our system (5.68 mGy) was lower than that of conventional chest CT scan. Consequently, our prototype tomosynthesis R/F system represents an important advance in digital tomosynthesis applications.
Optimizing a mobile robot control system using GPU acceleration
Tuck, Nat; McGuinness, Michael; Martin, Fred
2012-01-01
This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
Institute of Scientific and Technical Information of China (English)
赵旭东; 梁书秀; 孙昭晨; 刘忠波; 韩松林; 任喜峰
2014-01-01
应用非结构化网格建立水动力模型目前已经得到了广泛的应用。针对在网格数过多，且无集群机情况下难以快速获得计算结果这一问题，基于 GPU的高性能计算技术，在 CUDA开发平台下设计并行算法，建立非结构化网格的二维水动力模型。与利用 GTX460显卡和集群机的计算效率对比表明，在保持计算精度的前提下，速度提升了一个量级，且随着网格数的持续递增，可以保持较高的加速比增幅，比较适合应用于大范围海域的水动力模型的数值计算。%Unstructured grid has been widely used to establish hydrodynamic models.To rapidly get the calculation results without a cluster when the number of computational grids is too large,a high-performance computing technology which is based on GPU (graphic processing unit),is adopted to design a parallel algorithm and establish a 2D unstructured grid hydrodynamic model on CUDA (compute unified device architecture ) development platform. Through the comparisons to the computational efficiency of a cluster and the Graphic Card of GTX460,the advantages of GPU method are confirmed:the speedup ratio can reach more than 10 times and maintain a high growth as the increase of the number of computational grids.It is suitable for the numerical simulation of large-domain hydrodynamic models.
ALICE HLT high speed tracking on GPU
Gorbunov, Sergey; Aamodt, Kenneth; Alt, Torsten; Appelshauser, Harald; Arend, Andreas; Bach, Matthias; Becker, Bruce; Bottger, Stefan; Breitner, Timo; Busching, Henner; Chattopadhyay, Sukalyan; Cleymans, Jean; Cicalo, Corrado; Das, Indranil; Djuvsland, Oystein; Engel, Heiko; Erdal, Hege Austrheim; Fearick, Roger; Haaland, Oystein Senneset; Hille, Per Thomas; Kalcher, Sebastian; Kanaki, Kalliopi; Kebschull, Udo Wolfgang; Kisel, Ivan; Kretz, Matthias; Lara, Camillo; Lindal, Sven; Lindenstruth, Volker; Masoodi, Arshad Ahmad; Ovrebekk, Gaute; Panse, Ralf; Peschek, Jorg; Ploskon, Mateusz; Pocheptsov, Timur; Ram, Dinesh; Rascanu, Theodor; Richter, Matthias; Rohrich, Dieter; Ronchetti, Federico; Skaali, Bernhard; Smorholm, Olav; Stokkevag, Camilla; Steinbeck, Timm Morten; Szostak, Artur; Thader, Jochen; Tveter, Trine; Ullaland, Kjetil; Vilakazi, Zeblon; Weis, Robert; Yin, Zhong-Bao; Zelnicek, Pierre
2011-01-01
The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding to an inp ut data stream of 30 GB/s. In order to fulfill the time requirements, a fast on-line tracker has been developed. The algorithm combines a Cellular Automaton method being used for a fast pattern recognition and the Kalman Filter method for fitting of found trajectories and for the final track selection. The tracker was adapted to run on Graphics Processing Units (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The implementation of the algorithm had to be adjusted at many points to allow for an efficient usage of the graphics cards. In particular, achieving a good overall workload for many processor cores, efficient transfer to and from the GPU, as well as optimized utilization of the different memories the GPU offers turned out to be cri...
Fast box-counting algorithm on GPU.
Jiménez, J; Ruiz de Miras, J
2012-12-01
The box-counting algorithm is one of the most widely used methods for calculating the fractal dimension (FD). The FD has many image analysis applications in the biomedical field, where it has been used extensively to characterize a wide range of medical signals. However, computing the FD for large images, especially in 3D, is a time consuming process. In this paper we present a fast parallel version of the box-counting algorithm, which has been coded in CUDA for execution on the Graphic Processing Unit (GPU). The optimized GPU implementation achieved an average speedup of 28 times (28×) compared to a mono-threaded CPU implementation, and an average speedup of 7 times (7×) compared to a multi-threaded CPU implementation. The performance of our improved box-counting algorithm has been tested with 3D models with different complexity, features and sizes. The validity and accuracy of the algorithm has been confirmed using models with well-known FD values. As a case study, a 3D FD analysis of several brain tissues has been performed using our GPU box-counting algorithm.
2011-11-15
hole binary system. This work resulted in a fast publication in Physical Review . More work is ongoing in the fields of computational mathematics, civil engineering, mechanical engineering, physics, and geophysics.
Adeshina, A M; Hashim, R
2017-03-01
Diagnostic radiology is a core and integral part of modern medicine, paving ways for the primary care physicians in the disease diagnoses, treatments and therapy managements. Obviously, all recent standard healthcare procedures have immensely benefitted from the contemporary information technology revolutions, apparently revolutionizing those approaches to acquiring, storing and sharing of diagnostic data for efficient and timely diagnosis of diseases. Connected health network was introduced as an alternative to the ageing traditional concept in healthcare system, improving hospital-physician connectivity and clinical collaborations. Undoubtedly, the modern medicinal approach has drastically improved healthcare but at the expense of high computational cost and possible breach of diagnosis privacy. Consequently, a number of cryptographical techniques are recently being applied to clinical applications, but the challenges of not being able to successfully encrypt both the image and the textual data persist. Furthermore, processing time of encryption-decryption of medical datasets, within a considerable lower computational cost without jeopardizing the required security strength of the encryption algorithm, still remains as an outstanding issue. This study proposes a secured radiology-diagnostic data framework for connected health network using high-performance GPU-accelerated Advanced Encryption Standard. The study was evaluated with radiology image datasets consisting of brain MR and CT datasets obtained from the department of Surgery, University of North Carolina, USA, and the Swedish National Infrastructure for Computing. Sample patients' notes from the University of North Carolina, School of medicine at Chapel Hill were also used to evaluate the framework for its strength in encrypting-decrypting textual data in the form of medical report. Significantly, the framework is not only able to accurately encrypt and decrypt medical image datasets, but it also
On the performance of a 2D unstructured computational rheology code on a GPU
Pereira, S.P.; Vuik, K.; Pinho, F.T.; Nobrega, J.M.
2013-01-01
The present work explores the massively parallel capabilities of the most advanced architecture of graphics processing units (GPUs) code named “Fermi”, on a two-dimensional unstructured cell-centred finite volume code. We use the SIMPLE algorithm to solve the continuity and momentum equations that
On the performance of a 2D unstructured computational rheology code on a GPU
Pereira, S.P.; Vuik, K.; Pinho, F.T.; Nobrega, J.M.
2013-01-01
The present work explores the massively parallel capabilities of the most advanced architecture of graphics processing units (GPUs) code named “Fermi”, on a two-dimensional unstructured cell-centred finite volume code. We use the SIMPLE algorithm to solve the continuity and momentum equations that w
Probabilistic View-based 3D Curve Skeleton Computation on the GPU
Kustra, Jacek; Jalba, Andrei; Telea, Alexandru
2013-01-01
Computing curve skeletons of 3D shapes is a challenging task. Recently, a high-potential technique for this task was proposed, based on integrating medial information obtained from several 2D projections of a 3D shape. However effective, this technique is strongly influenced in terms of complexity b
Directory of Open Access Journals (Sweden)
Paul Richmond
Full Text Available High performance computing on the Graphics Processing Unit (GPU is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism, achieve mediocre learning results at best. In absence of recurrent connections, where all neurons "vote" independently ("democratic" for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
Richmond, Paul; Buesing, Lars; Giugliano, Michele; Vasilaki, Eleni
2011-05-04
High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In absence of recurrent connections, where all neurons "vote" independently ("democratic") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
Using the CPU and GPU for real-time video enhancement on a mobile computer
CSIR Research Space (South Africa)
Bachoo, AK
2010-09-01
Full Text Available Language (2nd Edition), Addison- Wesley Professional, 2006. [6] D. Shreiner, M. Woo, J. Neider, and T. Davis, OpenGL Programming Guide, Addison-Wesley Professional, 2007. [7] SM Pizer, EP Amburn, JD Austin, R Cromartie, A Geselowitz, T Greer, BM ter... computer,? in The Twentieth Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), 2009. [11] RC Gonzalez and RE Woods, Digital image processing, Addison-Wesley Publishing Company, 2002. ...
A GPU Accelerated Spring Mass System for Surgical Simulation
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
There is a growing demand for surgical simulators to dofast and precise calculations of tissue deformation to simulateincreasingly complex morphology in real-time. Unfortunately, evenfast spring-mass based systems have slow convergence rates for largemodels. This paper presents a method to accele...... to accelerate computation of aspring-mass system in order to simulate a complex organ such as theheart. This acceleration is achieved by taking advantage of moderngraphics processing units (GPU)....
Basket Option Pricing Using GP-GPU Hardware Acceleration
Douglas, Craig C.
2010-08-01
We introduce a basket option pricing problem arisen in financial mathematics. We discretized the problem based on the alternating direction implicit (ADI) method and parallel cyclic reduction is applied to solve the set of tridiagonal matrices generated by the ADI method. To reduce the computational time of the problem, a general purpose graphics processing units (GP-GPU) environment is considered. Numerical results confirm the convergence and efficiency of the proposed method. © 2010 IEEE.
Parallelizing Epistasis Detection in GWAS on FPGA and GPU-Accelerated Computing Systems.
González-Domínguez, Jorge; Wienbrandt, Lars; Kässens, Jan Christian; Ellinghaus, David; Schimmler, Manfred; Schmidt, Bertil
2015-01-01
High-throughput genotyping technologies (such as SNP-arrays) allow the rapid collection of up to a few million genetic markers of an individual. Detecting epistasis (based on 2-SNP interactions) in Genome-Wide Association Studies is an important but time consuming operation since statistical computations have to be performed for each pair of measured markers. Computational methods to detect epistasis therefore suffer from prohibitively long runtimes; e.g., processing a moderately-sized dataset consisting of about 500,000 SNPs and 5,000 samples requires several days using state-of-the-art tools on a standard 3 GHz CPU. In this paper, we demonstrate how this task can be accelerated using a combination of fine-grained and coarse-grained parallelism on two different computing systems. The first architecture is based on reconfigurable hardware (FPGAs) while the second architecture uses multiple GPUs connected to the same host. We show that both systems can achieve speedups of around four orders-of-magnitude compared to the sequential implementation. This significantly reduces the runtimes for detecting epistasis to only a few minutes for moderately-sized datasets and to a few hours for large-scale datasets.
Airborne SAR Real-time Imaging Algorithm Design and Implementation with CUDA on NVIDIA GPU
Directory of Open Access Journals (Sweden)
Meng Da-di
2013-12-01
Full Text Available Synthetic Aperture Radar (SAR image processing requires huge computation amount. Traditionally, this task runs on the workstation or server based on Central Processing Unit (CPU and is rather time-consuming, hence real-time processing of SAR data is impossible. Based on Compute Unified Device Architecture (CUDA technology, a new plan of SAR imaging algorithm operated on NVIDIA Graphic Processing Unit (GPU is proposed. The new proposal makes it possible that the data processing procedure and CPU/GPU data exchanging execute concurrently, especially when SAR data size exceeds total GPU global memory size. Multi-GPU is suitably supported by the new proposal and all of computational resources are fully exploited. It is shown by experiment on NVIDIA K20C and INTEL E5645 that the proposed solution accelerates SAR data processing by tens of times. Consequently, the GPU based SAR processing system with the proposed solution embedded is much more power saving and portable, which makes it qualified to be a real-time SAR data processing system. Experiment shows that SAR data of 36 Mega points can be processed in real-time per second by K20C with the new solution equipped.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Energy Technology Data Exchange (ETDEWEB)
Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun; Jia, Xun, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Jiang, Steve B., E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu [Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas 75390 (United States); Peng, Fei [Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 (United States)
2015-06-15
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is
A GPU based real-time software correlation system for the Murchison Widefield Array prototype
Wayth, Randall B; Briggs, Frank H
2009-01-01
Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32 antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximises the correlation capability of a PC while retaining the fast development time associated with using standard hardw...
Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL
National Research Council Canada - National Science Library
Shimobaba, Tomoyoshi; Ito, Tomoyoshi; Masuda, Nobuyuki; Ichihashi, Yasuyuki; Takada, Naoki
2010-01-01
...) made by AMD and its new software development environment, OpenCL. Using a RV870 GPU and OpenCL, we can calculate 1,920 x 1,024 resolution of a CGH from a 3D object consisting of 1,024 points in 30 milli-seconds...
Research and Implementation of Parallel PLS Algorithm Based on GPU Computing%基于GPU计算的并行PLS算法研究与实现
Institute of Scientific and Technical Information of China (English)
杨辉华; 唐天彪; 李灵巧; 郭拓; 罗国安
2012-01-01
偏最小二乘算法(PLS)是与红外、近红外光谱分析结合使用最为广泛的化学计量学算法,然而当前PLS算法通常采用单线程方式实现,当校正模型数量多或样本数量大、波长点数和主成分数较多,模型需对光谱预处理和波长选择方法反复优化时,计算十分缓慢.为大幅提高建模速度,该文提出了一种基于图形处理器( GPU)的并行计算策略,利用具有大规模并行计算特性的GPU作为计算设备,结合CUBLAS库函数实现了基于GPU并行的PLS建模算法(CUPLS).利用近红外光谱数据集进行性能对比实验,结果表明CUPLS建模算法较传统单线程实现的PLS算法,加速比可达近42倍,极大地提升了化学计量学算法的建模效率.该方法亦可用于其它化学计量学算法的加速.%Partial least squares (PLS) algorithm is one of the most common used chemometric algo-rithms, and is often combined with infrared and near infrared spectroscopy analysis. However, its regular implementation in a single-threaded way makes the modeling process severely ineffective when there are a great deal of models to built, or when there are iterative optimizations of the wavelength ranges and its preprocessing methods need to build an optimal model which contains thousands of sam-ples, enormous data points, and uses a large number of principal components. To give an effective modeling method in this situation, this paper presented a novel parallel chemometric computation strategy which takes the Graphic Processing Unit (GPU) as computing devices, and then the parallel PLS algorithm, i. e. CUPLS, is implemented using the CUBLAS library. Finally, using a large near infrared spectroscopy ( NIR) dataset as the test bed, a performance comparison experiment is conducted, and the results showed that the speed of the parallel algorithm is 42 times faster than that of the CPU-based implementation, which dramatically improves the efficiency of chemometric model
Institute of Scientific and Technical Information of China (English)
朱宇兰
2016-01-01
GPU通用计算是近几年来迅速发展的一个计算领域，以其强大的并行处理能力为密集数据单指令型计算提供了一个绝佳的解决方案，但受限制于芯片的制造工艺，其运算能力遭遇瓶颈。本文从GPU通用计算的基础——图形API开始，分析GPU并行算法特征、运算的过程及特点，并抽象出了一套并行计算框架。通过计算密集行案例，演示了框架的使用方法，并与传统GPU通用计算的实现方法比较，证明了本框架具有代码精简、与图形学无关的特点。%GPGPU(General Purpose Computing on Graphics Processing Unit) is a calculation mothed that develops quiet fast in recent years, it provide an optimal solution for the intensive data calculation of a single instruction with a powerful treatment, however it is restricted in CPU making process to lead to entounter the bottleneck of hardware manufacture. This paper started from GPGPU by Graphics API to analyze the featuers, progress and characteristics of GPU parallel algorithm and obtained a set of computing framework to demonstrate it by an intensive line calculation and compared between the traditional GPU and the parallel computing framework to turn out to show that there was a simplified code and had nothing to do with graphics.
Hu, Hongda; Shu, Hong; Hu, Zhiyong; Xu, Jianhui
2016-04-01
Kriging interpolation provides the best linear unbiased estimation for unobserved locations, but its heavy computation limits the manageable problem size in practice. To address this issue, an efficient interpolation procedure incorporating the fast Fourier transform (FFT) was developed. Extending this efficient approach, we propose an FFT-based parallel algorithm to accelerate regression Kriging interpolation on an NVIDIA® compute unified device architecture (CUDA)-enabled graphic processing unit (GPU). A high-performance cuFFT library in the CUDA toolkit was introduced to execute computation-intensive FFTs on the GPU, and three time-consuming processes were redesigned as kernel functions and executed on the CUDA cores. A MODIS land surface temperature 8-day image tile at a resolution of 1 km was resampled to create experimental datasets at eight different output resolutions. These datasets were used as the interpolation grids with different sizes in a comparative experiment. Experimental results show that speedup of the FFT-based regression Kriging interpolation accelerated by GPU can exceed 1000 when processing datasets with large grid sizes, as compared to the traditional Kriging interpolation running on the CPU. These results demonstrate that the combination of FFT methods and GPU-based parallel computing techniques greatly improves the computational performance without loss of precision.
Application of Photon Transport Monte Carlo Module with GPU-based Parallel System
Energy Technology Data Exchange (ETDEWEB)
Park, Chang Je [Sejong University, Seoul (Korea, Republic of); Shon, Heejeong [Golden Eng. Co. LTD, Seoul (Korea, Republic of); Lee, Donghak [CoCo Link Inc., Seoul (Korea, Republic of)
2015-05-15
In general, it takes lots of computing time to get reliable results in Monte Carlo simulations especially in deep penetration problems with a thick shielding medium. To mitigate such a weakness of Monte Carlo methods, lots of variance reduction algorithms are proposed including geometry splitting and Russian roulette, weight windows, exponential transform, and forced collision, etc. Simultaneously, advanced computing hardware systems such as GPU(Graphics Processing Units)-based parallel machines are used to get a better performance of the Monte Carlo simulation. The GPU is much easier to access and to manage when comparing a CPU cluster system. It also becomes less expensive these days due to enhanced computer technology. There, lots of engineering areas adapt GPU-bases massive parallel computation technique. based photon transport Monte Carlo method. It provides almost 30 times speedup without any optimization and it is expected almost 200 times with fully supported GPU system. It is expected that GPU system with advanced parallelization algorithm will contribute successfully for development of the Monte Carlo module which requires quick and accurate simulations.
GPU Implementation of Bayesian Neural Network Construction for Data-Intensive Applications
Perry, Michelle; Prosper, Harrison B.; Meyer-Baese, Anke
2014-06-01
We describe a graphical processing unit (GPU) implementation of the Hybrid Markov Chain Monte Carlo (HMC) method for training Bayesian Neural Networks (BNN). Our implementation uses NVIDIA's parallel computing architecture, CUDA. We briefly review BNNs and the HMC method and we describe our implementations and give preliminary results.
Jacobian-free Newton-Krylov methods with GPU acceleration for computing nonlinear ship wave patterns
Pethiyagoda, Ravindra; Moroney, Timothy J; Back, Julian M
2014-01-01
The nonlinear problem of steady free-surface flow past a submerged source is considered as a case study for three-dimensional ship wave problems. Of particular interest is the distinctive wedge-shaped wave pattern that forms on the surface of the fluid. By reformulating the governing equations with a standard boundary-integral method, we derive a system of nonlinear algebraic equations that enforce a singular integro-differential equation at each midpoint on a two-dimensional mesh. Our contribution is to solve the system of equations with a Jacobian-free Newton-Krylov method together with a banded preconditioner that is carefully constructed with entries taken from the Jacobian of the linearised problem. Further, we are able to utilise graphics processing unit acceleration to significantly increase the grid refinement and decrease the run-time of our solutions in comparison to schemes that are presently employed in the literature. Our approach provides opportunities to explore the nonlinear features of three-...
GPU-accelerated adjoint algorithmic differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the ;tape;. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
Performance potential for simulating spin models on GPU
Weigel, Martin
2011-01-01
Graphics processing units (GPUs) are recently being used to an increasing degree for general computational purposes. This development is motivated by their theoretical peak performance, which significantly exceeds that of broadly available CPUs. For practical purposes, however, it is far from clear how much of this theoretical performance can be realized in actual scientific applications. As is discussed here for the case of studying classical spin models of statistical mechanics by Monte Carlo simulations, only an explicit tailoring of the involved algorithms to the specific architecture under consideration allows to harvest the computational power of GPU systems. A number of examples, ranging from Metropolis simulations of ferromagnetic Ising models, over continuous Heisenberg and disordered spin-glass systems to parallel-tempering simulations are discussed. Significant speed-ups by factors of up to 1000 compared to serial CPU code as well as previous GPU implementations are observed.
Performance potential for simulating spin models on GPU
Weigel, Martin
2012-04-01
Graphics processing units (GPUs) are recently being used to an increasing degree for general computational purposes. This development is motivated by their theoretical peak performance, which significantly exceeds that of broadly available CPUs. For practical purposes, however, it is far from clear how much of this theoretical performance can be realized in actual scientific applications. As is discussed here for the case of studying classical spin models of statistical mechanics by Monte Carlo simulations, only an explicit tailoring of the involved algorithms to the specific architecture under consideration allows to harvest the computational power of GPU systems. A number of examples, ranging from Metropolis simulations of ferromagnetic Ising models, over continuous Heisenberg and disordered spin-glass systems to parallel-tempering simulations are discussed. Significant speed-ups by factors of up to 1000 compared to serial CPU code as well as previous GPU implementations are observed.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
AVIST: A GPU-Centric Design for Visual Exploration of Large Multidimensional Datasets
Directory of Open Access Journals (Sweden)
Peng Mi
2016-10-01
Full Text Available This paper presents the Animated VISualization Tool (AVIST, an exploration-oriented data visualization tool that enables rapidly exploring and filtering large time series multidimensional datasets. AVIST highlights interactive data exploration by revealing fine data details. This is achieved through the use of animation and cross-filtering interactions. To support interactive exploration of big data, AVIST features a GPU (Graphics Processing Unit-centric design. Two key aspects are emphasized on the GPU-centric design: (1 both data management and computation are implemented on the GPU to leverage its parallel computing capability and fast memory bandwidth; (2 a GPU-based directed acyclic graph is proposed to characterize data transformations triggered by users’ demands. Moreover, we implement AVIST based on the Model-View-Controller (MVC architecture. In the implementation, we consider two aspects: (1 user interaction is highlighted to slice big data into small data; and (2 data transformation is based on parallel computing. Two case studies demonstrate how AVIST can help analysts identify abnormal behaviors and infer new hypotheses by exploring big datasets. Finally, we summarize lessons learned about GPU-based solutions in interactive information visualization with big data.
Accelerated protein structure comparison using TM-score-GPU.
Hung, Ling-Hong; Samudrala, Ram
2012-08-15
Accurate comparisons of different protein structures play important roles in structural biology, structure prediction and functional annotation. The root-mean-square-deviation (RMSD) after optimal superposition is the predominant measure of similarity due to the ease and speed of computation. However, global RMSD is dependent on the length of the protein and can be dominated by divergent loops that can obscure local regions of similarity. A more sophisticated measure of structure similarity, Template Modeling (TM)-score, avoids these problems, and it is one of the measures used by the community-wide experiments of critical assessment of protein structure prediction to compare predicted models with experimental structures. TM-score calculations are, however, much slower than RMSD calculations. We have therefore implemented a very fast version of TM-score for Graphical Processing Units (TM-score-GPU), using a new and novel hybrid Kabsch/quaternion method for calculating the optimal superposition and RMSD that is designed for parallel applications. This acceleration in speed allows TM-score to be used efficiently in computationally intensive applications such as for clustering of protein models and genome-wide comparisons of structure. TM-score-GPU was applied to six sets of models from Nutritious Rice for the World for a total of 3 million comparisons. TM-score-GPU is 68 times faster on an ATI 5870 GPU, on average, than the original CPU single-threaded implementation on an AMD Phenom II 810 quad-core processor. The complete source, including the GPU code and the hybrid RMSD subroutine, can be downloaded and used without restriction at http://software.compbio.washington.edu/misc/downloads/tmscore/. The implementation is in C++/OpenCL.
Cheng, Chun-Pei; Lan, Kuo-Lun; Liu, Wen-Chun; Chang, Ting-Tsung; Tseng, Vincent S
2016-12-01
Hepatitis B viral (HBV) infection is strongly associated with an increased risk of liver diseases like cirrhosis or hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions occurring in HBV genomic DNA are highly associated with the activity of HBV via the interplay between aberrant viral proteins release and human immune system. Deletions finding on the HBV whole genome sequences is thus a very important issue though there exist underlying the challenges in mining such big and complex biological data. Although some next generation sequencing (NGS) tools are recently designed for identifying structural variations such as insertions or deletions, their validity is generally committed to human sequences study. This design may not be suitable for viruses due to different species. We propose a graphics processing unit (GPU)-based data mining method called DeF-GPU to efficiently and precisely identify HBV deletions from large NGS data, which generally contain millions of reads. To fit the single instruction multiple data instructions, sequencing reads are referred to as multiple data and the deletion finding procedure is referred to as a single instruction. We use Compute Unified Device Architecture (CUDA) to parallelize the procedures, and further validate DeF-GPU on 5 synthetic and 1 real datasets. Our results suggest that DeF-GPU outperforms the existing commonly-used method Pindel and is able to exactly identify the deletions of our ground truth in few seconds. The source code and other related materials are available at https://sourceforge.net/projects/defgpu/.
SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU
Energy Technology Data Exchange (ETDEWEB)
Moriya, S; Sato, M [Komazawa University, Setagaya, Tokyo (Japan); Tachibana, H [National Cancer Center Hospital East, Kashiwa, Chiba (Japan)
2015-06-15
Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated to accelerate the convolution calculation using an ideal kernel on the Graphic Processing Units (GPU). Methods: The calculation was performed on the AMD graphics hardware of Dual FirePro D700 and our algorithm was implemented using the Aparapi that convert Java bytecode to OpenCL. The process of dose calculation was separated with the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation running on the central processing unit (CPU) of Intel Xeon E5, the calculation loops were performed for all calculation points. On the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid) and the calculation speed on the GPU to that on the CPU and the accuracy of PDD were compared. Results: The calculation time for the GPU and the CPU were 3.3 sec and 4.4 hour, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU was perfectly matched to that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical that the kernel uses a coarse spacing technique if the calculation is faster while keeping the similar accuracy to a current treatment planning system.
Belleman, R.G.; Bédorf, J.; Portegies Zwart, S.F.
2008-01-01
We present the results of gravitational direct N-body simulations using the graphics processing unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in "Compute Unified Device Architecture" (CUDA) using the GPU to
Belleman, R.G.; Bédorf, J.; Portegies Zwart, S.F.
2008-01-01
We present the results of gravitational direct N-body simulations using the graphics processing unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in "Compute Unified Device Architecture" (CUDA) using the GPU to
Directory of Open Access Journals (Sweden)
Liu Tian-Yuan
2016-01-01
Full Text Available Blade is one of the core components of turbine machinery. The reliability of blade is directly related to the normal operation of plant unit. However, with the increase of blade length and flow rate, non-linear effects such as finite deformation must be considered in strength computation to guarantee enough accuracy. Parallel computation is adopted to improve the efficiency of classical nonlinear finite element method and shorten the blade design period. So it is of extraordinary importance for engineering practice. In this paper, the dynamic partial differential equations and the finite element method forms for turbine blades under centrifugal load and flow load are given firstly. Then, according to the characteristics of turbine blade model, the classical method is optimized based on central processing unit + graphics processing unit heterogeneous parallel computation. Finally, the numerical experiment validations are performed. The computation speed of the algorithm proposed in this paper is compared with the speed of ANSYS. For the rectangle plate model with mesh number of 10 k to 4000 k, a maximum speed-up of 4.31 can be obtained. For the real blade-rim model with mesh number of 500 k, the speed-up of 4.54 times can be obtained.
GPU-based ultrafast IMRT plan optimization
Men, Chunhua; Gu, Xuejun; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B.
2009-11-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purpose. A prostate IMRT case with various beamlet and voxel sizes was used to evaluate our implementation. On an NVIDIA Tesla C1060 GPU card, we have achieved speedup factors of 20-40 without losing accuracy, compared to the results from an Intel Xeon 2.27 GHz CPU. For a specific nine-field prostate IMRT case with 5 × 5 mm2 beamlet size and 2.5 × 2.5 × 2.5 mm3 voxel size, our GPU implementation takes only 2.8 s to generate an optimal IMRT plan. Our work has therefore solved a major problem in developing online re-planning technologies for adaptive radiotherapy.
High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU
Directory of Open Access Journals (Sweden)
STOJANOVIC, N.
2014-11-01
Full Text Available In this paper, the high-performance processing of massive geospatial data on many-core GPU (Graphic Processing Unit is presented. We use CUDA (Compute Unified Device Architecture programming framework to implement parallel processing of common Geographic Information Systems (GIS algorithms, such as viewshed analysis and map-matching. Experimental evaluation indicates the improvement in performance with respect to CPU-based solutions and shows feasibility of using GPU and CUDA for parallel implementation of GIS algorithms over large-scale geospatial datasets.
Lossless data compression for improving the performance of a GPU-based beamformer.
Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
2015-04-01
The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming However, a real time GPU-based beamformer requires high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of its original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables in parallel compression and decompression of data. By this means, the data transfer requirement of hardware interface and the transmission time of CPU to GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved by threefold, as compared with transferring original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigate the transmission bandwidth requirement to transfer data from hardware front end to software system but also reduce the transmission time for CPU to GPU data transfer. © The Author(s) 2014.
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing-environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Beside speed our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculations of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
Fast quantum Monte Carlo on a GPU
Lutsyshyn, Y
2013-01-01
We present a scheme for the parallelization of quantum Monte Carlo on graphical processing units, focusing on bosonic systems and variational Monte Carlo. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent acceleration. Comparing with single core execution, GPU-accelerated code runs over x100 faster. The CUDA code is provided along with the package that is necessary to execute variational Monte Carlo for a system representing liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the latest Kepler architecture K20 GPU. Kepler-specific optimization is discussed.
GASPRNG: GPU accelerated scalable parallel random number generator library
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or
Citizens unite for computational immunology!
Belden, Orrin S; Baker, Sarah Catherine; Baker, Brian M
2015-07-01
Recruiting volunteers who can provide computational time, programming expertise, or puzzle-solving talent has emerged as a powerful tool for biomedical research. Recent projects demonstrate the potential for such 'crowdsourcing' efforts in immunology. Tools for developing applications, new funding opportunities, and an eager public make crowdsourcing a serious option for creative solutions for computationally-challenging problems. Expanded uses of crowdsourcing in immunology will allow for more efficient large-scale data collection and analysis. It will also involve, inspire, educate, and engage the public in a variety of meaningful ways. The benefits are real - it is time to jump in!
High-speed optical coherence tomography signal processing on GPU
Energy Technology Data Exchange (ETDEWEB)
Li Xiqi; Shi Guohua; Zhang Yudong, E-mail: lixiqi@yahoo.cn [Laboratory on Adaptive Optics, Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209 (China)
2011-01-01
The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. This method not only gets a better signal-to noise ratio (SNR) but also gets a faster signal processing time for the SD-OCT than the widely used zero-padding interpolation method. Furthermore, the re-sampled data is obtained by convoluting the acquired data and the coefficients in time domain. Thus, a lot of interpolations can be performed concurrently. So, this interpolation method is suitable for parallel computing. An ultra-high optical coherence tomography signal processing can be realized by using graphics processing unit (GPU) with computer unified device architecture (CUDA). This paper will introduce the signal processing steps of SD-OCT on GPU. An experiment is performed to acquire a frame SD-OCT data (400A-linesx2048 pixel per A-line) and real-time processed the data on GPU. The results show that it can be finished in 6.208 milliseconds, which is 37 times faster than that on Central Processing Unit (CPU).
Travel Software using GPU Hardware
Szalwinski, Chris M; Dimov, Veliko Atanasov; CERN. Geneva. ATS Department
2015-01-01
Travel is the main multi-particle tracking code being used at CERN for the beam dynamics calculations through hadron and ion linear accelerators. It uses two routines for the calculation of space charge forces, namely, rings of charges and point-to-point. This report presents the studies to improve the performance of Travel using GPU hardware. The studies showed that the performance of Travel with the point-to-point simulations of space-charge effects can be speeded up at least 72 times using current GPU hardware. Simple recompilation of the source code using an Intel compiler can improve performance at least 4 times without GPU support. The limited memory of the GPU is the bottleneck. Two algorithms were investigated on this point: repeated computation and tiling. The repeating computation algorithm is simpler and is the currently recommended solution. The tiling algorithm was more complicated and degraded performance. Both build and test instructions for the parallelized version of the software are inclu...
Fast 2-D ultrasound strain imaging: the benefits of using a GPU.
Idzenga, Tim; Gaburov, Evghenii; Vermin, Willem; Menssen, Jan; de Korte, Chris
2014-01-01
Deformation of tissue can be accurately estimated from radio-frequency ultrasound data using a 2-dimensional normalized cross correlation (NCC)-based algorithm. This procedure, however, is very computationally time-consuming. A major time reduction can be achieved by parallelizing the numerous computations of NCC. In this paper, two approaches for parallelization have been investigated: the OpenMP interface on a multi-CPU system and Compute Unified Device Architecture (CUDA) on a graphics processing unit (GPU). The performance of the OpenMP and GPU approaches were compared with a conventional Matlab implementation of NCC. The OpenMP approach with 8 threads achieved a maximum speed-up factor of 132 on the computing of NCC, whereas the GPU approach on an Nvidia Tesla K20 achieved a maximum speed-up factor of 376. Neither parallelization approach resulted in a significant loss in image quality of the elastograms. Parallelization of the NCC computations using the GPU, therefore, significantly reduces the computation time and increases the frame rate for motion estimation.
GHOSTM: a GPU-accelerated homology search tool for metagenomics.
Directory of Open Access Journals (Sweden)
Shuji Suzuki
Full Text Available BACKGROUND: A large number of sensitive homology searches are required for mapping DNA sequence fragments to known protein sequences in public and private databases during metagenomic analysis. BLAST is currently used for this purpose, but its calculation speed is insufficient, especially for analyzing the large quantities of sequence data obtained from a next-generation sequencer. However, faster search tools, such as BLAT, do not have sufficient search sensitivity for metagenomic analysis. Thus, a sensitive and efficient homology search tool is in high demand for this type of analysis. METHODOLOGY/PRINCIPAL FINDINGS: We developed a new, highly efficient homology search algorithm suitable for graphics processing unit (GPU calculations that was implemented as a GPU system that we called GHOSTM. The system first searches for candidate alignment positions for a sequence from the database using pre-calculated indexes and then calculates local alignments around the candidate positions before calculating alignment scores. We implemented both of these processes on GPUs. The system achieved calculation speeds that were 130 and 407 times faster than BLAST with 1 GPU and 4 GPUs, respectively. The system also showed higher search sensitivity and had a calculation speed that was 4 and 15 times faster than BLAT with 1 GPU and 4 GPUs. CONCLUSIONS: We developed a GPU-optimized algorithm to perform sensitive sequence homology searches and implemented the system as GHOSTM. Currently, sequencing technology continues to improve, and sequencers are increasingly producing larger and larger quantities of data. This explosion of sequence data makes computational analysis with contemporary tools more difficult. We developed GHOSTM, which is a cost-efficient tool, and offer this tool as a potential solution to this problem.
大规模有限元系统的GPU加速计算研究%Solving large finite element system by GPU computation
Institute of Scientific and Technical Information of China (English)
刘小虎; 胡耀国; 符伟
2012-01-01
Some techniques for applying GPU（Graphics Processing Units） computation in FEM（Finite El- ement Method） were investigated in this paper, which include element stiffness matrix parallel calcula- tion and global stiffness matrix assembly method, unstructured sparse matrix-vector multiplication and large-scale linear system solving method. A FEM code was implemented by using CUDA（Compute Uni- fied Device Architecture） platform and tested on nVidia GeForce GPU device. The system stiffness ma- trix was stored in the graphics memory in CSR（Compressed Sparse Row） format,and assembled via element coloring. Conjugate gradient method was used to solve FEM linear system iteratively. For the truss and 2D examples, the GPU-based FEM code gained speedups up to 9. 5x and 6.5x, respectively. It is found that the GPU speedup values are roughly linear with system DOFs（Degree Of Freedoms）,and the peak values of GFLOP/s increase approximately 10 times when comparing with those of CPU＇s.%研究了GPU（Graphics Processing Units）计算应用于有限元方法中的总刚计算和组装、稀疏矩阵与向量乘积运算、线性方程组求解问题，并基于CUDA（Compute Unified Device Arehiteeture）平台利用GTX295GPU进行程序实现和测试。系统总刚采用CSR（Compressed SparseRow）压缩格式存放于GPU显存中，用单元染色方法实现总刚并行计算组装，用共轭梯度迭代法求解大规模线性方程组。对300万自由度以内的空间桁架和平面问题算例，GPU有限元计算分别获得最高9．5倍和6．5倍的计算加速比，并且加速比随系统自由度的增加而近似线性增加，GFLOP／s峰值也有近10倍的增加。
Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil
2013-04-04
The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
GPU-ACCELERATED FEM SOLVER FOR THREE DIMENSIONAL ELECTROMAGNETIC ANALYSIS
Institute of Scientific and Technical Information of China (English)
Tian Jin; Gong Li; Shi Xiaowei; Le Xu
2011-01-01
A new Graphics Processing Unit (GPU) parallelization strategy is proposed to accelerate sparse finite element computation for three dimensional electromagnetic analysis.The parallelization strategy is employed based on a new compression format called sliced ELL Four (sliced ELL-F).The sliced ELL-F format-based parallelization strategy is designed for hastening many addition,dot product,and Sparse Matrix Vector Product (SMVP) operations in the Conjugate Gradient Norm (CGN) calculation of finite element equations.The new implementation of SMVP on GPUs is evaluated.The proposed strategy executed on a GPU can efficiently solve sparse finite element equations,especially when the equations are huge sparse (size of most rows in a coefficient matrix is less than 8).Numerical results show the sliced ELL-F format-based parallelization strategy can reach significant speedups compared to Compressed Sparse Row (CSR) format.
Interactive brain shift compensation using GPU based programming
van der Steen, Sander; Noordmans, Herke Jan; Verdaasdonk, Rudolf
2009-02-01
Processing large images files or real-time video streams requires intense computational power. Driven by the gaming industry, the processing power of graphic process units (GPUs) has increased significantly. With the pixel shader model 4.0 the GPU can be used for image processing 10x faster than the CPU. Dedicated software was developed to deform 3D MR and CT image sets for real-time brain shift correction during navigated neurosurgery using landmarks or cortical surface traces defined by the navigation pointer. Feedback was given using orthogonal slices and an interactively raytraced 3D brain image. GPU based programming enables real-time processing of high definition image datasets and various applications can be developed in medicine, optics and image sciences.
NaNet: a configurable NIC bridging the gap between HPC and real-time HEP GPU computing
Lonardo, A.; Ameli, F.; Ammendola, R.; Biagioni, A.; Cotta Ramusino, A.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Pontisso, L.; Rossetti, D.; Simeone, F.; Simula, F.; Sozzi, M.; Tosoratto, L.; Vicini, P.
2015-04-01
NaNet is a FPGA-based PCIe Network Interface Card (NIC) design with GPUDirect and Remote Direct Memory Access (RDMA) capabilities featuring a configurable and extensible set of network channels. The design currently supports both standard—Gbe (1000BASE-T) and 10GbE (10Base-R)—and custom—34 Gbps APElink and 2.5 Gbps deterministic latency KM3link—channels, but its modularity allows for straightforward inclusion of other link technologies. The GPUDirect feature combined with a transport layer offload module and a data stream processing stage makes NaNet a low-latency NIC suitable for real-time GPU processing. In this paper we describe the NaNet architecture and its performances, exhibiting two of its use cases: the GPU-based low-level trigger for the RICH detector in the NA62 experiment at CERN and the on-/off-shore data transport system for the KM3NeT-IT underwater neutrino telescope.
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Energy Technology Data Exchange (ETDEWEB)
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the most de novo techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. This has become a state strategic technique and now plays a key role in many areas of exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations could require enormous amounts of computer memory and time and therefore their execution has been a big problem. Despite the huge computational cost, molecular dynamics have been implemented using traditional computers with a central memory unit (CPU). A graphics processing unit (GPU) computing technology was first designed with the goal to improve video games, by rapidly creating and displaying images in a frame buffer such as screens. The hybrid GPU-CPU implementation, combined with parallel computing is a novel technology to perform a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations including MD simulations. Herein, we describe the new methodologies developed initially as video games and how they are now applied in MD simulations. PMID:27525251
A New Approach to Reduce Memory Consumption in Lattice Boltzmann Method on GPU
Directory of Open Access Journals (Sweden)
Mojtaba Sheida
2017-01-01
Full Text Available Several efforts have been performed to improve LBM defects related to its computational performance. In this work, a new algorithm has been introduced to reduce memory consumption. In the past, most LBM developers have not paid enough attention to retain LBM simplicity in their modified version, while it has been one of the main concerns in developing of the present algorithm. Note, there is also a deficiency in our new algorithm. Besides the memory reduction, because of high memory call back from the main memory, some computational efficiency reduction occurs. To overcome this difficulty, an optimization approach has been introduced, which has recovered this efficiency to the original two-steps two-lattice LBM. This is accomplished by a trade-off between memory reduction and computational performance. To keep a suitable computational efficiency, memory reduction has reached to about 33% in D2Q9 and 42% in D3Q19. In addition, this approach has been implemented on graphical processing unit (GPU as well. In regard to onboard memory limitation in GPU, the advantage of this new algorithm is enhanced even more (39% in D2Q9 and 45% in D3Q19. Note, because of higher memory bandwidth in GPU, computational performance of our new algorithm using GPU is better than CPU.
Ammazzalorso, F.; Bednarz, T.; Jelen, U.
2014-03-01
We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
Simulating and Visualizing Real-Time Crowds on GPU Clusters
Benjamín Hernández; Hugo Pérez; Isaac Rudomin; Sergio Ruiz; Oriam de Gyves; Leonel Toledo
2014-01-01
We present a set of algorithms for simulating and visualizing real-time crowds in GPU (Graphics Processing Units) clusters. First we present crowd simulation and rendering techniques that take advantage of single GPU machines. Then, using as an example a wandering crowd behavior simulation algorithm, we explain how this kind of algorithms can be extended for their use in GPU cluster environments. We also present a visualization architecture that renders the simulation results using detailed 3...
基于GPU的MATLAB计算与仿真研究%Research of MATLAB Calculations and Simulation based on GPU
Institute of Scientific and Technical Information of China (English)
王恒; 高建瓴
2012-01-01
The graphics processing unit (GPU) has become an integral part of todays mainstream computing system, modern GPU is not only a powerful graphics engine, but also a highly parallel programmable processor. Peak arithmetic and memory bandwidth of the GPU often significantly exceeded the corresponding peak and memory bandwidth in its CPU. This paper describes a computational simulation method based on GPU general computing framework for the JACKET accelerating MATLAB, draws simulation results through classic FFT algorithm, analyzes the FFT algorithm in the CPU and GPU running environment Gflops and speedup and comes to the conclusion that under JACKET large scale computing and the GPU-based MATLAB simulation program running has been efficiency greatly improved.%图形处理单元(GPU)已经成为当今的主流计算系统的一个组成部分,现代GPU不仅是一个功能强大的图形引擎,也是一个高度并行的可编程处理器,GPU的峰值运算和内存带宽往往大幅超出其CPU所对应的峰值和内存带宽.本文介绍了基于GPU通用计算框架的JACKET加速MATLAB的计算仿真方法,通过FFT算法得出仿真结果,分析在CPU和GPU运行环境下的GFLOPS和加速比,最后得出基于GPU的MATLAB计算仿真程序运行效率在JACKET的加速下大大提高了.
Xu, Daguang; Huang, Yong; Kang, Jin U
2014-06-16
We implemented the graphics processing unit (GPU) accelerated compressive sensing (CS) non-uniform in k-space spectral domain optical coherence tomography (SD OCT). Kaiser-Bessel (KB) function and Gaussian function are used independently as the convolution kernel in the gridding-based non-uniform fast Fourier transform (NUFFT) algorithm with different oversampling ratios and kernel widths. Our implementation is compared with the GPU-accelerated modified non-uniform discrete Fourier transform (MNUDFT) matrix-based CS SD OCT and the GPU-accelerated fast Fourier transform (FFT)-based CS SD OCT. It was found that our implementation has comparable performance to the GPU-accelerated MNUDFT-based CS SD OCT in terms of image quality while providing more than 5 times speed enhancement. When compared to the GPU-accelerated FFT based-CS SD OCT, it shows smaller background noise and less side lobes while eliminating the need for the cumbersome k-space grid filling and the k-linear calibration procedure. Finally, we demonstrated that by using a conventional desktop computer architecture having three GPUs, real-time B-mode imaging can be obtained in excess of 30 fps for the GPU-accelerated NUFFT based CS SD OCT with frame size 2048(axial) × 1,000(lateral).
Haptic Feedback for the GPU-based Surgical Simulator
DEFF Research Database (Denmark)
Sørensen, Thomas Sangild; Mosegaard, Jesper
2006-01-01
The GPU has proven to be a powerful processor to compute spring-mass based surgical simulations. It has not previously been shown however, how to effectively implement haptic interaction with a simulation running entirely on the GPU. This paper describes a method to calculate haptic feedback...... with limited performance cost. It allows easy balancing of the GPU workload between calculations of simulation, visualisation, and the haptic feedback....
浅谈CUDA并行计算体系%The CUDA Parallel Computing Architecture
Institute of Scientific and Technical Information of China (English)
叶毅嘉
2015-01-01
In recent years, the rapid development of graphics processor unit(GPU) makes it progressively used for general-purpose computing. There are various platforms for parallel computing, for example, compute uniifed device architecture (CUDA) designed by NVIDIA is widely due to the strong computing power of GPU (graphic processing unit) for realizing general parallel computing.%近年来，图形处理器(Graphic Process Unit,GPU)的快速发展使得其逐步用于通用计算。在性能各异的并行计算平台中，英伟达(NVIDIA)公司推出的计算统一设备架构(Compute Unified Device Architecture,CUDA)因为充分利用GPU (Graphic Processing Unit)强大的计算能力实现了通用并行计算而受到研究者们的青睐。
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu
2011-07-01
Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great faculty in solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates ( Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations are reported with the simulations of vacuum boundary condition. The discussion of the relative advantages and disadvantages of the GPU implementation, the simulation on multi GPUs, the programming effort and code portability are also reported. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
Geant4-based Monte Carlo simulations on GPU for medical applications.
Bert, Julien; Perez-Ponce, Hector; El Bitar, Ziad; Jan, Sébastien; Boursier, Yannick; Vintache, Damien; Bonissent, Alain; Morel, Christian; Brasse, David; Visvikis, Dimitris
2013-08-21
Monte Carlo simulation (MCS) plays a key role in medical applications, especially for emission tomography and radiotherapy. However MCS is also associated with long calculation times that prevent its use in routine clinical practice. Recently, graphics processing units (GPU) became in many domains a low cost alternative for the acquisition of high computational power. The objective of this work was to develop an efficient framework for the implementation of MCS on GPU architectures. Geant4 was chosen as the MCS engine given the large variety of physics processes available for targeting different medical imaging and radiotherapy applications. In addition, Geant4 is the MCS engine behind GATE which is actually the most popular medical applications' simulation platform. We propose the definition of a global strategy and associated structures for such a GPU based simulation implementation. Different photon and electron physics effects are resolved on the fly directly on GPU without any approximations with respect to Geant4. Validations have shown equivalence in the underlying photon and electron physics processes between the Geant4 and the GPU codes with a speedup factor of 80-90. More clinically realistic simulations in emission and transmission imaging led to acceleration factors of 400-800 respectively compared to corresponding GATE simulations.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.
Mei, Gang; Xu, Nengxiong; Xu, Liangliang
2016-01-01
This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. In AIDW, it needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; and then the desired prediction value of the interpolated point is obtained by weighted interpolating using the power parameter. In this work, we develop a fast kNN search approach based on the space-partitioning data structure, even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolating. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
Selma, R.
2016-09-01
Paper describes comparison of cross-correlation computation speed of most commonly used computation platforms (CPU, GPU) with an FPGA-based design. It also describes the structure of cross-correlation unit implemented for testing purposes. Speedup of computations was achieved using FPGA-based design, varying between 16 and 5400 times compared to CPU computations and between 3 and 175 times compared to GPU computations.
Vasan, S N Swetadri; Ionita, Ciprian N; Titus, A H; Cartwright, A N; Bednarek, D R; Rudin, S
2012-02-23
We present the image processing upgrades implemented on a Graphics Processing Unit (GPU) in the Control, Acquisition, Processing, and Image Display System (CAPIDS) for the custom Micro-Angiographic Fluoroscope (MAF) detector. Most of the image processing currently implemented in the CAPIDS system is pixel independent; that is, the operation on each pixel is the same and the operation on one does not depend upon the result from the operation on the other, allowing the entire image to be processed in parallel. GPU hardware was developed for this kind of massive parallel processing implementation. Thus for an algorithm which has a high amount of parallelism, a GPU implementation is much faster than a CPU implementation. The image processing algorithm upgrades implemented on the CAPIDS system include flat field correction, temporal filtering, image subtraction, roadmap mask generation and display window and leveling. A comparison between the previous and the upgraded version of CAPIDS has been presented, to demonstrate how the improvement is achieved. By performing the image processing on a GPU, significant improvements (with respect to timing or frame rate) have been achieved, including stable operation of the system at 30 fps during a fluoroscopy run, a DSA run, a roadmap procedure and automatic image windowing and leveling during each frame.
Swetadri Vasan, S. N.; Ionita, Ciprian N.; Titus, A. H.; Cartwright, A. N.; Bednarek, D. R.; Rudin, S.
2012-03-01
We present the image processing upgrades implemented on a Graphics Processing Unit (GPU) in the Control, Acquisition, Processing, and Image Display System (CAPIDS) for the custom Micro-Angiographic Fluoroscope (MAF) detector. Most of the image processing currently implemented in the CAPIDS system is pixel independent; that is, the operation on each pixel is the same and the operation on one does not depend upon the result from the operation on the other, allowing the entire image to be processed in parallel. GPU hardware was developed for this kind of massive parallel processing implementation. Thus for an algorithm which has a high amount of parallelism, a GPU implementation is much faster than a CPU implementation. The image processing algorithm upgrades implemented on the CAPIDS system include flat field correction, temporal filtering, image subtraction, roadmap mask generation and display window and leveling. A comparison between the previous and the upgraded version of CAPIDS has been presented, to demonstrate how the improvement is achieved. By performing the image processing on a GPU, significant improvements (with respect to timing or frame rate) have been achieved, including stable operation of the system at 30 fps during a fluoroscopy run, a DSA run, a roadmap procedure and automatic image windowing and leveling during each frame.
Implementation of GPU-accelerated back projection for EPR imaging.
Qiao, Zhiwei; Redler, Gage; Epel, Boris; Qian, Yuhua; Halpern, Howard
2015-01-01
Electron paramagnetic resonance (EPR) Imaging (EPRI) is a robust method for measuring in vivo oxygen concentration (pO2). For 3D pulse EPRI, a commonly used reconstruction algorithm is the filtered backprojection (FBP) algorithm, in which the backprojection process is computationally intensive and may be time consuming when implemented on a CPU. A multistage implementation of the backprojection can be used for acceleration, however it is not flexible (requires equal linear angle projection distribution) and may still be time consuming. In this work, single-stage backprojection is implemented on a GPU (Graphics Processing Units) having 1152 cores to accelerate the process. The GPU implementation results in acceleration by over a factor of 200 overall and by over a factor of 3500 if only the computing time is considered. Some important experiences regarding the implementation of GPU-accelerated backprojection for EPRI are summarized. The resulting accelerated image reconstruction is useful for real-time image reconstruction monitoring and other time sensitive applications.
Parallel hyperspectral compressive sensing method on GPU
Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.
2015-10-01
Remote hyperspectral sensors collect large amounts of data per flight usually with low spatial resolution. It is known that the bandwidth connection between the satellite/airborne platform and the ground station is reduced, thus a compression onboard method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of an compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral dataset, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA: GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4GHz), with 16 Gbyte memory.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B
2015-01-01
VMAT optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units have been used to speed up the computations. However, its small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix. This paper is to report an implementation of our column-generation based VMAT algorithm on a multi-GPU platform to solve the memory limitation problem. The column-generation approach generates apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. The DDC matrix is split into four sub-matrices according to beam angles, stored on four GPUs in compressed sparse row format. Computation of beamlet price is accomplished using multi-GPU. While the remaining steps of PP and MP problems are implemented on a single GPU due to their modest computational loads. A H&N patient case was used to validate our method. We compare our multi-GPU implemen...
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster
Energy Technology Data Exchange (ETDEWEB)
Allada, Veerendra, Benjegerdes, Troy; Bode, Brett
2009-08-31
Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device have to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as the workhorse subroutine. In this paper, they study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library are studied. The results have been compared with that of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is a Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
MRISIMUL: a GPU-based parallel approach to MRI simulations.
Xanthis, Christos G; Venetis, Ioannis E; Chalkias, A V; Aletras, Anthony H
2014-03-01
A new step-by-step comprehensive MR physics simulator (MRISIMUL) of the Bloch equations is presented. The aim was to develop a magnetic resonance imaging (MRI) simulator that makes no assumptions with respect to the underlying pulse sequence and also allows for complex large-scale analysis on a single computer without requiring simplifications of the MRI model. We hypothesized that such a simulation platform could be developed with parallel acceleration of the executable core within the graphic processing unit (GPU) environment. MRISIMUL integrates realistic aspects of the MRI experiment from signal generation to image formation and solves the entire complex problem for densely spaced isochromats and for a densely spaced time axis. The simulation platform was developed in MATLAB whereas the computationally demanding core services were developed in CUDA-C. The MRISIMUL simulator imaged three different computer models: a user-defined phantom, a human brain model and a human heart model. The high computational power of GPU-based simulations was compared against other computer configurations. A speedup of about 228 times was achieved when compared to serially executed C-code on the CPU whereas a speedup between 31 to 115 times was achieved when compared to the OpenMP parallel executed C-code on the CPU, depending on the number of threads used in multithreading (2-8 threads). The high performance of MRISIMUL allows its application in large-scale analysis and can bring the computational power of a supercomputer or a large computer cluster to a single GPU personal computer.
Naveros, Francisco; Luque, Niceto R; Garrido, Jesús A; Carrillo, Richard R; Anguita, Mancia; Ros, Eduardo
2015-07-01
Time-driven simulation methods in traditional CPU architectures perform well and precisely when simulating small-scale spiking neural networks. Nevertheless, they still have drawbacks when simulating large-scale systems. Conversely, event-driven simulation methods in CPUs and time-driven simulation methods in graphic processing units (GPUs) can outperform CPU time-driven methods under certain conditions. With this performance improvement in mind, we have developed an event-and-time-driven spiking neural network simulator suitable for a hybrid CPU-GPU platform. Our neural simulator is able to efficiently simulate bio-inspired spiking neural networks consisting of different neural models, which can be distributed heterogeneously in both small layers and large layers or subsystems. For the sake of efficiency, the low-activity parts of the neural network can be simulated in CPU using event-driven methods while the high-activity subsystems can be simulated in either CPU (a few neurons) or GPU (thousands or millions of neurons) using time-driven methods. In this brief, we have undertaken a comparative study of these different simulation methods. For benchmarking the different simulation methods and platforms, we have used a cerebellar-inspired neural-network model consisting of a very dense granular layer and a Purkinje layer with a smaller number of cells (according to biological ratios). Thus, this cerebellar-like network includes a dense diverging neural layer (increasing the dimensionality of its internal representation and sparse coding) and a converging neural layer (integration) similar to many other biologically inspired and also artificial neural networks.
Energy Technology Data Exchange (ETDEWEB)
Arafat, Humayun [Department of Computer Science and Engineering, The Ohio State University, Columbus OH USA; Dinan, James [Mathematics and Computer Science Division, Argonne National Laboratory, Lemont IL USA; Krishnamoorthy, Sriram [Computer Science and Mathematics Division, Pacific Northwest National Laboratory, Richland WA USA; Balaji, Pavan [Mathematics and Computer Science Division, Argonne National Laboratory, Lemont IL USA; Sadayappan, P. [Department of Computer Science and Engineering, The Ohio State University, Columbus OH USA
2016-01-06
Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
Large Scale Simulations of the Euler Equations on GPU Clusters
Liebmann, Manfred
2010-08-01
The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one billion elements. We investigate communication protocols for the GPU cluster to compensate for the slow Gigabit Ethernet network between the GPU compute nodes and to maintain overall efficiency. A diesel engine intake-port and a nozzle, meshed in different resolutions, give good real world examples for the scalability tests on the GPU cluster. © 2010 IEEE.
A GPU-based Real-time Software Correlation System for the Murchison Widefield Array Prototype
Wayth, Randall B.; Greenhill, Lincoln J.; Briggs, Frank H.
2009-08-01
Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory, and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32-antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximizes the correlation capability of a PC while retaining the fast development time associated with using standard hardware, networking, and programming languages. In this way, a GPU-based correlation system represents a middle ground in design space between high performance, custom-built hardware, and pure CPU-based software correlation. The correlator was deployed at the Murchison Widefield Array 32-antenna prototype system where it ran in real time for extended periods. We briefly describe the data capture, streaming, and correlation system for the prototype array.
Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.
Directory of Open Access Journals (Sweden)
Cheng-Ying Chou
Full Text Available Positron emission tomography (PET is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consisted of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus request dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU, NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches ingeniously integrate the distinguished features of the symmetry properties of the imaging system and GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
Mano, Omer
2017-01-01
Sensory neuroscience seeks to understand and predict how sensory neurons respond to stimuli. Nonlinear components of neural responses are frequently characterized by the second-order Wiener kernel and the closely-related spike-triggered covariance (STC). Recent advances in data acquisition have made it increasingly common and computationally intensive to compute second-order Wiener kernels/STC matrices. In order to speed up this sort of analysis, we developed a graphics processing unit (GPU)-accelerated module that computes the second-order Wiener kernel of a system’s response to a stimulus. The generated kernel can be easily transformed for use in standard STC analyses. Our code speeds up such analyses by factors of over 100 relative to current methods that utilize central processing units (CPUs). It works on any modern GPU and may be integrated into many data analysis workflows. This module accelerates data analysis so that more time can be spent exploring parameter space and interpreting data. PMID:28068420
GPU acceleration of a nonhydrostatic model for the internal solitary waves simulation
Institute of Scientific and Technical Information of China (English)
CHEN Tong-qing; ZHANG Qing-he
2013-01-01
The parallel computing algorithm for a nonhydrostatic model on one or multiple Graphic Processing Units (GPUs) for the simulation of internal solitary waves is presented and discussed.The computational efficiency of the GPU scheme is analyzed by a series of numerical experiments,including an ideal case and the field scale simulations,performed on the workstation and the supercomputer system.The calculated results show that the speedup of the developed GPU-based parallel computing scheme,compared to the implementation on a single CPU core,increases with the number of computational grid cells,and the speedup can increase quasilinearly with respect to the number of involved GPUs for the problem with relatively large number of grid cells within 32 GPUs.
Cosmological Calculations on the GPU
Bard, Deborah; Allen, Mark T; Yepremyan, Hasmik; Kratochvil, Jan M
2012-01-01
Cosmological measurements require the calculation of nontrivial quantities over large datasets. The next generation of survey telescopes (such as DES, PanSTARRS, and LSST) will yield measurements of billions of galaxies. The scale of these datasets, and the nature of the calculations involved, make cosmological calculations ideal models for implementation on graphics processing units (GPUs). We consider two cosmological calculations, the two-point angular correlation function and the aperture mass statistic, and aim to improve the calculation time by constructing code for calculating them on the GPU. Using CUDA, we implement the two algorithms on the GPU and compare the calculation speeds to comparable code run on the CPU. We obtain a code speed-up of between 10 - 180x faster, compared to performing the same calculation on the CPU. The code has been made publicly available.
Parallel Implementation of Texture Based Image Retrieval on The GPU
Directory of Open Access Journals (Sweden)
Alireza Ahmadi Mohammadabadi
2013-07-01
Full Text Available Most image processing algorithms are inherently parallel, so multithreading processors are suitable in such applications. In huge image databases, image processing takes very long time for run on a single core processor because of single thread execution of algorithms. Graphical Processors Units (GPU is more common in most image processing applications due to multithread execution of algorithms, programmability and low cost. In this paper we implement texture based image retrieval system in parallel using Compute Unified Device Architecture (CUDA programming model to run on GPU. The main goal of this research work is to parallelize the process of texture based image retrieval through entropy, standard deviation, and local range, also whole process is much faster than normal. Our work uses extensive usage of highly multithreaded architecture of multi-cored GPU. We evaluated the retrieval of the proposed technique using Recall, Precision, and Average Precision measures. Experimental results showed that parallel implementation led to an average speed up of 140.046×over the serial implementation. The average Precision and the average Recall of presented method are 39.67% and 55.00% respectively.
A GPU-Parallelized Eigen-Based Clutter Filter Framework for Ultrasound Color Flow Imaging.
Chee, Adrian J Y; Yiu, Billy Y S; Yu, Alfred C H
2017-01-01
Eigen-filters with attenuation response adapted to clutter statistics in color flow imaging (CFI) have shown improved flow detection sensitivity in the presence of tissue motion. Nevertheless, its practical adoption in clinical use is not straightforward due to the high computational cost for solving eigendecompositions. Here, we provide a pedagogical description of how a real-time computing framework for eigen-based clutter filtering can be developed through a single-instruction, multiple data (SIMD) computing approach that can be implemented on a graphical processing unit (GPU). Emphasis is placed on the single-ensemble-based eigen-filtering approach (Hankel singular value decomposition), since it is algorithmically compatible with GPU-based SIMD computing. The key algebraic principles and the corresponding SIMD algorithm are explained, and annotations on how such algorithm can be rationally implemented on the GPU are presented. Real-time efficacy of our framework was experimentally investigated on a single GPU device (GTX Titan X), and the computing throughput for varying scan depths and slow-time ensemble lengths was studied. Using our eigen-processing framework, real-time video-range throughput (24 frames/s) can be attained for CFI frames with full view in azimuth direction (128 scanlines), up to a scan depth of 5 cm ( λ pixel axial spacing) for slow-time ensemble length of 16 samples. The corresponding CFI image frames, with respect to the ones derived from non-adaptive polynomial regression clutter filtering, yielded enhanced flow detection sensitivity in vivo, as demonstrated in a carotid imaging case example. These findings indicate that the GPU-enabled eigen-based clutter filtering can improve CFI flow detection performance in real time.
Topping, T. Russell; French, James; Hancock, Monte F., Jr.
2010-04-01
Working with the Naval Research Laboratory, Celestech has implemented advanced non-linear hyperspectral image (HSI) processing algorithms optimized for Graphics Processing Units (GPU). These algorithms have demonstrated performance improvements of nearly 2 orders of magnitude over optimal CPU-based implementations. The paper briefly covers the architecture of the NIVIDIA GPU to provide a basis for discussing GPU optimization challenges and strategies. The paper then covers optimization approaches employed to extract performance from the GPU implementation of Dr. Bachmann's algorithms including memory utilization and process thread optimization considerations. The paper goes on to discuss strategies for deploying GPU-enabled servers into enterprise service oriented architectures. Also discussed are Celestech's on-going work in the area of middleware frameworks to provide an optimized multi-GPU utilization and scheduling approach that supports both multiple GPUs in a single computer as well as across multiple computers. This paper is a complementary work to the paper submitted by Dr. Charles Bachmann entitled "A Scalable Approach to Modeling Nonlinear Structure in Hyperspectral Imagery and Other High-Dimensional Data Using Manifold Coordinate Representations". Dr. Bachmann's paper covers the algorithmic and theoretical basis for the HSI processing approach.
A GPU code for analytic continuation through a sampling method
Nordström, Johan; Schött, Johan; Locht, Inka L. M.; Di Marco, Igor
We here present a code for performing analytic continuation of fermionic Green's functions and self-energies as well as bosonic susceptibilities on a graphics processing unit (GPU). The code is based on the sampling method introduced by Mishchenko et al. (2000), and is written for the widely used CUDA platform from NVidia. Detailed scaling tests are presented, for two different GPUs, in order to highlight the advantages of this code with respect to standard CPU computations. Finally, as an example of possible applications, we provide the analytic continuation of model Gaussian functions, as well as more realistic test cases from many-body physics.
GPU-Based Techniques for Global Illumination Effects
Szirmay-Kalos, László; Sbert, Mateu
2008-01-01
This book presents techniques to render photo-realistic images by programming the Graphics Processing Unit (GPU). We discuss effects such as mirror reflections, refractions, caustics, diffuse or glossy indirect illumination, radiosity, single or multiple scattering in participating media, tone reproduction, glow, and depth of field. This book targets game developers, graphics programmers, and also students with some basic understanding of computer graphics algorithms, rendering APIs like Direct3D or OpenGL, and shader programming. In order to make this book self-contained, the most important c
Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS.
Davis, Nicholas A; Pandey, Ahwan; McKinney, B A
2011-01-15
Bioinformatics researchers have a variety of programming languages and architectures at their disposal, and recent advances in graphics processing unit (GPU) computing have added a promising new option. However, many performance comparisons inflate the actual advantages of GPU technology. In this study, we carry out a realistic performance evaluation of SNPrank, a network centrality algorithm that ranks single nucleotide polymorhisms (SNPs) based on their importance in the context of a phenotype-specific interaction network. Our goal is to identify the best computational engine for the SNPrank web application and to provide a variety of well-tested implementations of SNPrank for Bioinformaticists to integrate into their research. Using SNP data from the Wellcome Trust Case Control Consortium genome-wide association study of Bipolar Disorder, we compare multiple SNPrank implementations, including Python, Matlab and Java as well as CPU versus GPU implementations. When compared with naïve, single-threaded CPU implementations, the GPU yields a large improvement in the execution time. However, with comparable effort, multi-threaded CPU implementations negate the apparent advantage of GPU implementations. The SNPrank code is open source and available at http://insilico.utulsa.edu/snprank.
An OpenCL implementation for the solution of TDSE on GPU and CPU architectures
O'Broin, Cathal
2012-01-01
Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on Graphical Processing Units (GPUs). In the present work we report the development of a generic parallel single-GPU code for the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the openCL model. We have applied the code in the case of the time-dependent Schr\\"{o}dinger equation of atomic hydrogen in a strong laser field and studied its performance to the two basic kinds of compute units (GPUs and CPUs) . We found an excellent scalability and a significant speed-up of the GPU over the CPU device tending to a value of about 40.
Bédorf, Jeroen; Zwart, Simon Portegies
2012-01-01
We present a gravitational hierarchical N-body code that is designed to run efficiently on Graphics Processing Units (GPUs). All parts of the algorithm are executed on the GPU which eliminates the need for data transfer between the Central Processing Unit (CPU) and the GPU. Our tests indicate that the gravitational tree-code outperforms tuned CPU code for all parts of the algorithm and show an overall performance improvement of more than a factor 20, resulting in a processing rate of more than 2.8 million particles per second.
Directory of Open Access Journals (Sweden)
M. Huang
2015-09-01
Full Text Available The planetary boundary layer (PBL is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds, and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF model of which the Yonsei University (YSU scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU-based implementation. This paper presents our efficient GPU-based design on a WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core as two K40 GPUs are applied.
Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.
2015-09-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design on a WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core as two K40 GPUs are applied.
Directory of Open Access Journals (Sweden)
M. Huang
2014-11-01
Full Text Available The planetary boundary layer (PBL is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds, and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF model of which the Yonsei University (YSU scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using Graphics Processing Unit (GPU massive parallel computing architecture whilst maintaining its accuracy as compared to its CPU-based implementation. This paper presents our efficient GPU-based design on WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its Central Processing Unit (CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores with respect to one CPU core is only 3.5×. We can even boost the speedup to 360× with respect to one CPU core as two K40 GPUs are applied.
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-01
Monte Carlo (MC) particle transport simulation on a graphics-processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry. This limited application scope of these packages. The purpose of this paper is to develop a module to model parametric geometry and integrate it in GPU-based MC simulations. In our module, each continuous region was defined by its bounding surfaces that were parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated to two previously developed GPU-based MC packages and was tested in two example problems: (1) low energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry. The averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data. When the data was stored in GPU's shared memory, the highest computational speed was achieved. Incorporation of parameterized geometry yielded a computation time that was ~3 times of that in the corresponding voxelized geometry. We also developed a strategy to use an auxiliary index array to reduce frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time ranged in 1.75-2.03 times of the voxelized geometry for coupled photon/electron transport depending on the voxel dimension of the auxiliary index array, and in 0
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, a mobile GPU becomes powerful to deal with 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes the method to implement ultrasound signal processing on a mobile GPU available in the high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the optimization of shader design and load sharing between vertex and fragment shader was performed. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation method of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing PSNR with MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method has no significant difference with the processing using MATLAB (i.e., PSNR<52.51 dB). The comparable results of CNR were obtained from both processing methods (i.e., 11.31). From the mobile GPU implementation, the frame rates of 57.6 Hz were achieved. The total execution time was 17.4 ms that was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
Mehrenberger, M; Marradi, L; Crouseilles, N; Sonnendrucker, E; Afeyan, B
2013-01-01
This work concerns the numerical simulation of the Vlasov-Poisson set of equations using semi- Lagrangian methods on Graphical Processing Units (GPU). To accomplish this goal, modifications to traditional methods had to be implemented. First and foremost, a reformulation of semi-Lagrangian methods is performed, which enables us to rewrite the governing equations as a circulant matrix operating on the vector of unknowns. This product calculation can be performed efficiently using FFT routines. Second, to overcome the limitation of single precision inherent in GPU, a {\\delta}f type method is adopted which only needs refinement in specialized areas of phase space but not throughout. Thus, a GPU Vlasov-Poisson solver can indeed perform high precision simulations (since it uses very high order reconstruction methods and a large number of grid points in phase space). We show results for rather academic test cases on Landau damping and also for physically relevant phenomena such as the bump on tail instability and t...
A GPU implementation of the Simulated Annealing Heuristic for the Quadratic Assignment Problem
Paul, Gerald
2012-01-01
The quadratic assignment problem (QAP) is one of the most difficult combinatorial optimization problems. An effective heuristic for obtaining approximate solutions to the QAP is simulated annealing (SA). Here we describe an SA implementation for the QAP which runs on a graphics processing unit (GPU). GPUs are composed of low cost commodity graphics chips which in combination provide a powerful platform for general purpose parallel computing. For SA runs with large numbers of iterations, we fi...
A Fast Poisson Solver with Periodic Boundary Conditions for GPU Clusters in Various Configurations
Rattermann, Dale Nicholas
Fast Poisson solvers using the Fast Fourier Transform on uniform grids are especially suited for parallel implementation, making them appropriate for portability on graphical processing unit (GPU) devices. The goal of the following work was to implement, test, and evaluate a fast Poisson solver for periodic boundary conditions for use on a variety of GPU configurations. The solver used in this research was FLASH, an immersed-boundary-based method, which is well suited for complex, time-dependent geometries, has robust adaptive mesh refinement/de-refinement capabilities to capture evolving flow structures, and has been successfully implemented on conventional, parallel supercomputers. However, these solvers are still computationally costly to employ, and the total solver time is dominated by the solution of the pressure Poisson equation using state-of-the-art multigrid methods. FLASH improves the performance of its multigrid solvers by integrating a parallel FFT solver on a uniform grid during a coarse level. This hybrid solver could then be theoretically improved by replacing the highly-parallelizable FFT solver with one that utilizes GPUs, and, thus, was the motivation for my research. In the present work, the CPU-utilizing parallel FFT solver (PFFT) used in the base version of FLASH for solving the Poisson equation on uniform grids has been modified to enable parallel execution on CUDA-enabled GPU devices. New algorithms have been implemented to replace the Poisson solver that decompose the computational domain and send each new block to a GPU for parallel computation. One-dimensional (1-D) decomposition of the computational domain minimizes the amount of network traffic involved in this bandwidth-intensive computation by limiting the amount of all-to-all communication required between processes. Advanced techniques have been incorporated and implemented in a GPU-centric code design, while allowing end users the flexibility of parameter control at runtime in
GPU-based parallel computing for fast image reconstruction in micro CT%基于GPU并行计算实现快速显微CT重构
Institute of Scientific and Technical Information of China (English)
沈飞; 陈荣昌; 肖体乔
2011-01-01
图像处理器(GPU,graphics processing unit)是一种高度并行化的流式处理器,将其引入显微CT重构可大幅加快重构速度.将CT重建程序并行化设计,使用CUDA编程标准成功实现程序编写,在单块NVDIA GTX295的GPU上运行CT重建程序.结果表明,与CPU串行方法相比,使用GPU并行计算最高可得150倍左右的加速比,将重构时间缩短至15 min左右.因此可认为,GPU并行处理可显著提高显微CT重构速度,采用优化设计的GPU服务器还可进一步提高重构速度.
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform
Directory of Open Access Journals (Sweden)
Ronglin Jiang
2014-01-01
Full Text Available This paper introduces a (finite difference time domain FDTD code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI and Open Multiprocessing (OpenMP. Since both Central Processing Unit (CPU and Graphics Processing Unit (GPU resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with the ones of MoM. Results show that there is a well agreement between them.
Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA.
Mrozek, Dariusz; Brożek, Miłosz; Małysiak-Mrozek, Bożena
2014-02-01
Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560Ti: 384 cores, 2GB RAM) implementation of CASSERT ("GPU-CASSERT") parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at: http://zti.polsl.pl/dmrozek/science/gpucassert/cassert.htm.
Digital image processing using parallel computing based on CUDA technology
Skirnevskiy, I. P.; Pustovit, A. V.; Abdrashitova, M. O.
2017-01-01
This article describes expediency of using a graphics processing unit (GPU) in big data processing in the context of digital images processing. It provides a short description of a parallel computing technology and its usage in different areas, definition of the image noise and a brief overview of some noise removal algorithms. It also describes some basic requirements that should be met by certain noise removal algorithm in the projection to computer tomography. It provides comparison of the performance with and without using GPU as well as with different percentage of using CPU and GPU.
Monte Carlo Simulations of Random Frustrated Systems on Graphics Processing Units
Feng, Sheng; Fang, Ye; Hall, Sean; Papke, Ariane; Thomasson, Cade; Tam, Ka-Ming; Moreno, Juana; Jarrell, Mark
2012-02-01
We study the implementation of the classical Monte Carlo simulation for random frustrated models using the multithreaded computing environment provided by the the Compute Unified Device Architecture (CUDA) on modern Graphics Processing Units (GPU) with hundreds of cores and high memory bandwidth. The key for optimizing the performance of the GPU computing is in the proper handling of the data structure. Utilizing the multi-spin coding, we obtain an efficient GPU implementation of the parallel tempering Monte Carlo simulation for the Edwards-Anderson spin glass model. In the typical simulations, we find over two thousand times of speed-up over the single threaded CPU implementation.
EpiGPU: exhaustive pairwise epistasis scans parallelized on consumer level graphics cards.
Hemani, Gibran; Theocharidis, Athanasios; Wei, Wenhua; Haley, Chris
2011-06-01
Hundreds of genome-wide association studies have been performed over the last decade, but as single nucleotide polymorphism (SNP) chip density has increased so has the computational burden to search for epistasis [for n SNPs the computational time resource is O(n(n-1)/2)]. While the theoretical contribution of epistasis toward phenotypes of medical and economic importance is widely discussed, empirical evidence is conspicuously absent because its analysis is often computationally prohibitive. To facilitate resolution in this field, tools must be made available that can render the search for epistasis universally viable in terms of hardware availability, cost and computational time. By partitioning the 2D search grid across the multicore architecture of a modern consumer graphics processing unit (GPU), we report a 92× increase in the speed of an exhaustive pairwise epistasis scan for a quantitative phenotype, and we expect the speed to increase as graphics cards continue to improve. To achieve a comparable computational improvement without a graphics card would require a large compute-cluster, an option that is often financially non-viable. The implementation presented uses OpenCL--an open-source library designed to run on any commercially available GPU and on any operating system. The software is free, open-source, platformindependent and GPU-vendor independent. It can be downloaded from http://sourceforge.net/projects/epigpu/.
GPU accelerated flow solver for direct numerical simulation of turbulent flows
Salvadore, Francesco; Bernardini, Matteo; Botti, Michela
2013-02-01
Graphical processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non trivial efforts, even to high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier-Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that a careful optimization work is crucial to get the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.
GPU accelerated flow solver for direct numerical simulation of turbulent flows
Energy Technology Data Exchange (ETDEWEB)
Salvadore, Francesco [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy); Bernardini, Matteo, E-mail: matteo.bernardini@uniroma1.it [Department of Mechanical and Aerospace Engineering, University of Rome ‘La Sapienza’ – via Eudossiana 18, 00184 Rome (Italy); Botti, Michela [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy)
2013-02-15
Graphical processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non trivial efforts, even to high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier–Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that a careful optimization work is crucial to get the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.
GPU-based fast Monte Carlo simulation for radiotherapy dose calculation.
Jia, Xun; Gu, Xuejun; Graves, Yan Jiang; Folkerts, Michael; Jiang, Steve B
2011-11-21
Monte Carlo (MC) simulation is commonly considered to be the most accurate dose calculation method in radiotherapy. However, its efficiency still requires improvement for many routine clinical applications. In this paper, we present our recent progress toward the development of a graphics processing unit (GPU)-based MC dose calculation package, gDPM v2.0. It utilizes the parallel computation ability of a GPU to achieve high efficiency, while maintaining the same particle transport physics as in the original dose planning method (DPM) code and hence the same level of simulation accuracy. In GPU computing, divergence of execution paths between threads can considerably reduce the efficiency. Since photons and electrons undergo different physics and hence attain different execution paths, we use a simulation scheme where photon transport and electron transport are separated to partially relieve the thread divergence issue. A high-performance random number generator and a hardware linear interpolation are also utilized. We have also developed various components to handle the fluence map and linac geometry, so that gDPM can be used to compute dose distributions for realistic IMRT or VMAT treatment plans. Our gDPM package is tested for its accuracy and efficiency in both phantoms and realistic patient cases. In all cases, the average relative uncertainties are less than 1%. A statistical t-test is performed and the dose difference between the CPU and the GPU results is not found to be statistically significant in over 96% of the high dose region and over 97% of the entire region. Speed-up factors of 69.1 ∼ 87.2 have been observed using an NVIDIA Tesla C2050 GPU card against a 2.27 GHz Intel Xeon CPU processor. For realistic IMRT and VMAT plans, MC dose calculation can be completed with less than 1% standard deviation in 36.1 ∼ 39.6 s using gDPM.
Fourtakas, G.; Rogers, B. D.
2016-06-01
A two-phase numerical model using Smoothed Particle Hydrodynamics (SPH) is applied to two-phase liquid-sediments flows. The absence of a mesh in SPH is ideal for interfacial and highly non-linear flows with changing fragmentation of the interface, mixing and resuspension. The rheology of sediment induced under rapid flows undergoes several states which are only partially described by previous research in SPH. This paper attempts to bridge the gap between the geotechnics, non-Newtonian and Newtonian flows by proposing a model that combines the yielding, shear and suspension layer which are needed to predict accurately the global erosion phenomena, from a hydrodynamics prospective. The numerical SPH scheme is based on the explicit treatment of both phases using Newtonian and the non-Newtonian Bingham-type Herschel-Bulkley-Papanastasiou constitutive model. This is supplemented by the Drucker-Prager yield criterion to predict the onset of yielding of the sediment surface and a concentration suspension model. The multi-phase model has been compared with experimental and 2-D reference numerical models for scour following a dry-bed dam break yielding satisfactory results and improvements over well-known SPH multi-phase models. With 3-D simulations requiring a large number of particles, the code is accelerated with a graphics processing unit (GPU) in the open-source DualSPHysics code. The implementation and optimisation of the code achieved a speed up of x58 over an optimised single thread serial code. A 3-D dam break over a non-cohesive erodible bed simulation with over 4 million particles yields close agreement with experimental scour and water surface profiles.
GPU based framework for geospatial analyses
Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus
2017-04-01
Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is just at the beginning. Being able to use an simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that each scientist wants to have. The necessity to have high speed computation in geosciences has increased in the last 10 years, mostly due to the increase in the available datasets. These datasets are becoming more and more detailed and hence they require more space to store and more time to process. Distributed computation on multicore CPU's and GPU's plays an important role by processing one by one small parts from these big datasets. These way of computations allows to speed up the process, because instead of using just one process for each dataset, the user can use all the cores from a CPU or up to hundreds of cores from GPU The framework provide to the end user a standalone tools for morphometry analyses at multiscale level. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from the data collection or may be induced by the model or may have an infinite sources. These uncertainties plays important roles when a spatial delineation of the phenomena is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with the standalone tools proved to be a reliable tool for modelling complex natural phenomena The framework is based on NVidia Cuda technology and is written in C++ programming language. The code source will be available on github at https://github.com/sandricionut/GeoRsGPU Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric
Advanced Investigation and Comparative Study of Graphics Processing Unit-queries Countered
Directory of Open Access Journals (Sweden)
A. Baskar
2014-10-01
Full Text Available GPU, Graphics Processing Unit, is the buzz word ruling the market these days. What is that and how has it gained that much importance is what to be answered in this research work. The study has been constructed with full attention paid towards answering the following question. What is a GPU? How is it different from a CPU? How good/bad it is computationally when comparing to CPU? Can GPU replace CPU, or it is a day dream? How significant is arrival of APU (Accelerated Processing Unit in market? What tools are needed to make GPU work? What are the improvement/focus areas for GPU to stand in the market? All the above questions are discussed and answered well in this study with relevant explanations.
GPU Pro 4 advanced rendering techniques
Engel, Wolfgang
2013-01-01
GPU Pro4: Advanced Rendering Techniques presents ready-to-use ideas and procedures that can help solve many of your day-to-day graphics programming challenges. Focusing on interactive media and games, the book covers up-to-date methods producing real-time graphics. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Sebastien St-Laurent have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book begins with discussions on the abi
Electromagnetic metamaterial simulations using a GPU-accelerated FDTD method
Seok, Myung-Su; Lee, Min-Gon; Yoo, SeokJae; Park, Q.-Han
2015-12-01
Metamaterials composed of artificial subwavelength structures exhibit extraordinary properties that cannot be found in nature. Designing artificial structures having exceptional properties plays a pivotal role in current metamaterial research. We present a new numerical simulation scheme for metamaterial research. The scheme is based on a graphic processing unit (GPU)-accelerated finite-difference time-domain (FDTD) method. The FDTD computation can be significantly accelerated when GPUs are used instead of only central processing units (CPUs). We explain how the fast FDTD simulation of large-scale metamaterials can be achieved through communication optimization in a heterogeneous CPU/GPU-based computer cluster. Our method also includes various advanced FDTD techniques: the non-uniform grid technique, the total-field/scattered-field (TFSF) technique, the auxiliary field technique for dispersive materials, the running discrete Fourier transform, and the complex structure setting. We demonstrate the power of our new FDTD simulation scheme by simulating the negative refraction of light in a coaxial waveguide metamaterial.
Parallel Implementation of Color Based Image Retrieval Using CUDA on the GPU
Directory of Open Access Journals (Sweden)
Hadis Heidari
2013-12-01
Full Text Available Most image processing algorithms are inherently parallel, so multithreading processors are suitable in such applications. In huge image databases, image processing takes very long time for run on a single core processor because of single thread execution of algorithms. Graphical Processors Units (GPU is more common in most image processing applications due to multithread execution of algorithms, programmability and low cost. In this paper we implement color based image retrieval system in parallel using Compute Unified Device Architecture (CUDA programming model to run on GPU. The main goal of this research work is to parallelize the process of color based image retrieval through color moments; also whole process is much faster than normal. Our work uses extensive usage of highly multithreaded architecture of multi-cored GPU. An efficient use of shared memory is needed to optimize parallel reduction in CUDA. We evaluated the retrieval of the proposed technique using Recall, Precision, and Average Precision measures. Experimental results showed that parallel implementation led to an average speed up of 6.305×over the serial implementation when running on a NVIDIA GPU GeForce 610M. The average Precision and the average Recall of presented method are 53.84% and 55.00% respectively.
GPU-based normalized cuts for road extraction using satellite imagery
Indian Academy of Sciences (India)
J Senthilnath; S Sindhu; S N Omkar
2014-12-01
This paper presents a GPU implementation of normalized cuts for road extraction problem using panchromatic satellite imagery. The roads have been extracted in three stages namely pre-processing, image segmentation and post-processing. Initially, the image is pre-processed to improve the tolerance by reducing the clutter (that mostly represents the buildings, vegetation, and fallow regions). The road regions are then extracted using the normalized cuts algorithm. Normalized cuts algorithm is a graph-based partitioning approach whose focus lies in extracting the global impression (perceptual grouping) of an image rather than local features. For the segmented image, post-processing is carried out using morphological operations – erosion and dilation. Finally, the road extracted image is overlaid on the original image. Here, a GPGPU (General Purpose Graphical Processing Unit) approach has been adopted to implement the same algorithm on the GPU for fast processing. A performance comparison of this proposed GPU implementation of normalized cuts algorithm with the earlier algorithm (CPU implementation) is presented. From the results, we conclude that the computational improvement in terms of time as the size of image increases for the proposed GPU implementation of normalized cuts. Also, a qualitative and quantitative assessment of the segmentation results has been projected.
GPU-accelerated raster map reprojection
Directory of Open Access Journals (Sweden)
Petr Sloup
2016-07-01
Full Text Available Reprojecting raster maps from one projection to another is an essential part of many cartographic processes (map comparison, overlays, data presentation, ... and reducing the required computational time is desirable and often significantly decreases overall processing costs.The raster reprojection process operates per-pixel and is, therefore, a good candidate for GPU-based parallelization where the large number of processors can lead to a very high degree of parallelism.We have created an experimental implementation of the raster reprojection with GPU-based parallelization (using OpenCL API.During the evaluation, we compared the performance of our implementation to the optimized GDAL and showed that there is a class of problems where GPU-based parallelization can lead to more than sevenfold speedup.
ITS Cluster Finding Algorithm on GPU
Changaival, Boonyarit
2014-01-01
ITS cluster finding algorithm is one of the data reduction algorithms at ALICE. It needs to be processed fast due to a high amount of data readout from the detector. A variety of platforms were studied for the system design. My work is to design, implement and benchmark this algorithm on a GPU platform. GPU is one of many platform that promote parallel computing. A high-end GPU can contain over 2000 processing cores comparing to the commodity CPUs which have only four cores. The program is written in C and CUDA library. The throughput (Number of events per second) is used as a metric to measure the performance. With the latest implementation, the throughput was increased by a factor of 5.
Using GPU shaders for visualization, part 3.
Bailey, Mike
2013-01-01
GPU shaders aren't just for glossy special effects. Parts 1 and 2 of this discussion looked at using them for point clouds, cutting planes, line integral convolution, and terrain bump-mapping. Part 3 covers compute shaders and shader storage buffer objects-two features announced as part of OpenGL 4.3.
基于GPU的CUDA应用开发环境构架%CUDA Application Development Framework Structure based on GPU
Institute of Scientific and Technical Information of China (English)
邓力; 陈晓翔; 林嘉宇
2013-01-01
With its rapid development, the powerful computing power of the GPU ( graphics processing unit) makes it to be used for the non - graphics of the calculation increasingly from the first using only for the graphics of the calculation. In CPU - GPU heterogeneous system, CPU is responsible for the calculation of complex logic operation and affairs management which is not suitable for parallel processing of data, GPU is responsible for the large - scale data calculation of intensity calculation and simple logic branch which is suitable for parallel processing. The continuous improvement of CPU - GPU system makes the large - scale scientific computing to be an inevitable trend by means of speed up of GPU. Focusing on the application development of GPU, the paper introduces the structure of VS2008 + CUDA development platform in WINDOWS environment, and compares the scientific computing performance of GPU and CPU in this structure.%随着GPU(graphics processing unit,图像处理单元)的快速发展,其强大的计算能力使得GPU由最初仅用于加速图形计算,越来越多地应用到非图形领域的计算.在CPU-GPU体系中,CPU负责进行复杂的逻辑运算和事务管理等不适合并行处理的数据计算,GPU负责进行计算密集度高、逻辑分支简单的适合并行处理的大规模数据计算.CPU-GPU体系的不断完善,使得利用GPU来加速大规模科学计算成为了一种必然趋势.着眼GPU的应用开发,介绍在windows环境下CUDA+ VS2008开发平台的构架,并对该构架下GPU与CPU的科学计算性能进行比对.
RESEARCH ON SOLVING TRAVELLING SALESMAN PROBLEM USING RANK BASED ANT SYSTEM ON GPU
Directory of Open Access Journals (Sweden)
Khushbu Khatri
2015-10-01
Full Text Available Ant Colony Optimization (ACO is meta-heuristic algorithm inspired from nature to solve many combinatorial optimization problems such as Travelling Salesman Problem (TSP. There are many versions of ACO used to solve TSP like, Ant System, Elitist Ant System, Max-Min Ant System, Rank based Ant System algorithm. For improved performance, these methods can be implemented in parallel architecture like GPU, CUDA architecture. Graphics Processing Unit (GPU provides highly parallel and fully programmable platform. GPUs which have many processing units with an off-chip global memory can be used for general purpose parallel computation. This paper presents a parallel Rank Based Ant System algorithm to solve TSP by use of Pre Roulette Wheel Selection Method.
Real-time, fast radio transient searches with GPU de-dispersion
Magro, A.; Karastergiou, A.; Salvini, S.; Mort, B.; Dulwich, F.; Zarb Adami, K.
2011-11-01
The identification and subsequent discovery of fast radio transients using blind-search surveys require a large amount of processing power, in worst cases scaling as ?. For this reason, survey data are generally processed off-line, using high-performance computing architectures or hardware-based designs. In recent years, graphics processing units (GPUs) have been extensively used for numerical analysis and scientific simulations, especially after the introduction of new high-level application programming interfaces. Here, we show how GPUs can be used for fast transient discovery in real time. We present a solution to the problem of de-dispersion, providing performance comparisons with a typical computing machine and traditional pulsar processing software. We describe the architecture of a real-time, GPU-based transient search machine. In terms of performance, our GPU solution provides a speed-up factor of between 50 and 200, depending on the parameters of the search.
Real-time, fast radio transient searches with GPU de-dispersion
Magro, Alessio; Salvini, Stefano; Mort, Benjamin; Dulwich, Fred; Adami, Kristian Zarb
2011-01-01
The identification, and subsequent discovery, of fast radio transients through blind-search surveys requires a large amount of processing power, in worst cases scaling as $\\mathcal{O}(N^3)$. For this reason, survey data are generally processed offline, using high-performance computing architectures or hardware-based designs. In recent years, graphics processing units have been extensively used for numerical analysis and scientific simulations, especially after the introduction of new high-level application programming interfaces. Here we show how GPUs can be used for fast transient discovery in real-time. We present a solution to the problem of de-dispersion, providing performance comparisons with a typical computing machine and traditional pulsar processing software. We describe the architecture of a real-time, GPU-based transient search machine. In terms of performance, our GPU solution provides a speed-up factor of between 50 and 200, depending on the parameters of the search.
Data Assimilation using a GPU Accelerated Path Integral Monte Carlo Approach
Quinn, John C
2011-01-01
The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov Chain Monte Carlo method designed to run in parallel on a Graphics Processing Unit (GPU). We demonstrate the application of the method to an example with a transmembrane voltage time series of a simulated neuron as an input, and using a Hodgkin-Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 200 times faster than an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.
An Efficient Parallel Algorithm for Graph Isomorphism on GPU using CUDA
Directory of Open Access Journals (Sweden)
Min-Young Son
2015-10-01
Full Text Available Modern Graphics Processing Units (GPUs have high computation power and low cost. Recently, many applications in various fields have been computed powerfully on the GPU using CUDA. In this paper, we propose an efficient parallel algorithm for graph isomorphism which runs on the GPU using CUDA for matching large graphs. Parallelization of a sequential graph isomorphism algorithm is one of the hardest problems because it includes inherently sequential characteristics. Our approach divides the given graphs into smaller blocks using a divide-and-conquer, and then maps the blocks to parallel processing units on the GPU. The smaller blocks are solved in individual processing units, and then the results are combined using hierarchical procedures. In the experiment, we used random graphs from vertices of small size to up to tens of thousands of vertices in order to solve efficiently graph isomorphism for large graphs. The experimental results show that the proposed approach brings a considerable improvement in performance and efficiency comparing to the CPU-based results. Our result also shows high performance, especially on large graphs.
Multi-GPU Accelerated Multi-Spin Monte Carlo Simulations of the 2D Ising Model
Block, Benjamin; Preis, Tobias; 10.1016/j.cpc.2010.05.005
2010-01-01
A modern graphics processing unit (GPU) is able to perform massively parallel scientific computations at low cost. We extend our implementation of the checkerboard algorithm for the two dimensional Ising model [T. Preis et al., J. Comp. Phys. 228, 4468 (2009)] in order to overcome the memory limitations of a single GPU which enables us to simulate significantly larger systems. Using multi-spin coding techniques, we are able to accelerate simulations on a single GPU by factors up to 35 compared to an optimized single Central Processor Unit (CPU) core implementation which employs multi-spin coding. By combining the Compute Unified Device Architecture (CUDA) with the Message Parsing Interface (MPI) on the CPU level, a single Ising lattice can be updated by a cluster of GPUs in parallel. For large systems, the computation time scales nearly linearly with the number of GPUs used. As proof of concept we reproduce the critical temperature of the 2D Ising model using finite size scaling techniques.
CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis
Directory of Open Access Journals (Sweden)
Federico Raimondo
2012-01-01
Full Text Available In recent years, Independent Component Analysis (ICA has become a standard to identify relevant dimensions of the data in neuroscience. ICA is a very reliable method to analyze data but it is, computationally, very costly. The use of ICA for online analysis of the data, used in brain computing interfaces, results are almost completely prohibitive. We show an increase with almost no cost (a rapid video card of speed of ICA by about 25 fold. The EEG data, which is a repetition of many independent signals in multiple channels, is very suitable for processing using the vector processors included in the graphical units. We profiled the implementation of this algorithm and detected two main types of operations responsible of the processing bottleneck and taking almost 80% of computing time: vector-matrix and matrix-matrix multiplications. By replacing function calls to basic linear algebra functions to the standard CUBLAS routines provided by GPU manufacturers, it does not increase performance due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, comparing with the original BLAS and CUBLAS versions, obtains a 25x increase of performance for the ICA calculation.
CUDAICA: GPU optimization of Infomax-ICA EEG analysis.
Raimondo, Federico; Kamienkowski, Juan E; Sigman, Mariano; Fernandez Slezak, Diego
2012-01-01
In recent years, Independent Component Analysis (ICA) has become a standard to identify relevant dimensions of the data in neuroscience. ICA is a very reliable method to analyze data but it is, computationally, very costly. The use of ICA for online analysis of the data, used in brain computing interfaces, results are almost completely prohibitive. We show an increase with almost no cost (a rapid video card) of speed of ICA by about 25 fold. The EEG data, which is a repetition of many independent signals in multiple channels, is very suitable for processing using the vector processors included in the graphical units. We profiled the implementation of this algorithm and detected two main types of operations responsible of the processing bottleneck and taking almost 80% of computing time: vector-matrix and matrix-matrix multiplications. By replacing function calls to basic linear algebra functions to the standard CUBLAS routines provided by GPU manufacturers, it does not increase performance due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, comparing with the original BLAS and CUBLAS versions, obtains a 25x increase of performance for the ICA calculation.
Providing Source Code Level Portability Between CPU and GPU with MapCG
Institute of Scientific and Technical Information of China (English)
Chun-Tao Hong; De-Hao Chen; Yu-Bei Chen; Wen-Guang Chen; Wei-Min Zheng; Hai-Bo Lin
2012-01-01
Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years.At present,the common approach to programming GPU units is to write GPU specific code with low level GPU APIs such as CUDA.Although this approach can achieve good performance,it creates serious portability issues as programmers are required to write a specific version of the code for each potential target architecture.This results in high development and maintenance costs.We believe it is desirable to have a programming model which provides source code portability between CPUs and GPUs,as well as different GPUs.This would allow programmers to write one version of the code,which can be compiled and executed on either CPUs or GPUs efficiently without modification.In this paper,we propose MapCG,a MapReduce framework to provide source code level portability between CPUs and GPUs.In contrast to other approaches such as OpenCL,our framework,based on MapReduce,provides a high level programming model and makes programming much easier.We describe the design of MapCG,including the MapReduce-style high-level programming framework and the runtime system on the CPU and GPU.A prototype of the MapCG runtime,supporting multi-core CPUs and NVIDIA GPUs,was implemented. Our experimental results show that this implementation can execute the same source code efficiently on multi-core CPU platforms and GPUs,achieving an average speedup of 1.6～2.5x over previous implementations of MapReduce on eight commonly used applications.
CMFD and GPU acceleration on method of characteristics for hexagonal cores
Energy Technology Data Exchange (ETDEWEB)
Han, Yu, E-mail: hanyu1203@gmail.com [School of Nuclear Science and Engineering, Shanghai Jiaotong University, Shanghai 200240 (China); Jiang, Xiaofeng [Shanghai NuStar Nuclear Power Technology Co., Ltd., No. 81 South Qinzhou Road, XuJiaHui District, Shanghai 200000 (China); Wang, Dezhong [School of Nuclear Science and Engineering, Shanghai Jiaotong University, Shanghai 200240 (China)
2014-12-15
Highlights: • A merged hex-mesh CMFD method solved via tri-diagonal matrix inversion. • Alternative hardware acceleration of using inexpensive GPU. • A hex-core benchmark with solution to confirm two acceleration methods. - Abstract: Coarse Mesh Finite Difference (CMFD) has been widely adopted as an effective way to accelerate the source iteration of transport calculation. However in a core with hexagonal assemblies there are non-hexagonal meshes around the edges of assemblies, causing a problem for CMFD if the CMFD equations are still to be solved via tri-diagonal matrix inversion by simply scanning the whole core meshes in different directions. To solve this problem, we propose an unequal mesh CMFD formulation that combines the non-hexagonal cells on the boundary of neighboring assemblies into non-regular hexagonal cells. We also investigated the alternative hardware acceleration of using graphics processing units (GPU) with graphics card in a personal computer. The tool CUDA is employed, which is a parallel computing platform and programming model invented by the company NVIDIA for harnessing the power of GPU. To investigate and implement these two acceleration methods, a 2-D hexagonal core transport code using the method of characteristics (MOC) is developed. A hexagonal mini-core benchmark problem is established to confirm the accuracy of the MOC code and to assess the effectiveness of CMFD and GPU parallel acceleration. For this benchmark problem, the CMFD acceleration increases the speed 16 times while the GPU acceleration speeds it up 25 times. When used simultaneously, they provide a speed gain of 292 times.
Nakasato, N
2009-01-01
The kd-tree is a fundamental tool in computer science. Among others, an application of the kd-tree search (oct-tree method) to fast evaluation of particle interactions and neighbor search is highly important since computational complexity of these problems are reduced from O(N^2) with a brute force method to O(N log N) with the tree method where N is a number of particles. In this paper, we present a parallel implementation of the tree method running on a graphic processor unit (GPU). We successfully run a simulation of structure formation in the universe very efficiently. On our system, which costs roughly $900, the run with N ~ 2.87x10^6 particles took 5.79 hours and executed 1.2x10^13 force evaluations in total. We obtained the sustained computing speed of 21.8 Gflops and the cost per Gflops of 41.6/Gflops that is two and half times better than the previous record in 2006.
CPU and GPU (Cuda Template Matching Comparison
Directory of Open Access Journals (Sweden)
Evaldas Borcovas
2014-05-01
Full Text Available Image processing, computer vision or other complicated opticalinformation processing algorithms require large resources. It isoften desired to execute algorithms in real time. It is hard tofulfill such requirements with single CPU processor. NVidiaproposed CUDA technology enables programmer to use theGPU resources in the computer. Current research was madewith Intel Pentium Dual-Core T4500 2.3 GHz processor with4 GB RAM DDR3 (CPU I, NVidia GeForce GT320M CUDAcompliable graphics card (GPU I and Intel Core I5-2500K3.3 GHz processor with 4 GB RAM DDR3 (CPU II, NVidiaGeForce GTX 560 CUDA compatible graphic card (GPU II.Additional libraries as OpenCV 2.1 and OpenCV 2.4.0 CUDAcompliable were used for the testing. Main test were made withstandard function MatchTemplate from the OpenCV libraries.The algorithm uses a main image and a template. An influenceof these factors was tested. Main image and template have beenresized and the algorithm computing time and performancein Gtpix/s have been measured. According to the informationobtained from the research GPU computing using the hardwarementioned earlier is till 24 times faster when it is processing abig amount of information. When the images are small the performanceof CPU and GPU are not significantly different. Thechoice of the template size makes influence on calculating withCPU. Difference in the computing time between the GPUs canbe explained by the number of cores which they have.
Computational unit for non-contact photonic system
Kochetov, Alexander V.; Skrylev, Pavel A.
2005-06-01
Requirements to the unified computational unit for non-contact photonic system have been formulated. Estimation of central processing unit performance and required memory size are calculated. Specialized microcontroller optimal to use as central processing unit has been selected. Memory chip types are determinated for system. The computational unit consists of central processing unit based on selected microcontroller, NVRAM memory, receiving circuit, SDRAM memory, control and power circuits. It functions, as performing unit that calculates required parameters ofrail track.
Massively Parallel Latent Semantic Analyzes using a Graphics Processing Unit
Energy Technology Data Exchange (ETDEWEB)
Cavanagh, Joseph M [ORNL; Cui, Xiaohui [ORNL
2009-01-01
Latent Semantic Indexing (LSA) aims to reduce the dimensions of large Term-Document datasets using Singular Value Decomposition. However, with the ever expanding size of data sets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. The Graphics Processing Unit (GPU) can solve some highly parallel problems much faster than the traditional sequential processor (CPU). Thus, a deployable system using a GPU to speedup large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a computer cluster. Due to the GPU s application-specific architecture, harnessing the GPU s computational prowess for LSA is a great challenge. We present a parallel LSA implementation on the GPU, using NVIDIA Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms. The performance of this implementation is compared to traditional LSA implementation on CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000x1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits the GPU has for matrices divisible by 16. It should be noted that the overall speeds for the CPU version did not vary from relative normal when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.
Accelerating sparse linear algebra using graphics processing units
Spagnoli, Kyle E.; Humphrey, John R.; Price, Daniel K.; Kelmelis, Eric J.
2011-06-01
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of over 1 TFLOPS of peak computational throughput at a cost similar to a high-end CPU with excellent FLOPS-to-watt ratio. High-level sparse linear algebra operations are computationally intense, often requiring large amounts of parallel operations and would seem a natural fit for the processing power of the GPU. Our work is on a GPU accelerated implementation of sparse linear algebra routines. We present results from both direct and iterative sparse system solvers. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via hybrid a processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally. For example, the CPU is responsible for graph theory portion of the direct solvers while the GPU simultaneously performs the low level linear algebra routines.
Grace: a Cross-platform Micromagnetic Simulator On Graphics Processing Units
Zhu, Ru
2014-01-01
A micromagnetic simulator running on graphics processing unit (GPU) is presented. It achieves significant performance boost as compared to previous central processing unit (CPU) simulators, up to two orders of magnitude for large input problems. Different from GPU implementations of other research groups, this simulator is developed with C++ Accelerated Massive Parallelism (C++ AMP) and is hardware platform compatible. It runs on GPU from venders include NVidia, AMD and Intel, which paved the way for fast micromagnetic simulation on both high-end workstations with dedicated graphics cards and low-end personal computers with integrated graphics card. A copy of the simulator software is publicly available.
Wu, Q.; Xiong, F.; Wang, F.; Xiong, Y.
2016-10-01
In order to reduce the computational time, a fully parallel implementation of the particle swarm optimization (PSO) algorithm on a graphics processing unit (GPU) is presented. Instead of being executed on the central processing unit (CPU) sequentially, PSO is executed in parallel via the GPU on the compute unified device architecture (CUDA) platform. The processes of fitness evaluation, updating of velocity and position of all particles are all parallelized and introduced in detail. Comparative studies on the optimization of four benchmark functions and a trajectory optimization problem are conducted by running PSO on the GPU (GPU-PSO) and CPU (CPU-PSO). The impact of design dimension, number of particles and size of the thread-block in the GPU and their interactions on the computational time is investigated. The results show that the computational time of the developed GPU-PSO is much shorter than that of CPU-PSO, with comparable accuracy, which demonstrates the remarkable speed-up capability of GPU-PSO.
AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics
Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.
2017-05-01
We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
Portillo, Ricardo; Arunagiri, Sarala; Teller, Patricia J.; Park, Song J.; Nguyen, Lam H.; Deroba, Joseph C.; Shires, Dale
2011-06-01
The continuing miniaturization and parallelization of computer hardware has facilitated the development of mobile and field-deployable systems that can accommodate terascale processing within once prohibitively small size and weight constraints. General-purpose Graphics Processing Units (GPUs) are prominent examples of such terascale devices. Unfortunately, the added computational capability of these devices often comes at the cost of larger demands on power, an already strained resource in these systems. This study explores power versus performance issues for a workload that can take advantage of GPU capability and is targeted to run in field-deployable environments, i.e., Synthetic Aperture Radar (SAR). Specifically, we focus on the Image Formation (IF) computational phase of SAR, often the most compute intensive, and evaluate two different state-of-the-art GPU implementations of this IF method. Using real and simulated data sets, we evaluate performance tradeoffs for single- and double-precision versions of these implementations in terms of time-to-solution, image output quality, and total energy consumption. We employ fine-grain direct-measurement techniques to capture isolated power utilization and energy consumption of the GPU device, and use general and radarspecific metrics to evaluate image output quality. We show that double-precision IF can provide slight image improvement to low-reflective areas of SAR images, but note that the added quality may not be worth the higher power and energy costs associated with higher precision operations.
Development of GPU-Optimized EFIT for DIII-D Equilibrium Reconstructions
Huang, Y.; Lao, L. L.; Xiao, B. J.; Luo, Z. P.; Yue, X. N.
2015-11-01
The development of a parallel, Graphical Processing Unit (GPU)-optimized version of EFIT for DIII-D equilibrium reconstructions is presented. This GPU-optimized version (P-EFIT) is built with the CUDA (Compute Unified Device Architecture) platform to take advantage of massively parallel GPU cores to significantly accelerate the computation under the EFIT framework. The parallel processing is implemented with the Single-Instruction Multiple-Thread (SIMT) architecture. New parallel modules to trace plasma surfaces and compute plasma parameters have been constructed. DIII-D magnetic benchmark tests show that P-EFIT could accurately reproduce the EFIT reconstruction algorithms at a fraction of the computational cost. The acceleration factor continues to increase as the (R, Z) spatial grids are increased from 65 × 65 to 257 × 257 , suggesting there may be rooms for further optimization by further reducing the communication cost. Details of the P-EFIT optimization algorithms will be discussed. Work supported by US DOE DE-FC02-04ER54698, and by China MOST under 2014GB103000, China NNSF 11205191, China CAS GJHZ201303.
Graphics processing unit-based quantitative second-harmonic generation imaging.
Kabir, Mohammad Mahfuzul; Jonayat, A S M; Patel, Sanjay; Toussaint, Kimani C
2014-09-01
We adapt a graphics processing unit (GPU) to dynamic quantitative second-harmonic generation imaging. We demonstrate the temporal advantage of the GPU-based approach by computing the number of frames analyzed per second from SHG image videos showing varying fiber orientations. In comparison to our previously reported CPU-based approach, our GPU-based image analysis results in ∼10× improvement in computational time. This work can be adapted to other quantitative, nonlinear imaging techniques and provides a significant step toward obtaining quantitative information from fast in vivo biological processes.
Charara, Ali
2014-11-01
The European Extremely Large Telescope project (E-ELT) is one of Europe\\'s highest priorities in ground-based astronomy. ELTs are built on top of a variety of highly sensitive and critical astronomical instruments. In particular, a new instrument called MOSAIC has been proposed to perform multi-object spectroscopy using the Multi-Object Adaptive Optics (MOAO) technique. The core implementation of the simulation lies in the intensive computation of a tomographic reconstruct or (TR), which is used to drive the deformable mirror in real time from the measurements. A new numerical algorithm is proposed (1) to capture the actual experimental noise and (2) to substantially speed up previous implementations by exposing more concurrency, while reducing the number of floating-point operations. Based on the Matrices Over Runtime System at Exascale numerical library (MORSE), a dynamic scheduler drives all computational stages of the tomographic reconstruct or simulation and allows to pipeline and to run tasks out-of order across different stages on heterogeneous systems, while ensuring data coherency and dependencies. The proposed TR simulation outperforms asymptotically previous state-of-the-art implementations up to 13-fold speedup. At more than 50000 unknowns, this appears to be the largest-scale AO problem submitted to computation, to date, and opens new research directions for extreme scale AO simulations. © 2014 IEEE.
GPU-based Ray Tracing of Dynamic Scenes
Directory of Open Access Journals (Sweden)
Christopher Lux
2010-08-01
Full Text Available Interactive ray tracing of non-trivial scenes is just becoming feasible on single graphics processing units (GPU. Recent work in this area focuses on building effective acceleration structures, which work well under the constraints of current GPUs. Most approaches are targeted at static scenes and only allow navigation in the virtual scene. So far support for dynamic scenes has not been considered for GPU implementations. We have developed a GPU-based ray tracing system for dynamic scenes consisting of a set of individual objects. Each object may independently move around, but its geometry and topology are static.
GPU Linear algebra extensions for GNU/Octave
Bosi, L. B.; Mariotti, M.; Santocchia, A.
2012-06-01
Octave is one of the most widely used open source tools for numerical analysis and liner algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector- vector, vector matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program GPU kernels, which are bound within the GNU/octave framework. We report the project implementation design and some preliminary results about performance.
Zhmurov, A; Dima, R I; Kholodov, Y; Barsegov, V
2010-11-01
Theoretical exploration of fundamental biological processes involving the forced unraveling of multimeric proteins, the sliding motion in protein fibers and the mechanical deformation of biomolecular assemblies under physiological force loads is challenging even for distributed computing systems. Using a C(α)-based coarse-grained self organized polymer (SOP) model, we implemented the Langevin simulations of proteins on graphics processing units (SOP-GPU program). We assessed the computational performance of an end-to-end application of the program, where all the steps of the algorithm are running on a GPU, by profiling the simulation time and memory usage for a number of test systems. The ∼90-fold computational speedup on a GPU, compared with an optimized central processing unit program, enabled us to follow the dynamics in the centisecond timescale, and to obtain the force-extension profiles using experimental pulling speeds (v(f) = 1-10 μm/s) employed in atomic force microscopy and in optical tweezers-based dynamic force spectroscopy. We found that the mechanical molecular response critically depends on the conditions of force application and that the kinetics and pathways for unfolding change drastically even upon a modest 10-fold increase in v(f). This implies that, to resolve accurately the free energy landscape and to relate the results of single-molecule experiments in vitro and in silico, molecular simulations should be carried out under the experimentally relevant force loads. This can be accomplished in reasonable wall-clock time for biomolecules of size as large as 10(5) residues using the SOP-GPU package.
Radial basis function networks GPU-based implementation.
Brandstetter, Andreas; Artusi, Alessandro
2008-12-01
Neural networks (NNs) have been used in several areas, showing their potential but also their limitations. One of the main limitations is the long time required for the training process; this is not useful in the case of a fast training process being required to respond to changes in the application domain. A possible way to accelerate the learning process of an NN is to implement it in hardware, but due to the high cost and the reduced flexibility of the original central processing unit (CPU) implementation, this solution is often not chosen. Recently, the power of the graphic processing unit (GPU), on the market, has increased and it has started to be used in many applications. In particular, a kind of NN named radial basis function network (RBFN) has been used extensively, proving its power. However, their limiting time performances reduce their application in many areas. In this brief paper, we describe a GPU implementation of the entire learning process of an RBFN showing the ability to reduce the computational cost by about two orders of magnitude with respect to its CPU implementation.
True 4D Image Denoising on the GPU
Eklund, Anders; Andersson, Mats; Knutsson, Hans
2011-01-01
The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose (noisy) computed tomography (CT) data. While 3D image denoising previously has been applied to several volumes independently, there has not been much work done on true 4D image denoising, where the algorithm considers several volumes at the same time. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset of the resolution 512 × 512 × 445 × 20. The result is that the GPU can complete the denoising in about 25 minutes if spatial filtering is used and in about 8 minutes if FFT-based filtering is used. The CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT-based filtering. The short processing time increases the clinical value of true 4D image denoising significantly. PMID:21977020
GPU accelerated numerical simulations of viscoelastic phase separation model.
Yang, Keda; Su, Jiaye; Guo, Hongxia
2012-07-05
We introduce a complete implementation of viscoelastic model for numerical simulations of the phase separation kinetics in dynamic asymmetry systems such as polymer blends and polymer solutions on a graphics processing unit (GPU) by CUDA language and discuss algorithms and optimizations in details. From studies of a polymer solution, we show that the GPU-based implementation can predict correctly the accepted results and provide about 190 times speedup over a single central processing unit (CPU). Further accuracy analysis demonstrates that both the single and the double precision calculations on the GPU are sufficient to produce high-quality results in numerical simulations of viscoelastic model. Therefore, the GPU-based viscoelastic model is very promising for studying many phase separation processes of experimental and theoretical interests that often take place on the large length and time scales and are not easily addressed by a conventional implementation running on a single CPU.
Li, Pengcheng; Liu, Celong; Li, Xianpeng; He, Honghui; Ma, Hui
2016-09-20
In earlier studies, we developed scattering models and the corresponding CPU-based Monte Carlo simulation programs to study the behavior of polarized photons as they propagate through complex biological tissues. Studying the simulation results in high degrees of freedom that created a demand for massive simulation tasks. In this paper, we report a parallel implementation of the simulation program based on the compute unified device architecture running on a graphics processing unit (GPU). Different schemes for sphere-only simulations and sphere-cylinder mixture simulations were developed. Diverse optimizing methods were employed to achieve the best acceleration. The final-version GPU program is hundreds of times faster than the CPU version. Dependence of the performance on input parameters and precision were also studied. It is shown that using single precision in the GPU simulations results in very limited losses in accuracy. Consumer-level graphics cards, even those in laptop computers, are more cost-effective than scientific graphics cards for single-precision computation.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Directory of Open Access Journals (Sweden)
Jinwei Wang
2014-01-01
Full Text Available The active appearance model (AAM is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA on the Nvidia’s GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Local alignment tool based on Hadoop framework and GPU architecture.
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data the computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with the big biology data, it is hard to rely on single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results present that the proposed method can improve the performance of BLASTP on single GPU, and also it can achieve high availability and fault tolerance.
Local Alignment Tool Based on Hadoop Framework and GPU Architecture
Directory of Open Access Journals (Sweden)
Che-Lun Hung
2014-01-01
Full Text Available With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data the computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with the big biology data, it is hard to rely on single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results present that the proposed method can improve the performance of BLASTP on single GPU, and also it can achieve high availability and fault tolerance.
Parallelization and checkpointing of GPU applications through program transformation
Energy Technology Data Exchange (ETDEWEB)
Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi
2015-12-01
We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system of reference light. To facilitate real-time time-division color electroholography, we developed a light emitting diode (LED) controller with a universal serial bus (USB) module and the drive circuit for reference light. A one-chip RGB LED connected to a personal computer via an LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs) suitable for red, green, and blue colors in each frame of a three-dimensional (3D) movie. After CGH calculation using a single GPU, the CPU can synchronize the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we succeeded in real-time time-division color electroholography for a 3D object consisting of around 1000 points per color when an NVIDIA GeForce GTX TITAN was used as the GPU. Furthermore, we implemented the proposed method in various GPUs. The experimental results showed that the proposed method was effective for various GPUs.
Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S
2014-03-11
Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
求解矩阵特征值的GPU实现%GPU Implementation for Solving Eigenvalues of a Matrix
Institute of Scientific and Technical Information of China (English)
夏健明; 魏德敏
2008-01-01
A GPU (graphics processing unit) implementation was presented for solving eigenvalues of a matrix. The power method and the QR method based on the GPU are used to solve the largest eigenvalue and all eigenvalues of a given matrix. The computations are compared with those by the CPU, and it is found that the computation accuracy is good, and the running time on the GPU is faster than that on the CPU by a factor of 2.7～7.6.%提出了求解矩阵特征值的GPU(图形处理器)实现方法,分别用基于GPU的幂法和QR法求解矩阵的最大特征值和所有特征值.基于GPU的计算与基于CPU的计算相比较,证实其计算精度较好,运算时间比基于CPU的运算时间快2.7～7.6倍.
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-01
Monte Carlo (MC) particle transport simulation on a graphics-processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry. This limited application scope of these packages. The purpose of this paper is to develop a module to model parametric geometry and integrate it in GPU-based MC simulations. In our module, each continuous region was defined by its bounding surfaces that were parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated to two previously developed GPU-based MC packages and was tested in two example problems: (1) low energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry. The averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data. When the data was stored in GPU’s shared memory, the highest computational speed was achieved. Incorporation of parameterized geometry yielded a computation time that was ~3 times of that in the corresponding voxelized geometry. We also developed a strategy to use an auxiliary index array to reduce frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time ranged in 1.75-2.03 times of the voxelized geometry for coupled photon/electron transport depending on the voxel dimension of the auxiliary index array, and in 0
GPU-based ultra fast IMRT plan optimization
Men, Chunhua; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B
2009-01-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real-time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purpose. A prostate IMRT case with various beamlet and voxel sizes was used to evalu...
Explicit Integration with GPU Acceleration for Large Kinetic Networks
Brock, Benjamin; Billings, Jay Jay; Guidry, Mike
2014-01-01
We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in compute time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractible, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
GPU-accelerated elastic 3D image registration for intra-surgical applications.
Ruijters, Daniel; ter Haar Romeny, Bart M; Suetens, Paul
2011-08-01
Local motion within intra-patient biomedical images can be compensated by using elastic image registration. The application of B-spline based elastic registration during interventional treatment is seriously hampered by its considerable computation time. The graphics processing unit (GPU) can be used to accelerate the calculation of such elastic registrations by using its parallel processing power, and by employing the hardwired tri-linear interpolation capabilities in order to efficiently perform the cubic B-spline evaluation. In this article it is shown that the similarity measure and its derivatives also can be calculated on the GPU, using a two pass approach. On average a speedup factor 50 compared to a straight-forward CPU implementation was reached.
A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics
Bard, Christopher M.; Dorelli, John C.
2014-02-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 10242 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
Rank k Cholesky Up/Down-dating on the GPU: gpucholmodV0.2
Walder, Christian
2010-01-01
In this note we briefly describe our Cholesky modification algorithm for streaming multiprocessor architectures. Our implementation is available in C++ with Matlab binding, using CUDA to utilise the graphics processing unit (GPU). Limited speed ups are possible due to the bandwidth bound nature of the problem. Furthermore, a complex dependency pattern must be obeyed, requiring multiple kernels to be launched. Nonetheless, this makes for an interesting problem, and our approach can reduce the computation time by a factor of around 7 for matrices of size 5000 by 5000 and k=16, in comparison with the LINPACK suite running on a CPU of comparable vintage. Much larger problems can be handled however due to the O(n) scaling in required GPU memory of our method.
GPU-based acceleration of an automatic white matter segmentation algorithm using CUDA.
Labra, Nicole; Figueroa, Miguel; Guevara, Pamela; Duclap, Delphine; Hoeunou, Josselin; Poupon, Cyril; Mangin, Jean-Francois
2013-01-01
This paper presents a parallel implementation of an algorithm for automatic segmentation of white matter fibers from tractography data. We execute the algorithm in parallel using a high-end video card with a Graphics Processing Unit (GPU) as a computation accelerator, using the CUDA language. By exploiting the parallelism and the properties of the memory hierarchy available on the GPU, we obtain a speedup in execution time of 33.6 with respect to an optimized sequential version of the algorithm written in C, and of 240 with respect to the original Python/C++ implementation. The execution time is reduced from more than two hours to only 35 seconds for a subject dataset of 800,000 fibers, thus enabling applications that use interactive segmentation and visualization of small to medium-sized tractography datasets.
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics
Bard, Christopher; Dorelli, John C.
2013-01-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approx. = 126 for a sq 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
Single-Pass GPU-Raycasting for Structured Adaptive Mesh Refinement Data
Kaehler, Ralf
2012-01-01
Structured Adaptive Mesh Refinement (SAMR) is a popular numerical technique to study processes with high spatial and temporal dynamic range. It reduces computational requirements by adapting the lattice on which the underlying differential equations are solved to most efficiently represent the solution. Particularly in astrophysics and cosmology such simulations now can capture spatial scales ten orders of magnitude apart and more. The irregular locations and extensions of the refined regions in the SAMR scheme and the fact that different resolution levels partially overlap, poses a challenge for GPU-based direct volume rendering methods. kD-trees have proven to be advantageous to subdivide the data domain into non-overlapping blocks of equally sized cells, optimal for the texture units of current graphics hardware, but previous GPU-supported raycasting approaches for SAMR data using this data structure required a separate rendering pass for each node, preventing the application of many advanced lighting sche...
Graphics processing units accelerated semiclassical initial value representation molecular dynamics
Energy Technology Data Exchange (ETDEWEB)
Tamascelli, Dario; Dambrosio, Francesco Saverio [Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, 20133 Milano (Italy); Conte, Riccardo [Department of Chemistry and Cherry L. Emerson Center for Scientific Computation, Emory University, Atlanta, Georgia 30322 (United States); Ceotto, Michele, E-mail: michele.ceotto@unimi.it [Dipartimento di Chimica, Università degli Studi di Milano, via Golgi 19, 20133 Milano (Italy)
2014-05-07
This paper presents a Graphics Processing Units (GPUs) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the GPU implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (NVIDIA Tesla C2075 and Kepler K20), respectively, versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W) and the critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant and semiclassical GPU calculations are shown to be environment friendly.
GPU-based Parallel Application Design for Emerging Mobile Devices
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as
GPU-based parallel clustered differential pulse code modulation
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, resolution of hyperspectral image has got a huge boost. Thus data size of hyperspectral image is becoming larger. In order to reduce their saving and transmission cost, lossless compression for hyperspectral image has become an important research topic. In recent years, large numbers of algorithms have been proposed to reduce the redundancy between different spectra. Among of them, the most classical and expansible algorithm is the Clustered Differential Pulse Code Modulation (CDPCM) algorithm. This algorithm contains three parts: first clusters all spectral lines, then trains linear predictors for each band. Secondly, use these predictors to predict pixels, and get the residual image by subtraction between original image and predicted image. Finally, encode the residual image. However, the process of calculating predictors is timecosting. In order to improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) with GPU. Recently, general-purpose computing based on GPUs has been greatly developed. The capacity of GPU improves rapidly by increasing the number of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA. It gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to achieve the calculation of predictors in parallel. By respectively adopting global memory, shared memory and register memory, we finally get a decent speedup.
Interactive physically-based X-ray simulation: CPU or GPU?
Vidal, Franck P; John, Nigel W; Guillemot, Romain M
2007-01-01
Interventional Radiology (IR) procedures are minimally invasive, targeted treatments performed using imaging for guidance. Needle puncture using ultrasound, x-ray, or computed tomography (CT) images is a core task in the radiology curriculum, and we are currently developing a training simulator for this. One requirement is to include support for physically-based simulation of x-ray images from CT data sets. In this paper, we demonstrate how to exploit the capability of today's graphics cards to efficiently achieve this on the Graphics Processing Unit (GPU) and compare performance with an efficient software only implementation using the Central Processing Unit (CPU).
Nagaoka, Tomoaki; Watanabe, Soichi
2012-01-01
Electromagnetic simulation with anatomically realistic computational human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes. The seven GPU boards (NVIDIA Tesla C2070) are mounted on each node. We examined the performance of the FDTD calculation on multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU clusters is faster than that on a multi-GPU (a single workstation), and we also found that the GPU cluster system calculate faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform the large-scale FDTD calculation because were able to use GPU memory of over 100 GB.
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF
Mielikainen, J.; Huang, B.; Huang, A.
2011-12-01
The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size fall speeds, distributions and densities. Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language offering programming GPU's directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of CPU. CUDA takes a bottom-up point of view of parallelism is which thread is an atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has
Pizette, Patrick; Govender, Nicolin; Wilke, Daniel N.; Abriak, Nor-Edine
2017-06-01
The use of the Discrete Element Method (DEM) for industrial civil engineering industrial applications is currently limited due to the computational demands when large numbers of particles are considered. The graphics processing unit (GPU) with its highly parallelized hardware architecture shows potential to enable solution of civil engineering problems using discrete granular approaches. We demonstrate in this study the pratical utility of a validated GPU-enabled DEM modeling environment to simulate industrial scale granular problems. As illustration, the flow discharge of storage silos using 8 and 17 million particles is considered. DEM simulations have been performed to investigate the influence of particle size (equivalent size for the 20/40-mesh gravel) and induced shear stress for two hopper shapes. The preliminary results indicate that the shape of the hopper significantly influences the discharge rates for the same material. Specifically, this work shows that GPU-enabled DEM modeling environments can model industrial scale problems on a single portable computer within a day for 30 seconds of process time.
An implementation of the direct-forcing immersed boundary method using GPU power
Directory of Open Access Journals (Sweden)
Bulent Tutkun
2017-01-01
Full Text Available A graphics processing unit (GPU is utilized to apply the direct-forcing immersed boundary method. The code running on the GPU is generated with the help of the Compute Unified Device Architecture (CUDA. The first and second spatial derivatives of the incompressible Navier-Stokes equations are discretized by the sixth-order central compact finite-difference schemes. Two flow fields are simulated. The first test case is the simulated flow around a square cylinder, with the results providing good estimations of the wake region mechanics and vortex shedding. The second test case is the simulated flow around a circular cylinder. This case was selected to better understand the effects of sharp corners on the force coefficients. It was observed that the estimation of the force coefficients did not result in any troubles in the case of a circular cylinder. Additionally, the performance values obtained for the calculation time for the solution of the Poisson equation are compared with the values for other CPUs and GPUs from the literature. Consequently, approximately 3× and 20× speedups are achieved in comparison with GPU (using CUSP library and CPU, respectively. CUSP is an open-source library for sparse linear algebra and graph computations on CUDA.
GPU-Based 3D Cone-Beam CT Image Reconstruction for Large Data Volume
Directory of Open Access Journals (Sweden)
Xing Zhao
2009-01-01
Full Text Available Currently, 3D cone-beam CT image reconstruction speed is still a severe limitation for clinical application. The computational power of modern graphics processing units (GPUs has been harnessed to provide impressive acceleration of 3D volume image reconstruction. For extra large data volume exceeding the physical graphic memory of GPU, a straightforward compromise is to divide data volume into blocks. Different from the conventional Octree partition method, a new partition scheme is proposed in this paper. This method divides both projection data and reconstructed image volume into subsets according to geometric symmetries in circular cone-beam projection layout, and a fast reconstruction for large data volume can be implemented by packing the subsets of projection data into the RGBA channels of GPU, performing the reconstruction chunk by chunk and combining the individual results in the end. The method is evaluated by reconstructing 3D images from computer-simulation data and real micro-CT data. Our results indicate that the GPU implementation can maintain original precision and speed up the reconstruction process by 110–120 times for circular cone-beam scan, as compared to traditional CPU implementation.
Large Data Visualization on Distributed Memory Mulit-GPU Clusters
Energy Technology Data Exchange (ETDEWEB)
Childs, Henry R.
2010-03-01
Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualization on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory, and more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
Scalable multi-GPU implementation of the MAGFLOW simulator
Directory of Open Access Journals (Sweden)
Giovanni Gallo
2011-12-01
Full Text Available We have developed a robust and scalable multi-GPU (Graphics Processing Unit version of the cellular-automaton-based MAGFLOW lava simulator. The cellular automaton is partitioned into strips that are assigned to different GPUs, with minimal overlapping. For each GPU, a host thread is launched to manage allocation, deallocation, data transfer and kernel launches; the main host thread coordinates all of the GPUs, to ensure temporal coherence and data integrity. The overlapping borders and maximum temporal step need to be exchanged among the GPUs at the beginning of every evolution of the cellular automaton; data transfers are asynchronous with respect to the computations, to cover the introduced overhead. It is not required to have GPUs of the same speed or capacity; the system runs flawlessly on homogeneous and heterogeneous hardware. The speed-up factor differs from that which is ideal (#GPUs× only for a constant overhead loss of about 4E−2 · T · #GPUs, with T as the total simulation time.
Design and Implementation of Interface Circuit Between CPU and GPU%CPU 与 GPU 之间接口电路的设计与实现
Institute of Scientific and Technical Information of China (English)
石茉莉; 蒋林; 刘有耀
2013-01-01
During constructing the Collaborative Computing between Central Process Unit and Graphic Process Unit or Central Process Unit and other device .Through the Peripheral Component Interconnect connects Graphic Process Unit and Central Process Unit ,it’ s responsiable for doing the parrallel computing . Point at the asyncronous transmission and timing matched in the connection of Peripheral Component Interconnect IP core and the Graphic Process Unit ,based on the standard of the Peripheral Component Interconnect and the Graphic Process Unit chip , use the method of processing asyncronous signals ,this paper design an timing matched interface circuit between the Central Process Unit and Graphic Process Unit ,which aims at the different clock systems and the timing matced of them .The simulation results prove that the interface circute between Central Process Unit and Graphic Process Unit can works at 252 M Hz frequency ,it achieves the circuit demand ,and realize high-speed data transmission between Graphic Process Unit and Central Process Unit .%在构建CPU（Central Process Unit ，CPU）与GPU（Graphic Process Unit）或者CPU与其它设备协同计算的过程中，通过PCI（Peripheral Component Interconnect）总线将GPU等其他设备连接至CPU ，承担并行计算的任务。为了解决PCI接口芯片与GPU芯片之间的异步传输和时序匹配问题，基于 PCI总线规范与GPU 芯片的时序规范，采用跨时钟域信号的处理方法，设计了一个CPU与GPU 之间跨时钟域连接的时序匹配接口电路。通过仿真，验证了该电路的正确性。结果表明，该电路可工作在252 M Hz频率下，能够满足GPU 与CPU 间接口电路对速率和带宽的要求。
Connectivity-Based Segmentation for GPU-Accelerated Mesh Decompression
Institute of Scientific and Technical Information of China (English)
Jie-Yi Zhao; Min Tang; Ruo-Feng Tong
2012-01-01
We present a novel algorithm to partition large 3D meshes for GPU-accelerated decompression.Our formulation focuses on minimizing the replicated vertices between patches,and balancing the numbers of faces of patches for efficient parallel computing.First we generate a topology model of the original mesh and remove vertex positions.Then we assign the centers of patches using geodesic farthest point sampling and cluster the faces according to the geodesic distance to the centers.After the segmentation we swap boundary faces to fix jagged boundaries and store the boundary vertices for whole-mesh preservation.The decompression of each patch runs on a thread of GPU,and we evaluate its performance on various large benchmarks.In practice,the GPU-based decompression algorithm runs more than 48x faster on NVIDIA GeForce GTX 580 GPU compared with that on the CPU using single core.
Acceleration of a QM/MM-QMC simulation using GPU.
Uejima, Yutaka; Terashima, Tomoharu; Maezono, Ryo
2011-07-30
We accelerated an ab initio molecular QMC calculation by using GPGPU. Only the bottle-neck part of the calculation is replaced by CUDA subroutine and performed on GPU. The performance on a (single core CPU + GPU) is compared with that on a (single core CPU with double precision), getting 23.6 (11.0) times faster calculations in single (double) precision treatments on GPU. The energy deviation caused by the single precision treatment was found to be within the accuracy required in the calculation, ∼10(-5) hartree. The accelerated computational nodes mounting GPU are combined to form a hybrid MPI cluster on which we confirmed the performance linearly scales to the number of nodes.
Real-Time Incompressible Fluid Simulation on the GPU
Directory of Open Access Journals (Sweden)
Xiao Nie
2015-01-01
Full Text Available We present a parallel framework for simulating incompressible fluids with predictive-corrective incompressible smoothed particle hydrodynamics (PCISPH on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of state-of-the-art GPUs. In PCISPH-based simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this time-consuming step is introduced. Moreover, we discuss several optimization techniques including using fast on-chip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of real-time fluid simulation is significantly improved since our method enforces incompressibility constraint which is typically ignored due to efficiency reason in previous GPU-based SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluid in real time and results in a speed-up factor of up to 23 on a high-end NVIDIA GPU in comparison to single-threaded CPU-based implementation.
Institute of Scientific and Technical Information of China (English)
陈庆奎; 王海峰; 那丽春; 霍欢; 郝聚涛; 刘伯成
2012-01-01
从2004年开始,图形处理器GPU的通用计算成为一个新研究热点,此后GPGPU( General-Purpose Graphics Processing Unit)在最近几年中取得长足发展.从介绍GPGPU硬件体系结构的改变和软件技术的发展开始,阐述GPGPU主要应用领域中的研究成果及最新发展.针对各种应用领域中计算数据大规模增加的趋势,出现单个GPU计算节点无法克服的硬件限制问题,为解决该问题出现多GPU计算和GPU集群的解决方案.详细地讨论通用计算GPU集群的研究进展和应用技术,包括GPU集群硬件异构性的问题和软件框架的三个研究趋势,对几种典型的软件框架Glift、Zippy、CUDASA的特性和缺点进行较详细的分析.最后,总结GPU通用计算研究发展中存在的问题和未来的挑战.%The general purpose computation of graphic processing unit became a new research field since 2004. GPGPU has been developing rapidly in recent years at a high speed. Starting from an introduction to the development of the architecture of GPU for general-purpose computation and software technology, the study and development of GPU for general-purpose computation are introduced. Aiming at the large scale data of various application fields, GPU cluster is proposed to overcome the limitation of single GPU. So the development and application technologies of GPGPU cluster are discussed and include the issue of heterogeneous cluster and the trend of software for GPU cluster. Several frameworks for GPU cluster are analyzed in detailed, such as Glift, Zippy, and CUDASA. Finally, the unsolved problems and the new challenge in this subject are proposed.
A real-time autostereoscopic display method based on partial sub-pixel by general GPU processing
Chen, Duo; Sang, Xinzhu; Cai, Yuanfa
2013-08-01
With the progress of 3D technology, the huge computing capacity for the real-time autostereoscopic display is required. Because of complicated sub-pixel allocating, masks providing arranged sub-pixels are fabricated to reduce real-time computation. However, the binary mask has inherent drawbacks. In order to solve these problems, weighted masks are used in displaying based on partial sub-pixel. Nevertheless, the corresponding computations will be tremendously growing and unbearable for CPU. To improve calculating speed, Graphics Processing Unit (GPU) processing with parallel computing ability is adopted. Here the principle of partial sub-pixel is presented, and the texture array of Direct3D 10 is used to increase the number of computable textures. When dealing with a HD display and multi-viewpoints, a low level GPU is still able to permit a fluent real time displaying, while the performance of high level CPU is really not acceptable. Meanwhile, after using texture array, the performance of D3D10 could be double, and sometimes be triple faster than D3D9. There are several distinguishing features for the proposed method, such as the good portability, less overhead and good stability. The GPU display system could also be used for the future Ultra HD autostereoscopic display.
Ha, Sanghyun; You, Donghyun
2015-11-01
Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of both incompressible and compressible Navier-Stokes equations. A semi-implicit ADI finite-volume method for integration of the incompressible and compressible Navier-Stokes equations, which are discretized on a structured arbitrary grid, is parallelized for GPU computations using CUDA (Compute Unified Device Architecture). In the semi-implicit ADI finite-volume method, the nonlinear convection terms and the linear diffusion terms are integrated in time using a combination of an explicit scheme and an ADI scheme. Inversion of multiple tri-diagonal matrices is found to be the major challenge in GPU computations of the present method. Some of the algorithms for solving tri-diagonal matrices on GPUs are evaluated and optimized for GPU-acceleration of the present semi-implicit ADI computations of incompressible and compressible Navier-Stokes equations. Supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning Grant NRF-2014R1A2A1A11049599.
Hou, Zhenlong; Huang, Danian
2017-09-01
In this paper, we make a study on the inversion of probability tomography (IPT) with gravity gradiometry data at first. The space resolution of the results is improved by multi-tensor joint inversion, depth weighting matrix and the other methods. Aiming at solving the problems brought by the big data in the exploration, we present the parallel algorithm and the performance analysis combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) based on Graphics Processing Unit (GPU) accelerating. In the test of the synthetic model and real data from Vinton Dome, we get the improved results. It is also proved that the improved inversion algorithm is effective and feasible. The performance of parallel algorithm we designed is better than the other ones with CUDA. The maximum speedup could be more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger scale of data and the new analysis method is practical.
Solving global optimization problems on GPU cluster
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-01
The paper contains the results of investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by means of reducing to data-independent subproblems with smaller dimension solved in parallel. The new element implemented in the research consists in using several graphic accelerators at different computing nodes. The paper also includes results of solving problems of well-known multiextremal test class GKLS on Lobachevsky supercomputer using tens of thousands of GPU cores.
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun
2015-09-01
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun
2015-10-07
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
一种适应GPU的混合OLAP查询处理模型%GPU Adaptive Hybrid OLAP Query Processing Model
Institute of Scientific and Technical Information of China (English)
张宇; 张延松; 陈红; 王珊
2016-01-01
The general purpose graphic computing units (GPGPUs) have become the new platform for high performance computing due to their massive parallel computing power, and in recent years more and more high performance database research has placed focus on GPU database development. However, today's GPU database researches commonly inherit ROLAP (relational OLAP) model, and mainly address how to realize relational operators in GPU platform and performance tuning, especially on GPU oriented parallel hash join algorithm. GPUs have higher parallel computing power than CPUs but less logical control and management capacity for complex data structure, therefore they are not adaptive for directly migrating the in-memory database query processing algorithms based on complex data structure and memory management. This paper proposes a GPU vectorized processing oriented hybrid OLAP model, semi-MOLAP, which combines direct array access and array computing of MOLAP with storage efficiency of ROLAP. The pure array oriented GPU semi-MOLAP model simplifies GPU data management, reduces complexity of GPU semi-MOLAP algorithms and improves their code efficiency. Meanwhile, the semi-MOLAP operators are divided into co-computing operators on CPU and GPU platforms to improve utilization of both CPUs and GPUs for higher query processing performance.%通用GPU因其强大的并行计算能力成为新兴的高性能计算平台,并逐渐成为近年来学术界在高性能数据库实现技术领域的研究热点.但当前GPU数据库领域的研究沿袭的是ROLAP(relational OLAP)多维分析模型,研究主要集中在关系操作符在GPU平台上的算法实现和性能优化技术,以哈希连接的GPU并行算法研究为中心.GPU拥有数千个并行计算单元,但其逻辑控制单元较少,相对于CPU具有更强的并行计算能力,但逻辑控制和复杂内存管理能力较弱,因此并不适合需要复杂数据结构和复杂内存管理机制的内存数据库查询处理算
ACCELERATING CALCULATION OF CHOLESKY FACTORISATION OF MATRIX WITH GPU%使用 GPU 加速计算矩阵的 Cholesky 分解
Institute of Scientific and Technical Information of China (English)
沈聪; 高火涛
2016-01-01
A concrete implementation of Cholesky factorisation on graphic processing unit (GPU)for large real symmetric positive definite matrix is described in this article.We analyse the hybrid parallel algorithm presented by Volkov for computing the Cholesky factorisation in de-tail.On that basis,and according to the computational performances of CPU and GPU on our own computers,we present a more reasonable hy-brid three-phase scheduling strategy,which further reduces the idle time of CPU and avoids the occurrence of GPU in idle status.Numerical experiment shows that the new hybrid scheduling algorithm achieves a speedup of more than 5 times compared with the standard MKL algo-rithm when the order of a matrix is larger than 7000,and it also observably outperforms the performance of original Volkov’s hybrid algorithm.%针对大型实对称正定矩阵的 Cholesky 分解问题，给出其在图形处理器（GPU）上的具体实现。详细分析了 Volkov 计算Cholesky 分解的混合并行算法，并在此基础上依据自身计算机的 CPU 以及 GPU 的计算性能，给出一种更为合理的三阶段混合调度方案，进一步减少 CPU 的空闲时间以及避免 GPU 空闲情况的出现。数值实验表明，当矩阵阶数超过7000时，新的混合调度算法相比标准的 MKL 算法获得了超过5倍的加速比，同时对比原 Volkov 混合算法获得了显著的性能提升。
Graphics Processing Units for HEP trigger systems
Ammendola, R.; Bauce, M.; Biagioni, A.; Chiozzi, S.; Cotta Ramusino, A.; Fantechi, R.; Fiorini, M.; Giagu, S.; Gianoli, A.; Lamanna, G.; Lonardo, A.; Messina, A.; Neri, I.; Paolucci, P. S.; Piandani, R.; Pontisso, L.; Rescigno, M.; Simula, F.; Sozzi, M.; Vicini, P.
2016-07-01
General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerator in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPU for synchronous low level trigger, focusing on CERN NA62 experiment trigger system. The use of GPU in higher level trigger system is also briefly considered.
Graphics Processing Units for HEP trigger systems
Energy Technology Data Exchange (ETDEWEB)
Ammendola, R. [INFN Sezione di Roma “Tor Vergata”, Via della Ricerca Scientifica 1, 00133 Roma (Italy); Bauce, M. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Biagioni, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Chiozzi, S.; Cotta Ramusino, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Fantechi, R. [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); CERN, Geneve (Switzerland); Fiorini, M. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Giagu, S. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Gianoli, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Lamanna, G., E-mail: gianluca.lamanna@cern.ch [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); INFN Laboratori Nazionali di Frascati, Via Enrico Fermi 40, 00044 Frascati (Roma) (Italy); Lonardo, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Messina, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); and others
2016-07-11
General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerator in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPU for synchronous low level trigger, focusing on CERN NA62 experiment trigger system. The use of GPU in higher level trigger system is also briefly considered.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU
Directory of Open Access Journals (Sweden)
Yong Xia
2015-01-01
Full Text Available Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation and the other is the diffusion term of the monodomain model (partial differential equation. Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.
GPU phase-field lattice Boltzmann simulations of growth and motion of a binary alloy dendrite
Takaki, T.; Rojas, R.; Ohno, M.; Shimokawabe, T.; Aoki, T.
2015-06-01
A GPU code has been developed for a phase-field lattice Boltzmann (PFLB) method, which can simulate the dendritic growth with motion of solids in a dilute binary alloy melt. The GPU accelerated PFLB method has been implemented using CUDA C. The equiaxed dendritic growth in a shear flow and settling condition have been simulated by the developed GPU code. It has been confirmed that the PFLB simulations were efficiently accelerated by introducing the GPU computation. The characteristic dendrite morphologies which depend on the melt flow and the motion of the dendrite could also be confirmed by the simulations.
Air pollution modelling using a graphics processing unit with CUDA
Molnar, Ferenc; Meszaros, Robert; Lagzi, Istvan; 10.1016/j.cpc.2009.09.008
2010-01-01
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In the past years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA) - a parallel computing architecture - has been developed by NVIDIA to utilize this performance in general purpose computations. Here we show for the first time a possible application of GPU for environmental studies serving as a basement for decision making strategies. A stochastic Lagrangian particle model has been developed on CUDA to estimate the transport and the transformation of the radionuclides from a single point source during an accidental release. Our results show that parallel implementation achieves typical acceleration values in the order of 80-120 times compared to CPU using a single-threaded implementation on a 2.33 GHz desktop computer. Only very small differences have been found between the results obtained from GPU and CPU simulations, which are comparable with the effect of stochastic tran...
Accelerating Computation of the Unit Commitment Problem (Presentation)
Energy Technology Data Exchange (ETDEWEB)
Hummon, M.; Barrows, C.; Jones, W.
2013-10-01
Production cost models (PCMs) simulate power system operation at hourly (or higher) resolution. While computation times often extend into multiple days, the sequential nature of PCM's makes parallelism difficult. We exploit the persistence of unit commitment decisions to select partition boundaries for simulation horizon decomposition and parallel computation. Partitioned simulations are benchmarked against sequential solutions for optimality and computation time.
Significantly reducing registration time in IGRT using graphics processing units
DEFF Research Database (Denmark)
Noe, Karsten Østergaard; Denis de Senneville, Baudouin; Tanderup, Kari
2008-01-01
Purpose/Objective For online IGRT, rapid image processing is needed. Fast parallel computations using graphics processing units (GPUs) have recently been made more accessible through general purpose programming interfaces. We present a GPU implementation of the Horn and Schunck method...... respiration phases in a free breathing volunteer and 41 anatomical landmark points in each image series. The registration method used is a multi-resolution GPU implementation of the 3D Horn and Schunck algorithm. It is based on the CUDA framework from Nvidia. Results On an Intel Core 2 CPU at 2.4GHz each...... registration took 30 minutes. On an Nvidia Geforce 8800GTX GPU in the same machine this registration took 37 seconds, making the GPU version 48.7 times faster. The nine image series of different respiration phases were registered to the same reference image (full inhale). Accuracy was evaluated on landmark...
Hammitzsch, M.; Spazier, J.; Reißland, S.
2014-12-01
Usually, tsunami early warning and mitigation systems (TWS or TEWS) are based on several software components deployed in a client-server based infrastructure. The vast majority of systems importantly include desktop-based clients with a graphical user interface (GUI) for the operators in early warning centers. However, in times of cloud computing and ubiquitous computing the use of concepts and paradigms, introduced by continuously evolving approaches in information and communications technology (ICT), have to be considered even for early warning systems (EWS). Based on the experiences and the knowledge gained in three research projects - 'German Indonesian Tsunami Early Warning System' (GITEWS), 'Distant Early Warning System' (DEWS), and 'Collaborative, Complex, and Critical Decision-Support in Evolving Crises' (TRIDEC) - new technologies are exploited to implement a cloud-based and web-based prototype to open up new prospects for EWS. This prototype, named 'TRIDEC Cloud', merges several complementary external and in-house cloud-based services into one platform for automated background computation with graphics processing units (GPU), for web-mapping of hazard specific geospatial data, and for serving relevant functionality to handle, share, and communicate threat specific information in a collaborative and distributed environment. The prototype in its current version addresses tsunami early warning and mitigation. The integration of GPU accelerated tsunami simulation computations have been an integral part of this prototype to foster early warning with on-demand tsunami predictions based on actual source parameters. However, the platform is meant for researchers around the world to make use of the cloud-based GPU computation to analyze other types of geohazards and natural hazards and react upon the computed situation picture with a web-based GUI in a web browser at remote sites. The current website is an early alpha version for demonstration purposes to give the
Kostopoulos, Spiros; Sidiropoulos, Konstantinos; Glotsos, Dimitris; Athanasiadis, Emmanouil; Boutsikou, Konstantina; Lavdas, Eleftherios; Oikonomou, Georgia; Fezoulidis, Ioannis V; Vlychou, Marianna; Hantes, Michael; Cavouras, Dionisis
2013-06-01
The aim was to design a pattern-recognition (PR) system for discriminating between normal and pathological knee articular cartilage of the medial femoral (MFC) and tibial condyles (MTC). The data set comprised segmented regions of interest (ROIs) from coronal and sagittal 3-T magnetic resonance images of the MFC and MTC cartilage of young patients, 28 with abnormality-free knee and 16 with pathological findings. The PR system was designed employing the probabilistic neural network classifier, textural features from the segmented ROIs and the leave-one-out evaluation method, while the PR system's precision to "unseen" data was assessed by employing the external cross-validation method. Optimal system design was accomplished on a consumer graphics processing unit (GPU) using Compute Unified Device Architecture parallel programming. PR system design on the GPU required about 3.5 min against 15 h on a CPU-based system. Highest classification accuracies for the MFC and MTC cartilages were 93.2% and 95.5%, and accuracies to "unseen" data were 89% and 86%, respectively. The proposed PR system is housed in a PC, equipped with a consumer GPU, and it may be easily retrained when new verified data are incorporated in its repository and may be of value as a second-opinion tool in a clinical environment.
A GPU Implementation of Local Search Operators for Symmetric Travelling Salesman Problem
Directory of Open Access Journals (Sweden)
Juraj Fosin
2013-06-01
Full Text Available The Travelling Salesman Problem (TSP is one of the most studied combinatorial optimization problem which is significant in many practical applications in transportation problems. The TSP problem is NP-hard problem and requires large computation power to be solved by the exact algorithms. In the past few years, fast development of general-purpose Graphics Processing Units (GPUs has brought huge improvement in decreasing the applications’ execution time. In this paper, we implement 2-opt and 3-opt local search operators for solving the TSP on the GPU using CUDA. The novelty presented in this paper is a new parallel iterated local search approach with 2-opt and 3-opt operators for symmetric TSP, optimized for the execution on GPUs. With our implementation large TSP problems (up to 85,900 cities can be solved using the GPU. We will show that our GPU implementation can be up to 20x faster without losing quality for all TSPlib problems as well as for our CRO TSP problem.
On-the-fly generation and rendering of infinite cities on the GPU
Steinberger, Markus
2014-05-01
In this paper, we present a new approach for shape-grammar-based generation and rendering of huge cities in real-time on the graphics processing unit (GPU). Traditional approaches rely on evaluating a shape grammar and storing the geometry produced as a preprocessing step. During rendering, the pregenerated data is then streamed to the GPU. By interweaving generation and rendering, we overcome the problems and limitations of streaming pregenerated data. Using our methods of visibility pruning and adaptive level of detail, we are able to dynamically generate only the geometry needed to render the current view in real-time directly on the GPU. We also present a robust and efficient way to dynamically update a scene\\'s derivation tree and geometry, enabling us to exploit frame-to-frame coherence. Our combined generation and rendering is significantly faster than all previous work. For detailed scenes, we are capable of generating geometry more rapidly than even just copying pregenerated data from main memory, enabling us to render cities with thousands of buildings at up to 100 frames per second, even with the camera moving at supersonic speed. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.
High energy electromagnetic particle transportation on the GPU
Canal, P.; Elvira, D.; Jun, S. Y.; Kowalkowski, J.; Paterno, M.; Apostolakis, J.
2014-06-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.
High energy electromagnetic particle transportation on the GPU
Energy Technology Data Exchange (ETDEWEB)
Canal, P. [Fermilab; Elvira, D. [Fermilab; Jun, S. Y. [Fermilab; Kowalkowski, J. [Fermilab; Paterno, M. [Fermilab; Apostolakis, J. [CERN
2014-01-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.
KGP：一种在操作系统内核使用GPU加速IP查找的方案%KGP:A KERNEL-SPACE FRAMEWORK FOR ACCELERATING IP LOOKUP WITH GPU
Institute of Scientific and Technical Information of China (English)
张昕雅; 赵进; 王新; 陈卫
2014-01-01
作为一种协处理，图形处理器 GPU（Graphics Processing Unit）在计算密集型的任务中得到了越来越广泛的应用。但是，由于图形处理器驱动程序并不在操作系统内核中提供 API，因此当操作系统内核需要利用 GPU 加速其工作时，就必须将计算任务转交基于用户态 API（如 CUDA）的用户态进程执行，这显然会增加完成计算任务所需要的额外开销。KGP（Kernel-space GPU Process-ing）是一种新的针对 IP 路由查找的技术方案，它使得操作系统内核可以直接调用 GPU 完成 IP 查找的计算，以避免将计算任务转交于用户态进程带来的开销。实验结果表明，相比用户态方案，KGP 因其较低的额外开销而拥有更好的 GPU 计算性能，同时能提升Linux 内核进行 IP 路由查找的性能。%The graphics processing units (GPUs)has been increasingly popular in computing-intensive tasks as a kind of co-processors. However,since there is no API provided by GPU drives in OS kernel,the computation tasks of OS kernel are forced to be offloaded to user-space processes of user-space APIs such as CUDA to complete while kernel’s work has to be accelerated with GPU.The offloading would clearly increase extra overhead to complete a GPU computing task.KGP is a novel technical scheme targeted at IP lookup,it enables OS kernels to call GPU directly for completing the computation of IP lookup to avoid the overhead incurred from offloading tasks to user-space processes.The experiments also show that KGP has better performance in GPU computation for its lower extra overhead in contrast with user-space scheme,and it dose improve the IP lookup processing speed of Linux kernel.
Energy efficiency of computer power supply units - Final report
Energy Technology Data Exchange (ETDEWEB)
Aebischer, B. [cepe - Centre for Energy Policy and Economics, Swiss Federal Institute of Technology Zuerich, Zuerich (Switzerland); Huser, H. [Encontrol GmbH, Niederrohrdorf (Switzerland)
2002-11-15
This final report for the Swiss Federal Office of Energy (SFOE) takes a look at the efficiency of computer power supply units, which decreases rapidly during average computer use. The background and the purpose of the project are examined. The power supplies for personal computers are discussed and the testing arrangement used is described. Efficiency, power-factor and operating points of the units are examined. Potentials for improvement and measures to be taken are discussed. Also, action to be taken by those involved in the design and operation of such power units is proposed. Finally, recommendations for further work are made.
Energy Technology Data Exchange (ETDEWEB)
Su, Lin; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George, E-mail: xug2@rpi.edu [Nuclear Engineering Program, Rensselaer Polytechnic Institute, Troy, New York 12180 (United States); Yang, Youming; Bednarz, Bryan [Medical Physics, University of Wisconsin, Madison, Wisconsin 53706 (United States); Sterpin, Edmond [Molecular Imaging, Radiotherapy and Oncology, Université catholique de Louvain, Brussels, Belgium 1348 (Belgium)
2014-07-15
Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHER{sub RT} is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER{sub RT}. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER{sub RT} and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHER{sub RT} agree well with DOSXYZnrc. For clinical cases, results from ARCHER{sub RT} are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to
A massively parallel GPU-accelerated model for analysis of fully nonlinear free surface waves
DEFF Research Database (Denmark)
Engsig-Karup, Allan Peter; Madsen, Morten G.; Glimberg, Stefan Lemvig
2011-01-01
-throughput co-processors to the CPU. We describe and demonstrate how this approach makes it possible to do fast desktop computations for large nonlinear wave problems in numerical wave tanks (NWTs) with close to 50/100 million total grid points in double/ single precision with 4 GB global device memory...... space dimensions and is useful for fast analysis and prediction purposes in coastal and offshore engineering. A dedicated numerical model based on the proposed algorithm is executed in parallel by utilizing affordable modern special purpose graphics processing unit (GPU). The model is based on a low...
GPU implementation of the Rosenbluth generation method for static Monte Carlo simulations
Guo, Yachong; Baulin, Vladimir A.
2017-07-01
We present parallel version of Rosenbluth Self-Avoiding Walk generation method implemented on Graphics Processing Units (GPUs) using CUDA libraries. The method scales almost linearly with the number of CUDA cores and the method efficiency has only hardware limitations. The method is introduced in two realizations: on a cubic lattice and in real space. We find a good agreement between serial and parallel implementations and consistent results between lattice and real space realizations of the method for linear chain statistics. The developed GPU implementations of Rosenbluth algorithm can be used in Monte Carlo simulations and other computational methods that require large sampling of molecules conformations.
Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D
2008-08-01
The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG
Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.
2012-12-01
The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement
GPU Accelerated Likelihoods for Stereo-Based Articulated Tracking
DEFF Research Database (Denmark)
Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny
For many years articulated tracking has been an active research topic in the computer vision community. While working solutions have been suggested, computational time is still problematic. We present a GPU implementation of a ray-casting based likelihood model that is orders of magnitude faster...
GPU accelerated likelihoods for stereo-based articulated tracking
DEFF Research Database (Denmark)
Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny
2010-01-01
For many years articulated tracking has been an active research topic in the computer vision community. While working solutions have been suggested, computational time is still problematic. We present a GPU implementation of a ray-casting based likelihood model that is orders of magnitude faster...
Advanced Algebra and Trigonometry: Supplemental Computer Units.
Dotseth, Karen
A set of computer-oriented, supplemental activities is offered which can be used with a course in advanced algebra and trigonometry. The activities involve use of the BASIC programming language; it is assumed that the teacher is familiar with programming in BASIC. Students will learn some BASIC; however, the intent is not to develop proficient…
Institute of Scientific and Technical Information of China (English)
马永军; 袁赢; 李灏
2014-01-01
Moving object recognition algorithm with high-definition video data suffers from large computation complexities and slow speed. With NVIDIA Tesla K20,c GPU,a method of accelerating the template matching target tracking algorithm with the heterogeneous system integrated with CPU and GPU was proposed. The parallel algorithm was designed by three optimizing means:constant memory,the internal memory of SMX and the brief calculation of correlation coefficient. Finally,the program was coded on compute unified device architecture and tested. The results show that the designed algo-rithm can obviously improve the real-time performance of the algorithm and guarantee the recognition effect.%针对大数据量导致模板匹配目标识别算法计算时间长，难以满足快速检测的实际需求问题，在采用最新NVIDIA Tesla GPU构建的CPU＋GPU异构平台上，设计了一种模板匹配目标识别并行算法。通过对模板图像数据常量化、输入图像数据极致流多处理器片上化和简化定位参数计算3方面优化了并行算法，并对算法进行性能测试。实验表明，该算法在保证识别效果的同时实时性明显提高。
Zhu, Jinhan; Chen, Lixin; Chen, Along; Luo, Guangwen; Deng, Xiaowu; Liu, Xiaowei
2015-04-11
To use a graphic processing unit (GPU) calculation engine to implement a fast 3D pre-treatment dosimetric verification procedure based on an electronic portal imaging device (EPID). The GPU algorithm includes the deconvolution and convolution method for the fluence-map calculations, the collapsed-cone convolution/superposition (CCCS) algorithm for the 3D dose calculations and the 3D gamma evaluation calculations. The results of the GPU-based CCCS algorithm were compared to those of Monte Carlo simulations. The planned and EPID-based reconstructed dose distributions in overridden-to-water phantoms and the original patients were compared for 6 MV and 10 MV photon beams in intensity-modulated radiation therapy (IMRT) treatment plans based on dose differences and gamma analysis. The total single-field dose computation time was less than 8 s, and the gamma evaluation for a 0.1-cm grid resolution was completed in approximately 1 s. The results of the GPU-based CCCS algorithm exhibited good agreement with those of the Monte Carlo simulations. The gamma analysis indicated good agreement between the planned and reconstructed dose distributions for the treatment plans. For the target volume, the differences in the mean dose were less than 1.8%, and the differences in the maximum dose were less than 2.5%. For the critical organs, minor differences were observed between the reconstructed and planned doses. The GPU calculation engine was used to boost the speed of 3D dose and gamma evaluation calculations, thus offering the possibility of true real-time 3D dosimetric verification.
A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction
Energy Technology Data Exchange (ETDEWEB)
Zheng, Shawn Q.; Branlund, Eric; Kesthelyi, Bettina; Braunfeld, Michael B.; Cheng, Yifan; Sedat, John W. [The Howard Hughes Medical Institute and the W.M. Keck Advanced Microscopy Laboratory, Department of Biochemistry and Biophysics, University of California, San Francisco, 600, 16th Street, Room S412D, CA 94158-2517 (United States); Agard, David A., E-mail: agard@msg.ucsf.edu [The Howard Hughes Medical Institute and the W.M. Keck Advanced Microscopy Laboratory, Department of Biochemistry and Biophysics, University of California, San Francisco, 600, 16th Street, Room S412D, CA 94158-2517 (United States)
2011-07-15
Full resolution electron microscopic tomographic (EMT) reconstruction of large-scale tilt series requires significant computing power. The desire to perform multiple cycles of iterative reconstruction and realignment dramatically increases the pressing need to improve reconstruction performance. This has motivated us to develop a distributed multi-GPU (graphics processing unit) system to provide the required computing power for rapid constrained, iterative reconstructions of very large three-dimensional (3D) volumes. The participating GPUs reconstruct segments of the volume in parallel, and subsequently, the segments are assembled to form the complete 3D volume. Owing to its power and versatility, the CUDA (NVIDIA, USA) platform was selected for GPU implementation of the EMT reconstruction. For a system containing 10 GPUs provided by 5 GTX295 cards, 10 cycles of SIRT reconstruction for a tomogram of 4096{sup 2}x512 voxels from an input tilt series containing 122 projection images of 4096{sup 2} pixels (single precision float) takes a total of 1845 s of which 1032 s are for computation with the remainder being the system overhead. The same system takes only 39 s total to reconstruct 1024{sup 2}x256 voxels from 122 1024{sup 2} pixel projections. While the system overhead is non-trivial, performance analysis indicates that adding extra GPUs to the system would lead to steadily enhanced overall performance. Therefore, this system can be easily expanded to generate superior computing power for very large tomographic reconstructions and especially to empower iterative cycles of reconstruction and realignment. -- Highlights: {yields} A distributed multi-GPU system has been developed for electron microscopic tomography (EMT). {yields} This system allows for rapid constrained, iterative reconstruction of very large volumes. {yields} This system can be easily expanded to generate superior computing power for large-scale iterative EMT realignment.
Masset, Frédéric
2015-09-01
GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionnalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
STEM image simulation with hybrid CPU/GPU programming.
Yao, Y; Ge, B H; Shen, X; Wang, Y G; Yu, R C
2016-07-01
STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. Copyright © 2016 Elsevier B.V. All rights reserved.
Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark
2014-10-01
Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.
Comparison of a 3-D GPU-Assisted Maxwell Code and Ray Tracing for Reflectometry on ITER
Gady, Sarah; Kubota, Shigeyuki; Johnson, Irena
2015-11-01
Electromagnetic wave propagation and scattering in magnetized plasmas are important diagnostics for high temperature plasmas. 1-D and 2-D full-wave codes are standard tools for measurements of the electron density profile and fluctuations; however, ray tracing results have shown that beam propagation in tokamak plasmas is inherently a 3-D problem. The GPU-Assisted Maxwell Code utilizes the FDTD (Finite-Difference Time-Domain) method for solving the Maxwell equations with the cold plasma approximation in a 3-D geometry. Parallel processing with GPGPU (General-Purpose computing on Graphics Processing Units) is used to accelerate the computation. Previously, we reported on initial comparisons of the code results to 1-D numerical and analytical solutions, where the size of the computational grid was limited by the on-board memory of the GPU. In the current study, this limitation is overcome by using domain decomposition and an additional GPU. As a practical application, this code is used to study the current design of the ITER Low Field Side Reflectometer (LSFR) for the Equatorial Port Plug 11 (EPP11). A detailed examination of Gaussian beam propagation in the ITER edge plasma will be presented, as well as comparisons with ray tracing. This work was made possible by funding from the Department of Energy for the Summer Undergraduate Laboratory Internship (SULI) program. This work is supported by the US DOE Contract No.DE-AC02-09CH11466 and DE-FG02-99-ER54527.
a method of gravity and seismic sequential inversion and its GPU implementation
Liu, G.; Meng, X.
2011-12-01
In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on correlation imaging algorithm; for the seismic inversion, we use the full waveform inversion. The link between the density and velocity is an empirical formula called Gardner equation, for large volumes of data, we use the GPU to accelerate the computation. For the gravity inversion method , we introduce a method based on correlation imaging algorithm,it is also a interative method, first we calculate the correlation imaging of the observed gravity anomaly, it is some value between -1 and +1, then we multiply this value with a little density ,this value become the initial density model. We get a forward reuslt with this initial model and also calculate the correaltion imaging of the misfit of observed data and the forward data, also multiply the correaltion imaging result a little density and add it to the initial model, then do the same procedure above , at last ,we can get a inversion density model. For the seismic inveron method ,we use a mothod base on the linearity of acoustic wave equation written in the frequency domain,with a intial velociy model, we can get a good velocity result. In the sequential inversion of gravity and seismic , we need a link formula to convert between density and velocity ,in our method , we use the Gardner equation. Driven by the insatiable market demand for real time, high-definition 3D images, the programmable NVIDIA Graphic Processing Unit (GPU) as co-processor of CPU has been developed for high performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA designed to overcome the challenge of using traditional general purpose GPU while maintaining a low learn curve for programmers familiar with standard programming languages such as C. In our inversion processing
GPU-based Real-time Triggering in the NA62 Experiment
Ammendola, R; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P.S.; Pastorelli, E.; Piandani, R.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2016-01-01
Over the last few years the GPGPU (General-Purpose computing on Graphics Processing Units) paradigm represented a remarkable development in the world of computing. Computing for High-Energy Physics is no exception: several works have demonstrated the effectiveness of the integration of GPU-based systems in high level trigger of different experiments. On the other hand the use of GPUs in the low level trigger systems, characterized by stringent real-time constraints, such as tight time budget and high throughput, poses several challenges. In this paper we focus on the low level trigger in the CERN NA62 experiment, investigating the use of real-time computing on GPUs in this synchronous system. Our approach aimed at harvesting the GPU computing power to build in real-time refined physics-related trigger primitives for the RICH detector, as the the knowledge of Cerenkov rings parameters allows to build stringent conditions for data selection at trigger level. Latencies of all components of the trigger chain have...
Fast parallel tandem mass spectral library searching using GPU hardware acceleration.
Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B
2011-06-03
Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.
MASSIVELY PARALLEL LATENT SEMANTIC ANALYSES USING A GRAPHICS PROCESSING UNIT
Energy Technology Data Exchange (ETDEWEB)
Cavanagh, J.; Cui, S.
2009-01-01
Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. A graphics processing unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor or central processing unit (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a PC cluster. Due to the GPU’s application-specifi c architecture, harnessing the GPU’s computational prowess for LSA is a great challenge. We presented a parallel LSA implementation on the GPU, using NVIDIA® Compute Unifi ed Device Architecture and Compute Unifi ed Basic Linear Algebra Subprograms software. The performance of this implementation is compared to traditional LSA implementation on a CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1 000x1 000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran fi ve to six times faster than the CPU version. The large variation is due to architectural benefi ts of the GPU for matrices divisible by 16. It should be noted that the overall speeds for the CPU version did not vary from relative normal when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.
Research of Fast 2-D Walsh Transformation Based on GPU%基于GPU的快速二维沃尔什变换研究
Institute of Scientific and Technical Information of China (English)
童莹; 张健
2011-01-01
提出了一种基于GPU(Graphics Processing Unit,图形处理器)CUDA(Compute Unified Device Architecture,计算统一设备架构)平台的快速二维沃尔什变换(Walsh Transform)实现方法.该方法利用GPU的并行结构和硬件特点,从算法实现、存储类型、逻辑构架设置等方面提高了沃尔什变换的运算速度.实验结果表明,随着图像分辨率的增加,沃尔什变换在GPU上运行时间远低于CPU,GPU比CPU具有更明显的加速效果.%Fast 2-D Walsh Transformation algorithm is presented based on NVIDIA's GPU which support Compute Unified Device Architecture(CUDA). On the basis of the parallel architectureand hardware characteristic of GPU,the paper introduces three methods to improve the implementation performance:optimizing algorithm, texture Storage technology,and setting up logic Device Architecture. The experiment result shows that with the increasing of picture resolution,the runtime of 2-D Walsh Transformation based on GPU is far fewer than on the CPU.
Bridging FPGA and GPU technologies for AO real-time control
Perret, Denis; Lainé, Maxime; Bernard, Julien; Gratadour, Damien; Sevin, Arnaud
2016-07-01
Our team has developed a common environment for high performance simulations and real-time control of AO systems based on the use of Graphics Processors Units in the context of the COMPASS project. Such a solution, based on the ability of the real time core in the simulation to provide adequate computing performance, limits the cost of developing AO RTC systems and makes them more scalable. A code developed and validated in the context of the simulation may be injected directly into the system and tested on sky. Furthermore, the use of relatively low cost components also offers significant advantages for the system hardware platform. However, the use of GPUs in an AO loop comes with drawbacks: the traditional way of offloading computation from CPU to GPUs - involving multiple copies and unacceptable overhead in kernel launching - is not well suited in a real time context. This last application requires the implementation of a solution enabling direct memory access (DMA) to the GPU memory from a third party device, bypassing the operating system. This allows this device to communicate directly with the real-time core of the simulation feeding it with the WFS camera pixel stream. We show that DMA between a custom FPGA-based frame-grabber and a computation unit (GPU, FPGA, or Coprocessor such as Xeon-phi) across PCIe allows us to get latencies compatible with what will be needed on ELTs. As a fine-grained synchronization mechanism is not yet made available by GPU vendors, we propose the use of memory polling to avoid interrupts handling and involvement of a CPU. Network and Vision protocols are handled by the FPGA-based Network Interface Card (NIC). We present the results we obtained on a complete AO loop using camera and deformable mirror simulators.
Workload Analysis for Typical GPU Programs Using CUPTI Interface%基于 CUPTI 接口的典型 GPU 程序负载特征分析
Institute of Scientific and Technical Information of China (English)
郑祯; 翟季冬; 李焱; 陈文光
2016-01-01
GPU‐based high performance computers have become an important trend in the area of high performance computing .However ,developing efficient parallel programs on current GPU devices is very complex because of the complex memory hierarchy and thread hierarchy . To address this problem ,we summarize five kinds of key metrics that reflect the performance of programs according to the hardware and software architecture .Then we design and implement a performance analysis tool based on underlying CUPTI interfaces provided by NVIDIA , which can collect key metrics automatically without modifying the source code .The tool can analyze the performance behaviors of GPU programs effectively with very little impact on the execution of programs .Finally ,we analyze 17 programs in Rodinia benchmark , which is a famous benchmark for GPU programs , and a real application using our tool .By analyzing the value of key metrics ,we find the performance bottlenecks of each program and map the bottlenecks back to source code .These analysis results can be used to guide the optimization of CUDA programs and GPU architecture .Result shows that most bottlenecks come from inefficient memory access ,and include unreasonable global memory and shared memory access pattern ,and low concurrency for these programs . We summarize the common reasons for typical performance bottlenecks and give some high‐level suggestions for developing efficient GPU programs .%基于图形处理器（graphics processing unit ，GPU）加速设备的高性能计算机已经成为目前高性能计算领域的一个重要发展趋势。然而，在当前的 GPU 设备上开发高效的并行程序仍然是一件非常复杂的事情。针对这一问题，1）总结了影响 GPU 程序性能的5类关键性能指标；2）采用 NVIDIA 公司提供的 CUPTI 底层接口，设计并实现了一套 GPU 程序性能分析工具集，该工具集可以有效地分析 GPU程序的性能行为；3）
Price, D C; Barsdell, B R; Babich, R; Greenhill, L J
2014-01-01
The magnitude of the real-time digital signal processing challenge attached to large radio astronomical antenna arrays motivates use of high performance computing (HPC) systems. The need for high power efficiency (performance per watt) at remote observatory sites parallels that in HPC broadly, where efficiency is an emerging critical metric. We investigate how the performance per watt of graphics processing units (GPUs) is affected by temperature, core clock frequency and voltage. Our results highlight how the underlying physical processes that govern transistor operation affect power efficiency. In particular, we show experimentally that GPU power consumption grows non-linearly with both temperature and supply voltage, as predicted by physical transistor models. We show lowering GPU supply voltage and increasing clock frequency while maintaining a low die temperature increases the power efficiency of an NVIDIA K20 GPU by up to 37-48% over default settings when running xGPU, a compute-bound code used in radio...
A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions
Liang, Yihao; Xing, Xiangjun; Li, Yaohang
2017-06-01
In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures, and adopts the sequential updating scheme of Metropolis algorithm. It makes no approximation in the computation of energy, and reaches a remarkable 440-fold speedup, compared with the serial implementation on CPU. We further use this method to simulate primitive model electrolytes, and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
A GPU-based Large-scale Monte Carlo Simulation Method for Systems with Long-range Interactions
Liang, Yihao; Li, Yaohang
2016-01-01
In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures. It adopts the sequential updating scheme of Metropolis algorithm, and makes no approximation in the computation of energy. It reaches a remarkable 440-fold speedup, compared with the serial implementation on CPU. We use this method to simulate primitive model electrolytes. We measure very precisely all ion-ion pair correlation functions at high concentrations, and extract renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
GPU-based video motion magnification
DomŻał, Mariusz; Jedrasiak, Karol; Sobel, Dawid; Ryt, Artur; Nawrat, Aleksander
2016-06-01
Video motion magnification (VMM) allows people see otherwise not visible subtle changes in surrounding world. VMM is also capable of hiding them with a modified version of the algorithm. It is possible to magnify motion related to breathing of patients in hospital to observe it or extinguish it and extract other information from stabilized image sequence for example blood flow. In both cases we would like to perform calculations in real time. Unfortunately, the VMM algorithm requires a great amount of computing power. In the article we suggest that VMM algorithm can be parallelized (each thread processes one pixel) and in order to prove that we implemented the algorithm on GPU using CUDA technology. CPU is used only to grab, write, display frame and schedule work for GPU. Each GPU kernel performs spatial decomposition, reconstruction and motion amplification. In this work we presented approach that achieves a significant speedup over existing methods and allow to VMM process video in real-time. This solution can be used as preprocessing for other algorithms in more complex systems or can find application wherever real time motion magnification would be useful. It is worth to mention that the implementation runs on most modern desktops and laptops compatible with CUDA technology.
GPU accelerated study of heat transfer and fluid flow by lattice Boltzmann method on CUDA
Ren, Qinlong
Lattice Boltzmann method (LBM) has been developed as a powerful numerical approach to simulate the complex fluid flow and heat transfer phenomena during the past two decades. As a mesoscale method based on the kinetic theory, LBM has several advantages compared with traditional numerical methods such as physical representation of microscopic interactions, dealing with complex geometries and highly parallel nature. Lattice Boltzmann method has been applied to solve various fluid behaviors and heat transfer process like conjugate heat transfer, magnetic and electric field, diffusion and mixing process, chemical reactions, multiphase flow, phase change process, non-isothermal flow in porous medium, microfluidics, fluid-structure interactions in biological system and so on. In addition, as a non-body-conformal grid method, the immersed boundary method (IBM) could be applied to handle the complex or moving geometries in the domain. The immersed boundary method could be coupled with lattice Boltzmann method to study the heat transfer and fluid flow problems. Heat transfer and fluid flow are solved on Euler nodes by LBM while the complex solid geometries are captured by Lagrangian nodes using immersed boundary method. Parallel computing has been a popular topic for many decades to accelerate the computational speed in engineering and scientific fields. Today, almost all the laptop and desktop have central processing units (CPUs) with multiple cores which could be used for parallel computing. However, the cost of CPUs with hundreds of cores is still high which limits its capability of high performance computing on personal computer. Graphic processing units (GPU) is originally used for the computer video cards have been emerged as the most powerful high-performance workstation in recent years. Unlike the CPUs, the cost of GPU with thousands of cores is cheap. For example, the GPU (GeForce GTX TITAN) which is used in the current work has 2688 cores and the price is only 1
Hsu, Yu-Han H; Ferl, Gregory Z; Ng, Chee M
2013-05-01
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is often used to examine vascular function in malignant tumors and noninvasively monitor drug efficacy of antivascular therapies in clinical studies. However, complex numerical methods used to derive tumor physiological properties from DCE-MRI images can be time-consuming and computationally challenging. Recent advancement of computing technology in graphics processing unit (GPU) makes it possible to build an energy-efficient and high-power parallel computing platform for solving complex numerical problems. This study develops the first reported fast GPU-based method for nonparametric kinetic analysis of DCE-MRI data using clinical scans of glioblastoma patients treated with bevacizumab (Avastin®). In the method, contrast agent concentration-time profiles in arterial blood and tumor tissue are smoothed using a robust kernel-based regression algorithm in order to remove artifacts due to patient motion and then deconvolved to produce the impulse response function (IRF). The area under the curve (AUC) and mean residence time (MRT) of the IRF are calculated using statistical moment analysis, and two tumor physiological properties that relate to vascular permeability, volume transfer constant between blood plasma and extravascular extracellular space (K(trans)) and fractional interstitial volume (ve) are estimated using the approximations AUC/MRT and AUC. The most significant feature in this method is the use of GPU-computing to analyze data from more than 60,000 voxels in each DCE-MRI image in parallel fashion. All analysis steps have been automated in a single program script that requires only blood and tumor data as the sole input. The GPU-accelerated method produces K(trans) and ve estimates that are comparable to results from previous studies but reduces computational time by more than 80-fold compared to a previously reported central processing unit-based nonparametric method. Furthermore, it is at
A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling
Directory of Open Access Journals (Sweden)
Roger V Hoang
2013-10-01
Full Text Available Computational Neuroscience is an emerging field that provides unique opportunities to studycomplex brain structures through realistic neural simulations. However, as biological details are added tomodels, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs are now being utilized to accelerate simulations due to their ability to perform computations in parallel. As such, they haveshown significant improvement in execution time compared to Central Processing Units (CPUs. Most neural simulators utilize either multiple CPUs or a single GPU for better performance, but still show limitations in execution time when biological details are not sacrificed. Therefore, we present a novel CPU/GPU simulation environment for large-scale biological networks,the NeoCortical Simulator version 6 (NCS6. NCS6 is a free, open-source, parallelizable, and scalable simula-tor, designed to run on clusters of multiple machines, potentially with high performance computing devicesin each of them. It has built-in leaky-integrate-and-fire (LIF and Izhikevich (IZH neuron models, but usersalso have the capability to design their own plug-in interface for different neuron types as desired. NCS6is currently able to simulate one million cells and 100 million synapses in quasi real time by distributing dataacross these heterogeneous clusters of CPUs and GPUs.
A novel CPU/GPU simulation environment for large-scale biologically realistic neural modeling.
Hoang, Roger V; Tanna, Devyani; Jayet Bray, Laurence C; Dascalu, Sergiu M; Harris, Frederick C
2013-01-01
Computational Neuroscience is an emerging field that provides unique opportunities to study complex brain structures through realistic neural simulations. However, as biological details are added to models, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs) are now being utilized to accelerate simulations due to their ability to perform computations in parallel. As such, they have shown significant improvement in execution time compared to Central Processing Units (CPUs). Most neural simulators utilize either multiple CPUs or a single GPU for better performance, but still show limitations in execution time when biological details are not sacrificed. Therefore, we present a novel CPU/GPU simulation environment for large-scale biological networks, the NeoCortical Simulator version 6 (NCS6). NCS6 is a free, open-source, parallelizable, and scalable simulator, designed to run on clusters of multiple machines, potentially with high performance computing devices in each of them. It has built-in leaky-integrate-and-fire (LIF) and Izhikevich (IZH) neuron models, but users also have the capability to design their own plug-in interface for different neuron types as desired. NCS6 is currently able to simulate one million cells and 100 million synapses in quasi real time by distributing data across eight machines with each having two video cards.
Direct numerical simulation of turbulence using GPU accelerated supercomputers
Khajeh-Saeed, Ali; Blair Perot, J.
2013-02-01
Direct numerical simulations of turbulence are optimized for up to 192 graphics processors. The results from two large GPU clusters are compared to the performance of corresponding CPU clusters. A number of important algorithm changes are necessary to access the full computational power of graphics processors and these adaptations are discussed. It is shown that the handling of subdomain communication becomes even more critical when using GPU based supercomputers. The potential for overlap of MPI communication with GPU computation is analyzed and then optimized. Detailed timings reveal that the internal calculations are now so efficient that the operations related to MPI communication are the primary scaling bottleneck at all but the very largest problem sizes that can fit on the hardware. This work gives a glimpse of the CFD performance issues will dominate many hardware platform in the near future.
Energy Technology Data Exchange (ETDEWEB)
2016-07-15
The Kokkos Clang compiler is a version of the Clang C++ compiler that has been modified to perform targeted code generation for Kokkos constructs in the goal of generating highly optimized code and to provide semantic (domain) awareness throughout the compilation toolchain of these constructs such as parallel for and parallel reduce. This approach is taken to explore the possibilities of exposing the developer’s intentions to the underlying compiler infrastructure (e.g. optimization and analysis passes within the middle stages of the compiler) instead of relying solely on the restricted capabilities of C++ template metaprogramming. To date our current activities have focused on correct GPU code generation and thus we have not yet focused on improving overall performance. The compiler is implemented by recognizing specific (syntactic) Kokkos constructs in order to bypass normal template expansion mechanisms and instead use the semantic knowledge of Kokkos to directly generate code in the compiler’s intermediate representation (IR); which is then translated into an NVIDIA-centric GPU program and supporting runtime calls. In addition, by capturing and maintaining the higher-level semantics of Kokkos directly within the lower levels of the compiler has the potential for significantly improving the ability of the compiler to communicate with the developer in the terms of their original programming model/semantics.
High Speed 3D Tomography on CPU, GPU, and FPGA
Directory of Open Access Journals (Sweden)
GAC Nicolas
2008-01-01
Full Text Available Abstract Back-projection (BP is a costly computational step in tomography image reconstruction such as positron emission tomography (PET. To reduce the computation time, this paper presents a pipelined, prefetch, and parallelized architecture for PET BP (3PA-PET. The key feature of this architecture is its original memory access strategy, masking the high latency of the external memory. Indeed, the pattern of the memory references to the data acquired hinders the processing unit. The memory access bottleneck is overcome by an efficient use of the intrinsic temporal and spatial locality of the BP algorithm. A loop reordering allows an efficient use of general purpose processor's caches, for software implementation, as well as the 3D predictive and adaptive cache (3D-AP cache, when considering hardware implementations. Parallel hardware pipelines are also efficient thanks to a hierarchical 3D-AP cache: each pipeline performs a memory reference in about one clock cycle to reach a computational throughput close to 100%. The 3PA-PET architecture is prototyped on a system on programmable chip (SoPC to validate the system and to measure its expected performances. Time performances are compared with a desktop PC, a workstation, and a graphic processor unit (GPU.
High Speed 3D Tomography on CPU, GPU, and FPGA
Directory of Open Access Journals (Sweden)
Dominique Houzet
2009-02-01
Full Text Available Back-projection (BP is a costly computational step in tomography image reconstruction such as positron emission tomography (PET. To reduce the computation time, this paper presents a pipelined, prefetch, and parallelized architecture for PET BP (3PA-PET. The key feature of this architecture is its original memory access strategy, masking the high latency of the external memory. Indeed, the pattern of the memory references to the data acquired hinders the processing unit. The memory access bottleneck is overcome by an efficient use of the intrinsic temporal and spatial locality of the BP algorithm. A loop reordering allows an efficient use of general purpose processor's caches, for software implementation, as well as the 3D predictive and adaptive cache (3D-AP cache, when considering hardware implementations. Parallel hardware pipelines are also efficient thanks to a hierarchical 3D-AP cache: each pipeline performs a memory reference in about one clock cycle to reach a computational throughput close to 100%. The 3PA-PET architecture is prototyped on a system on programmable chip (SoPC to validate the system and to measure its expected performances. Time performances are compared with a desktop PC, a workstation, and a graphic processor unit (GPU.
Avaliação de desempenho e consumo energético para configurações de Wavefront pools de uma GPU AMD
Directory of Open Access Journals (Sweden)
Ariel Gustavo Zuquello
2016-07-01
Full Text Available O uso de sistemas heterogêneos CPU-GPU para atender à crescente demanda por aplicações com grande paralelismo de dados resulta na necessidade de estudar e avaliar tais arquiteturas para melhorá-las continuamente. Neste artigo foram feitas simulações da execução de uma suíte de benchmark em uma GPU AMD ATI RadeonTM HD 7970, de modo a avaliar o impacto sobre o desempenho e o consumo energético quando alterado o número de Wavefront Pools presentes em cada compute unit da GPU, que é 4 por padrão. O resultado mais significante evidencia um aumento de velocidade de cerca de 5,7% para a configuração com duas Wavefront Pools em conjunto com um aumento no consumo de energia de cerca de 5,1%. Todavia, as outras configurações avaliadas também representam opções para diferentes tipos de necessidades, conforme a categoria de demanda computacional.Palavras-chave: Sistemas heterogêneos. Simulações. Desempenho.Performance evaluation and energy consumption for settings of Wavefront pools of a GPU AMDAbstractThe use of CPU-GPU heterogeneous systems to meet the growing demand for applications with large data parallelism results in the need to study and evaluate these architectures in order to improve them continuously. In this paper we made simulations of running a benchmark suite on an AMD GPU ATI RadeonTM HD 7970 in order to assess the impact on performance and power consumption when tuning the number of Wavefront Pools present in each GPU compute unit, which is 4 by default. The most significant result shows a speedup of about 5.7% for configuration with two Wavefront Pools in conjunction with an increase of about 5.1% in the energy consumption. However, the other evaluated configuration also represent options for different kinds of needs, according to the computational demand.Keyworks: Heterogeneous systems. Simulation. Performance.
Directory of Open Access Journals (Sweden)
J. Adam Wilson
2009-07-01
Full Text Available The clock speeds of modern computer processors have nearly plateaued in the past five years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card (GPU was developed for real-time neural signal processing of a brain-computer interface (BCI. The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter, followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally-intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a CPU-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
Geological Visualization System with GPU-Based Interpolation
Huang, L.; Chen, K.; Lai, Y.; Chang, P.; Song, S.
2011-12-01
There has been a large number of research using parallel-processing GPU to accelerate the computation. In Near Surface Geology efficient interpolations are critical for proper interpretation of measured data. Additionally, an appropriate interpolation method for generating proper results depends on the factors such as the dense of the measured locations and the estimation model. Therefore, fast interpolation process is needed to efficiently find a proper interpolation algorithm for a set of collected data. However, a general CPU framework has to process each computation in a sequential manner and is not efficient enough to handle a large number of interpolation generally needed in Near Surface Geology. When carefully observing the interpolation processing, the computation for each grid point is independent from all other computation. Therefore, the GPU parallel framework should be an efficient technology to accelerate the interpolation process which is critical in Near Surface Geology. Thus in this paper we design a geological visualization system whose core includes a set of interpolation algorithms including Nearest Neighbor, Inverse Distance and Kriging. All these interpolation algorithms are implemented using both the CPU framework and GPU framework. The comparison between CPU and GPU implementation in the aspect of precision and processing speed shows that parallel computation can accelerate the interpolation process and also demonstrates the possibility of using GPU-equipped personal computer to replace the expensive workstation. Immediate update at the measurement site is the dream of geologists. In the future the parallel and remote computation ability of cloud will be explored to make the mobile computation on the measurement site possible.
Kauffmann, Claude; Tang, An; Therasse, Eric; Soulez, Gilles
2010-03-01
We developed a hybrid CPU-GPU framework enabling semi-automated segmentation of abdominal aortic aneurysm (AAA) on Computed Tomography Angiography (CTA) examinations. AAA maximal diameter (D-max) and volume measurements and their progression between 2 examinations can be generated by this software improving patient followup. In order to improve the workflow efficiency some segmentation tasks were implemented and executed on the graphics processing unit (GPU). A GPU based algorithm is used to automatically segment the lumen of the aneurysm within short computing time. In a second step, the user interacted with the software to validate the boundaries of the intra-luminal thrombus (ILT) on GPU-based curved image reformation. Automatic computation of D-max and volume were performed on the 3D AAA model. Clinical validation was conducted on 34 patients having 2 consecutive MDCT examinations within a minimum interval of 6 months. The AAA segmentation was performed twice by a experienced radiologist (reference standard) and once by 3 unsupervised technologists on all 68 MDCT. The ICC for intra-observer reproducibility was 0.992 (>=0.987) for D-max and 0.998 (>=0.994) for volume measurement. The ICC for inter-observer reproducibility was 0.985 (0.977-0.90) for D-max and 0.998 (0.996- 0.999) for volume measurement. Semi-automated AAA segmentation for volume follow-up was more than twice as sensitive than D-max follow-up, while providing an equivalent reproducibility.
Porting a Hall MHD Code to a Graphic Processing Unit
Dorelli, John C.
2011-01-01
We present our experience porting a Hall MHD code to a Graphics Processing Unit (GPU). The code is a 2nd order accurate MUSCL-Hancock scheme which makes use of an HLL Riemann solver to compute numerical fluxes and second-order finite differences to compute the Hall contribution to the electric field. The divergence of the magnetic field is controlled with Dedner?s hyperbolic divergence cleaning method. Preliminary benchmark tests indicate a speedup (relative to a single Nehalem core) of 58x for a double precision calculation. We discuss scaling issues which arise when distributing work across multiple GPUs in a CPU-GPU cluster.
Numerical computations with GPUs
Kindratenko, Volodymyr
2014-01-01
This book brings together research on numerical methods adapted for Graphics Processing Units (GPUs). It explains recent efforts to adapt classic numerical methods, including solution of linear equations and FFT, for massively parallel GPU architectures. This volume consolidates recent research and adaptations, covering widely used methods that are at the core of many scientific and engineering computations. Each chapter is written by authors working on a specific group of methods; these leading experts provide mathematical background, parallel algorithms and implementation details leading to
GPU-Powered Coherent Beamforming
Magro, Alessio; Hickish, Jack
2014-01-01
GPU-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimised for deployment at the BEST-2 array which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves $\\sim$1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimised, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel as well as the transient detection pipeline with beamforming capabilities as well as results of test observation.
Glotsos, Dimitris; Kostopoulos, Spiros; Sidiropoulos, Konstantinos; Ravazoula, Panagiota; Kalatzis, Ioannis; Asvestas, Pantelis; Cavouras, Dionisis
2014-01-01
In this study a Computer-Aided Microscopy (CAM) system is proposed for investigating the importance of the histological criteria involved in diagnosing of cancers in microscopy in order to suggest the more informative features for discriminating low from high-grade brain tumours. Four families of criteria have been examined, involving the greylevel variations (i.e. texture), the morphology (i.e. roundness), the architecture (i.e. cellularity) and the overall tumour qualities (expert's ordinal scale). The proposed CAM system was constructed using a modified Seeded Region Growing algorithm for image segmentation, and the Probabilistic Neural Network classifier for image classification. The implementation was designed on a commercial Graphics Processing Unit card using parallel programming. The system's performance using textural, morphological, architectural and ordinal information was 90.8%, 87.0%, 81.2% and 88.9% respectively. Results indicate that nuclei texture is the most important family of features regarding the degree of malignancy, and, thus, may guide more accurate predictions for discriminating low from high grade gliomas. Considering that nuclei texture is almost impractical to be encoded by visual observation, the need to incorporate computer-aided diagnostic tools as second opinion in daily clinical practice of diagnosing rare brain tumours may be justified.
Software Accelerates Computing Time for Complex Math
2014-01-01
Ames Research Center awarded Newark, Delaware-based EM Photonics Inc. SBIR funding to utilize graphic processing unit (GPU) technology- traditionally used for computer video games-to develop high-computing software called CULA. The software gives users the ability to run complex algorithms on personal computers with greater speed. As a result of the NASA collaboration, the number of employees at the company has increased 10 percent.
Simulation of isothermal multi-phase fuel-coolant interaction using MPS method with GPU acceleration
Energy Technology Data Exchange (ETDEWEB)
Gou, W.; Zhang, S.; Zheng, Y. [Zhejiang Univ., Hangzhou (China). Center for Engineering and Scientific Computation
2016-07-15
The energetic fuel-coolant interaction (FCI) has been one of the primary safety concerns in nuclear power plants. Graphical processing unit (GPU) implementation of the moving particle semi-implicit (MPS) method is presented and used to simulate the fuel coolant interaction problem. The governing equations are discretized with the particle interaction model of MPS. Detailed implementation on single-GPU is introduced. The three-dimensional broken dam is simulated to verify the developed GPU acceleration MPS method. The proposed GPU acceleration algorithm and developed code are then used to simulate the FCI problem. As a summary of results, the developed GPU-MPS method showed a good agreement with the experimental observation and theoretical prediction.
Kantor, Jiří
2013-01-01
Tato práce popisuje tvorbu jednoduchého raytraceru pro OpenSceneGraph, který běží na grafické kartě. V práci jsou popsány věci, které bylo nutné provést v OpenSceneGraphu, aby bylo možno předávat data do GPU a také několik metod pro hledání průsečíků paprsku a trojúhelníku, což je klíčový algoritmus v raytracingu. This work describes creation of a simple raytracer for OpenSceneGraph, which performs its operations on the graphics card. Things, that needed to be done in OpenSceneGraph in ord...
GPU-centric resolved-particle disperse two-phase flow simulation using the Physalis method
Sierakowski, Adam J.
2016-10-01
We present work on a new implementation of the Physalis method for resolved-particle disperse two-phase flow simulations. We discuss specifically our GPU-centric programming model that avoids all device-host data communication during the simulation. Summarizing the details underlying the implementation of the Physalis method, we illustrate the application of two GPU-centric parallelization paradigms and record insights on how to best leverage the GPU's prioritization of bandwidth over latency. We perform a comparison of the computational efficiency between the current GPU-centric implementation and a legacy serial-CPU-optimized code and conclude that the GPU hardware accounts for run time improvements up to a factor of 60 by carefully normalizing the run times of both codes.
Semi-automatic tool to ease the creation and optimization of GPU programs
DEFF Research Database (Denmark)
Jepsen, Jacob
2014-01-01
We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form...... and apply optimizations thereto. Based on experiments, the code generated by the tool can be 3-256X faster than code generated by an OpenACC compiler, 4-37X faster than optimized CPU code, and attain up to 25% of peak performance of the GPU. We found that by using pattern-matching rules, many...... of the transformations can be performed automatically, which makes the tool usable for both novices and experts in GPU programming....
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations.
Kutzner, Carsten; Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L; Grubmüller, Helmut
2015-10-05
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime.
GPU-based fast Monte Carlo dose calculation for proton therapy.
Jia, Xun; Schümann, Jan; Paganetti, Harald; Jiang, Steve B
2012-12-07
Accurate radiation dose calculation is essential for successful proton radiotherapy. Monte Carlo (MC) simulation is considered to be the most accurate method. However, the long computation time limits it from routine clinical applications. Recently, graphics processing units (GPUs) have been widely used to accelerate computationally intensive tasks in radiotherapy. We have developed a fast MC dose calculation package, gPMC, for proton dose calculation on a GPU. In gPMC, proton transport is modeled by the class II condensed history simulation scheme with a continuous slowing down approximation. Ionization, elastic and inelastic proton nucleus interactions are considered. Energy straggling and multiple scattering are modeled. Secondary electrons are not transported and their energies are locally deposited. After an inelastic nuclear interaction event, a variety of products are generated using an empirical model. Among them, charged nuclear fragments are terminated with energy locally deposited. Secondary protons are stored in a stack and transported after finishing transport of the primary protons, while secondary neutral particles are neglected. gPMC is implemented on the GPU under the CUDA platform. We have validated gPMC using the TOPAS/Geant4 MC code as the gold standard. For various cases including homogeneous and inhomogeneous phantoms as well as a patient case, good agreements between gPMC and TOPAS/Geant4 are observed. The gamma passing rate for the 2%/2 mm criterion is over 98.7% in the region with dose greater than 10% maximum dose in all cases, excluding low-density air regions. With gPMC it takes only 6-22 s to simulate 10 million source protons to achieve ∼1% relative statistical uncertainty, depending on the phantoms and energy. This is an extremely high efficiency compared to the computational time of tens of CPU hours for TOPAS/Geant4. Our fast GPU-based code can thus facilitate the routine use of MC dose calculation in proton therapy.
GPU accelerated spectral finite elements on all-hex meshes
Remacle, J.-F.; Gandham, R.; Warburton, T.
2016-11-01
This paper presents a spectral element finite element scheme that efficiently solves elliptic problems on unstructured hexahedral meshes. The discrete equations are solved using a matrix-free preconditioned conjugate gradient algorithm. An additive Schwartz two-scale preconditioner is employed that allows h-independence convergence. An extensible multi-threading programming API is used as a common kernel language that allows runtime selection of different computing devices (GPU and CPU) and different threading interfaces (CUDA, OpenCL and OpenMP). Performance tests demonstrate that problems with over 50 million degrees of freedom can be solved in a few seconds on an off-the-shelf GPU.
Kalantzis, Georgios; Tachibana, Hidenobu
2014-01-01
For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Research on GPU Acceleration for Monte Carlo Criticality Calculation
Xu, Qi; Yu, Ganglin; Wang, Kan
2014-06-01
The Monte Carlo neutron transport method can be naturally parallelized by multi-core architectures due to the dependency between particles during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular way of parallelism in the field of scientific supercomputing. Thus, this work focuses on the GPU acceleration method for the Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The "neutron transport step" is introduced to increase the GPU thread occupancy. In order to test the sensitivity of the MC code's complexity, a 1D one-group code and a 3D multi-group general purpose code are respectively transplanted to GPUs, and the acceleration effects are compared. The result of numerical experiments shows considerable acceleration effect of the "neutron transport step" strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs.
High-Performance Matrix-Vector Multiplication on the GPU
DEFF Research Database (Denmark)
Sørensen, Hans Henrik Brandenborg
2012-01-01
In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing...
A Heterogeneous Accelerated Matrix Multiplication: OpenCL + APU + GPU+ Fast Matrix Multiply
D'Alberto, Paolo
2012-01-01
As users and developers, we are witnessing the opening of a new computing scenario: the introduction of hybrid processors into a single die, such as an accelerated processing unit (APU) processor, and the plug-and-play of additional graphics processing units (GPUs) onto a single motherboard. These APU processors provide multiple symmetric cores with their memory hierarchies and an integrated GPU. Moreover, these processors are designed to work with external GPUs that can push the peak performance towards the TeraFLOPS boundary. We present a case study for the development of dense Matrix Multiplication (MM) codes for matrix sizes up to 19K\\times19K, thus using all of the above computational engines, and an achievable peak performance of 200 GFLOPS for, literally, a made- at-home built. We present the results of our experience, the quirks, the pitfalls, the achieved performance, and the achievable peak performance.
A GPU implementation of a track-repeating algorithm for proton radiotherapy dose calculations
Yepes, Pablo P; Taddei, Phillip J
2010-01-01
An essential component in proton radiotherapy is the algorithm to calculate the radiation dose to be delivered to the patient. The most common dose algorithms are fast but they are approximate analytical approaches. However their level of accuracy is not always satisfactory, especially for heterogeneous anatomic areas, like the thorax. Monte Carlo techniques provide superior accuracy, however, they often require large computation resources, which render them impractical for routine clinical use. Track-repeating algorithms, for example the Fast Dose Calculator, have shown promise for achieving the accuracy of Monte Carlo simulations for proton radiotherapy dose calculations in a fraction of the computation time. We report on the implementation of the Fast Dose Calculator for proton radiotherapy on a card equipped with graphics processor units (GPU) rather than a central processing unit architecture. This implementation reproduces the full Monte Carlo and CPU-based track-repeating dose calculations within 2%, w...
Institute of Scientific and Technical Information of China (English)
赵嵩; 徐彦; 曹海旺; 杨恒
2015-01-01
提出了一种多特征融合粒子滤波跟踪算法，并利用 GPU (Graphic Processing Unit)技术对算法进行了并行优化。针对单一特征描述目标模型的缺陷，此算法采用了具有互补性的灰度与梯度直方图特征建立目标模型，从而提高粒子滤波算法跟踪的稳定性和精度。同时，针对粒子滤波计算量大的缺点，此算法对粒子滤波进行了基于GPU 的并行优化设计和实现，从而提升跟踪算法的计算速度。可以满足算法的实时性应用。%A parallel particle filter object tracking algorithm is given out,which is based on multiple feature fusion with the help of GPU (Graphic Processing Unit)technology.Due to the limitation of the model representation based on single visual feature,two complementary visual features,which are gray histogram and gradient histogram,are used in the algorithm to improve the tracking stability and accuracy.Moreover,to handle the large amount computation cost of the particle filter,a GPU parallel optimized scheme is designed to improve the algorithm speed. and can meet the real-time application requirement.
Cloud GPU-based simulations for SQUAREMR
Kantasis, George; Xanthis, Christos G.; Haris, Kostas; Heiberg, Einar; Aletras, Anthony H.
2017-01-01
Quantitative Magnetic Resonance Imaging (MRI) is a research tool, used more and more in clinical practice, as it provides objective information with respect to the tissues being imaged. Pixel-wise T1 quantification (T1 mapping) of the myocardium is one such application with diagnostic significance. A number of mapping sequences have been developed for myocardial T1 mapping with a wide range in terms of measurement accuracy and precision. Furthermore, measurement results obtained with these pulse sequences are affected by errors introduced by the particular acquisition parameters used. SQUAREMR is a new method which has the potential of improving the accuracy of these mapping sequences through the use of massively parallel simulations on Graphical Processing Units (GPUs) by taking into account different acquisition parameter sets. This method has been shown to be effective in myocardial T1 mapping; however, execution times may exceed 30 min which is prohibitively long for clinical applications. The purpose of this study was to accelerate the construction of SQUAREMR's multi-parametric database to more clinically acceptable levels. The aim of this study was to develop a cloud-based cluster in order to distribute the computational load to several GPU-enabled nodes and accelerate SQUAREMR. This would accommodate high demands for computational resources without the need for major upfront equipment investment. Moreover, the parameter space explored by the simulations was optimized in order to reduce the computational load without compromising the T1 estimates compared to a non-optimized parameter space approach. A cloud-based cluster with 16 nodes resulted in a speedup of up to 13.5 times compared to a single-node execution. Finally, the optimized parameter set approach allowed for an execution time of 28 s using the 16-node cluster, without compromising the T1 estimates by more than 10 ms. The developed cloud-based cluster and optimization of the parameter set reduced
Computational wave optics library for C++: CWO++ library
Shimobaba, Tomoyoshi; Sakurai, Takahiro; Okada, Naohisa; Nishitsuji, Takashi; Takada, Naoki; Shiraki, Atsushi; Masuda, Nobuyuki; Ito, Tomoyoshi
2011-01-01
Diffraction calculations, such as the angular spectrum method, and Fresnel diffractions, are used for calculating scalar light propagation. The calculations are used in wide-ranging optics fields: for example, computer generated holograms (CGHs), digital holography, diffractive optical elements, microscopy, image encryption and decryption, three-dimensional analysis for optical devices and so on. However, increasing demands made by large-scale diffraction calculations have rendered the computational power of recent computers insufficient. We have already developed a numerical library for diffraction calculations using a graphic processing unit (GPU), which was named the GWO library. However, this GWO library is not user-friendly, since it is based on C language and was also run only on a GPU. In this paper, we develop a new C++ class library for