Accelerated 3D Monte Carlo light dosimetry using a graphics processing unit (GPU) cluster
Lo, William Chun Yip; Lilge, Lothar
2010-11-01
This paper presents a basic computational framework for real-time, 3-D light dosimetry on graphics processing unit (GPU) clusters. The GPU-based approach offers a direct solution to the long computation time that prevents Monte Carlo simulations from being used in complex optimization problems such as treatment planning, particularly if simulated annealing is employed as the optimization algorithm. The current multi-GPU implementation is validated using commercial light modelling software (ASAP from Breault Research Organization). It also supports the latest Fermi GPU architecture and features an interactive 3-D visualization interface. The software is available for download at http://code.google.com/p/gpu3d.
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).
Yang, Owen; Choi, Bernard
2013-01-01
To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach using the Graphics Processing Unit (GPU) to accelerate rescaling of single Monte Carlo runs to calculate rapidly diffuse reflectance values for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude as compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08ms. The transfer time from CPU to GPU memory currently is a limiting factor with GPU-based calculations. However, for calculation of multiple diffuse reflectance values, our GPU-based approach still can lead to processing that is ~3400 times faster than other GPU-based approaches.
Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin
2015-01-15
Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22× speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. Copyright © 2014 Elsevier B.V. All rights reserved.
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background: Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method: Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results: We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s): To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions: Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing
Directory of Open Access Journals (Sweden)
Teng Li
2013-11-01
The increasing incorporation of Graphics Processing Units (GPUs) as accelerators has been one of the forefront High Performance Computing (HPC) trends and provides unprecedented performance; however, the prevalent adoption of the Single-Program Multiple-Data (SPMD) programming model brings with it challenges of resource underutilization. In other words, under SPMD, every CPU needs GPU capability available to it. However, since CPUs generally outnumber GPUs, the asymmetric resource distribution gives rise to overall computing resource underutilization. In this paper, we propose to efficiently share the GPU under SPMD and formally define a series of GPU sharing scenarios. We provide performance-modeling analysis for each sharing scenario with accurate experimentation validation. With the modeling basis, we further conduct experimental studies to explore potential GPU sharing efficiency improvements from multiple perspectives. Both further theoretical and experimental GPU sharing performance analysis and results are presented. Our results not only demonstrate the significant performance gain for SPMD programs with the proposed efficient GPU sharing, but also the further improved sharing efficiency with the optimization techniques based on our accurate modeling.
SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).
Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J
2012-06-01
To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processing Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to perform CPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The times required by the CPU and the GPU to complete the full algorithm were benchmarked, and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to-GPU time was greater than 1.0 for all images, which indicates that the GPU performs the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with an FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability.
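The Fourier-space translation estimate described in this abstract can be illustrated with phase correlation. The following is a minimal sketch, not the authors' IDL program: it works on a 1D signal instead of an image, handles only whole-sample circular shifts (no rotation, no sub-pixel padding), and uses a naive discrete Fourier transform as a stand-in for an FFT library. The function names `dft` and `estimate_shift` are hypothetical.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT library)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(signal):
            acc += x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def estimate_shift(reference, moved):
    """Return the circular shift s for which moved[i] == reference[i - s]."""
    f_ref = dft(reference)
    f_mov = dft(moved)
    cross = []
    for a, b in zip(f_ref, f_mov):
        product = b * a.conjugate()
        magnitude = abs(product)
        # Keep only the phase; skip frequencies that carry no energy.
        cross.append(product / magnitude if magnitude > 1e-12 else 0j)
    # The inverse transform of the pure-phase spectrum peaks at the shift.
    correlation = [c.real for c in dft(cross, inverse=True)]
    return max(range(len(correlation)), key=correlation.__getitem__)
```

A shift of a signal multiplies its spectrum by a linear phase ramp, so the normalised cross-spectrum transforms back to a single peak located at the shift. The same idea extends to 2D images, where every frequency bin can be processed independently, which is the kind of work that maps well onto a GPU.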
GPU MrBayes V3.1: MrBayes on Graphics Processing Units for Protein Sequence Data.
Pang, Shuai; Stones, Rebecca J; Ren, Ming-Ming; Liu, Xiao-Guang; Wang, Gang; Xia, Hong-ju; Wu, Hao-Yang; Liu, Yang; Xie, Qiang
2015-09-01
We present a modified GPU (graphics processing unit) version of MrBayes, called ta(MC)³ (GPU MrBayes V3.1), for Bayesian phylogenetic inference on protein data sets. Our main contributions are 1) utilizing 64-bit variables, thereby enabling ta(MC)³ to process larger data sets than MrBayes; and 2) using Kahan summation to improve accuracy, convergence rates, and consequently runtime. Versus the current fastest software, we achieve a speedup of up to around 2.5 (and up to around 90 vs. serial MrBayes), and more on multi-GPU hardware. GPU MrBayes V3.1 is available from http://sourceforge.net/projects/mrbayes-gpu/.
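Kahan (compensated) summation, mentioned in this abstract, keeps a running correction term for the low-order bits that ordinary floating-point addition discards. The sketch below shows the general technique on a plain list of numbers; it is not taken from the GPU MrBayes source, and the function names are hypothetical.

```python
def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated (Kahan) summation."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # re-inject the previously lost part
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total
```

Summing 1.0 followed by many terms of 1e-16 shows the difference: each small term is below half a unit in the last place of 1.0, so the naive loop never moves away from 1.0, while the compensated loop accumulates the small terms correctly.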
permGPU: Using graphics processing units in RNA microarray association studies
Directory of Open Access Journals (Sweden)
George Stephen L
2010-06-01
Background: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. Results: We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. Conclusions: permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
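Permutation resampling for a binary trait can be sketched in a few lines. The example below is an illustration of the general procedure only, not of permGPU's CUDA code or its R interface: the test statistic (difference in group means), the function names, and the default number of permutations are all assumptions chosen for brevity.

```python
import random


def mean_difference(values, labels):
    """Test statistic: mean of group 1 minus mean of group 0."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for one feature (for example one probe)."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between trait and expression
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed labelling and keep the estimate above 0.
    return (hits + 1) / (n_permutations + 1)
```

Each permutation, and each feature, is evaluated independently of the others, which is why this kind of analysis is called embarrassingly parallel and why it distributes naturally over many GPU threads.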
Graphics Processing Unit (GPU) Acceleration of the Goddard Earth Observing System Atmospheric Model
Putnam, William M.
2011-01-01
The Goddard Earth Observing System 5 (GEOS-5) is the atmospheric model used by the Global Modeling and Assimilation Office (GMAO) for a variety of applications, from long-term climate prediction at relatively coarse resolution, to data assimilation and numerical weather prediction, to very high-resolution cloud-resolving simulations. GEOS-5 is being ported to a graphics processing unit (GPU) cluster at the NASA Center for Climate Simulation (NCCS). By utilizing GPU co-processor technology, we expect to increase the throughput of GEOS-5 by at least an order of magnitude, and accelerate the process of scientific exploration across all scales of global modeling, including: (1) the large-scale, high-end application of non-hydrostatic, global, cloud-resolving modeling at 10- to 1-kilometer (km) global resolutions; (2) intermediate-resolution seasonal climate and weather prediction at 50- to 25-km on small clusters of GPUs; and (3) long-range, coarse-resolution climate modeling, enabled on a small box of GPUs for the individual researcher. After being ported to the GPU cluster, the primary physics components and the dynamical core of GEOS-5 have demonstrated a potential speedup of 15-40 times over conventional processor cores. Performance improvements of this magnitude reduce the required scalability of 1-km, global, cloud-resolving models from an unfathomable 6 million cores to an attainable 200,000 GPU-enabled cores.
Acceleration of orbital-free first principles calculation with graphics processing unit GPU
Energy Technology Data Exchange (ETDEWEB)
Aoki, M [School of Management, Shizuoka Sangyo University, 1572-1, Ohwara, Iwata, 438-0043 (Japan); Tomono, H; Tsumuraya, K [School of Science and Technology, Meiji University, 1-1-1, Higashi-mita, Tama, Kawasaki, 214-8571 (Japan); Iitaka, T, E-mail: maoki@ssu.ac.j [Computational Astrophysics Laboratory, RIKEN (Institute of Physical and Chemical Research), Hirosawa 2-1, Wako, Saitama, 351-0198 (Japan)
2010-03-01
Computational material design requires efficient algorithms and high-speed computers for calculating and predicting material properties. The orbital-free first principles calculation (OF-FPC) method, which is a tool for calculating and designing material properties, is an O(N) method and is suitable for large-scale systems. The stagnation in the development of CPU devices with high mobility of electron carriers has driven the development of parallel computing and the production of CPU devices with finer spaced wiring. We propose, for the first time, another method to accelerate the computation, using a graphics processing unit (GPU). Integrating the GPU-based Fast Fourier Transform library (CUFFT) into our in-house OF-FPC code reduces the computation time to half of that of the CPU.
Fast computation of MadGraph amplitudes on graphics processing unit (GPU)
Energy Technology Data Exchange (ETDEWEB)
Hagiwara, K. [KEK Theory Center and Sokendai, Tsukuba (Japan); Kanzaki, J. [KEK and Sokendai, Tsukuba (Japan); Li, Q. [Peking University, Department of Physics and State Key, Laboratory of Nuclear Physics and Technology, Beijing (China); Okamura, N. [International University of Health and Welfare, Department of Radiological Sciences, Ohtawara, Tochigi (Japan); Stelzer, T. [University of Illinois, Department of Physics, Urbana, IL (United States)
2013-11-15
Continuing our previous studies on QED and QCD processes, we use the graphics processing unit (GPU) for fast calculations of helicity amplitudes for general Standard Model (SM) processes. Additional HEGET codes to handle all SM interactions are introduced, as well as the program MG2CUDA that converts arbitrary MadGraph generated HELAS amplitudes (FORTRAN) into HEGET codes in CUDA. We test all the codes by comparing amplitudes and cross sections for multi-jet processes at the LHC associated with production of single and double weak bosons, a top-quark pair, Higgs boson plus a weak boson or a top-quark pair, and multiple Higgs bosons via weak-boson fusion, where all the heavy particles are allowed to decay into light quarks and leptons with full spin correlations. All the helicity amplitudes computed by HEGET are found to agree with those computed by HELAS within the expected numerical accuracy, and the cross sections obtained by gBASES, a GPU version of the Monte Carlo integration program, agree with those obtained by BASES (FORTRAN), as well as those obtained by MadGraph. The performance of GPU was over a factor of 10 faster than CPU for all processes except those with the highest number of jets. (orig.)
Software Graphics Processing Unit (sGPU) for Deep Space Applications
McCabe, Mary; Salazar, George; Steele, Glen
2015-01-01
A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggests they cannot operate in the deep space radiation environment. Investigation into a Software Graphics Processing Unit (sGPU) comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.
2013-04-15
... Operating License No. DPR-73 issued for Three Mile Island Nuclear Power Station (TMI), Unit 2, located in... NUCLEAR REGULATORY COMMISSION [Docket No. 50-320; NRC-2013-0065] GPU Nuclear Inc., Three Mile Island Nuclear Power Station, Unit 2, Exemption From Certain Security Requirements AGENCY: Nuclear...
Gorospe, George E., Jr.; Daigle, Matthew J.; Sankararaman, Shankar; Kulkarni, Chetan S.; Ng, Eley
2017-01-01
Prognostic methods enable operators and maintainers to predict the future performance for critical systems. However, these methods can be computationally expensive and may need to be performed each time new information about the system becomes available. In light of these computational requirements, we have investigated the application of graphics processing units (GPUs) as a computational platform for real-time prognostics. Recent advances in GPU technology have reduced cost and increased the computational capability of these highly parallel processing units, making them more attractive for the deployment of prognostic software. We present a survey of model-based prognostic algorithms with considerations for leveraging the parallel architecture of the GPU and a case study of GPU-accelerated battery prognostics with computational performance results.
Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R
2012-02-23
We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.
GPU Computing Gems Emerald Edition
Hwu, Wen-mei W
2011-01-01
".the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010 Graphics processing units (GPUs) can do much more than render graphics. Scientists and researchers increasingly look to GPUs to improve the efficiency and performance of computationally-intensive experiments across a range of disciplines. GPU Computing Gems: Emerald Edition brings their techniques to you, showcasing GPU-based solutions including: Black hole simulations with CUDA GPU-accelerated computation and interactive display of
Graphics processing unit (GPU)-based computation of heat conduction in thermally anisotropic solids
Nahas, C. A.; Balasubramaniam, Krishnan; Rajagopal, Prabhu
2013-01-01
Numerical modeling of anisotropic media is a computationally intensive task, since the physical properties differ in different directions and this adds complexity to the field problem. Largely used in the aerospace industry because of their lightweight nature, composite materials are a very good example of thermally anisotropic media. With advancements in video gaming technology, parallel processors are much cheaper today and accessibility to higher-end graphical processing devices has increased dramatically over the past couple of years. Since these massively parallel GPUs are very good at handling floating point arithmetic, they provide a new platform for engineers and scientists to accelerate their numerical models using commodity hardware. In this paper we implement a parallel finite difference model of thermal diffusion through anisotropic media using NVIDIA CUDA (Compute Unified Device Architecture). We use the NVIDIA GeForce GTX 560 Ti as our primary computing device, which consists of 384 CUDA cores clocked at 1645 MHz, with a standard desktop PC as the host platform. We compare the results with those of a standard CPU implementation for accuracy and speed and draw implications for simulation using the GPU paradigm.
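A finite difference model of anisotropic thermal diffusion can be sketched as an explicit update on a 2D grid. The code below is a serial illustration under stated assumptions, not the paper's CUDA implementation: the principal conduction axes are assumed to be aligned with the grid, so the equation reduces to dT/dt = alpha_x * d²T/dx² + alpha_y * d²T/dy²; boundary temperatures are held fixed; and all names and default values are hypothetical.

```python
def step(temp, alpha_x, alpha_y, dx, dy, dt):
    """One explicit (forward-time, centred-space) step; boundaries stay fixed."""
    rows, cols = len(temp), len(temp[0])
    new = [row[:] for row in temp]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            d2x = (temp[i][j + 1] - 2.0 * temp[i][j] + temp[i][j - 1]) / dx**2
            d2y = (temp[i + 1][j] - 2.0 * temp[i][j] + temp[i - 1][j]) / dy**2
            new[i][j] = temp[i][j] + dt * (alpha_x * d2x + alpha_y * d2y)
    return new


def stable_time_step(alpha_x, alpha_y, dx, dy, safety=0.9):
    """Time step below the explicit stability limit of the scheme above."""
    return safety / (2.0 * (alpha_x / dx**2 + alpha_y / dy**2))


def simulate(size=21, steps=50, alpha_x=4.0, alpha_y=1.0):
    """Diffuse a hot spot placed at the centre of a cold square plate."""
    temp = [[0.0] * size for _ in range(size)]
    temp[size // 2][size // 2] = 100.0
    dt = stable_time_step(alpha_x, alpha_y, 1.0, 1.0)
    for _ in range(steps):
        temp = step(temp, alpha_x, alpha_y, 1.0, 1.0, dt)
    return temp
```

Because every interior point is updated from the previous time level only, the two nested loops in `step` carry no dependencies between iterations. That is the property that allows one GPU thread to be assigned to each grid point. With a larger diffusivity along x than along y, the hot spot spreads further along the rows than along the columns.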
Analysis and implementation of a theoretical model and GPU algorithms for multi-GPU systems
小山田, 徹
2017-01-01
The GPU (Graphics Processing Unit) has high computing power. The technology called GPGPU, which uses the GPU for general-purpose computation, is used in various fields. In this research, we construct a theoretical model of a multi-GPU system composed of several GPUs, and we analyze and implement algorithms that operate on such a system. Keywords: multi-GPU, theoretical model
Schmidt, J.; Piret, C.; Zhang, N.; Kadlec, B. J.; Liu, Y.; Yuen, D. A.; Wright, G. B.; Sevre, E. O.
2008-12-01
The faster growth in the speed of GPUs relative to CPUs in recent years, together with their rapidly gained popularity, has spawned a new area of development in computational technology. There is much potential in utilizing GPUs for solving evolutionary partial differential equations and producing the attendant visualization. We are concerned with modeling tsunami waves, where computational time is of the essence for broadcasting warnings. In order to test the efficacy of the GPU on the set of shallow-water equations, we employed the NVIDIA 8600M GT board on a MacBook Pro. We have compared the relative speeds of the CPU and the GPU on a single processor for two types of spatial discretization, based on second-order finite differences and on radial basis functions (RBFs). RBFs are a more novel method based on a gridless, multi-scale, adaptive framework. Using the NVIDIA 8600M GT, we obtained a speedup factor of 8 in favor of the GPU for the finite-difference method and a factor of 7 for the RBF scheme. We have also studied the atmospheric dynamics problem of swirling flows over a spherical surface and found a speedup of 5.3 using the GPU. The time steps employed for the RBF method are larger than those used in finite differences, because the RBF method needs far fewer nodal points. Thus, in modeling the same physical time, RBFs acting in concert with the GPU would be the fastest approach.
Putnam, William M.
2011-01-01
Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breath-taking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions.
Acceleration of PIC simulation with GPU
International Nuclear Information System (INIS)
Suzuki, Junya; Shimazu, Hironori; Fukazawa, Keiichiro; Den, Mitsue
2011-01-01
Particle-in-cell (PIC) is a simulation technique for plasma physics. The large number of particles in high-resolution plasma simulation increases the volume of computation required, making it vital to increase computation speed. In this study, we attempt to accelerate computation on graphics processing units (GPUs) using KEMPO, a PIC simulation code package. We perform two benchmark tests, with small and large grid sizes. In these tests, we run the KEMPO1 code using a CPU only, both a CPU and a GPU, and a GPU only. The results showed that performance using only a GPU was twice that of using a CPU alone. However, the execution time when using both a CPU and a GPU was comparable to that with a CPU alone, because of the significant bottleneck in communication between the CPU and GPU. (author)
Distributed GPU Computing in GIScience
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Michael R. Saunders; Justin E. Arseneault
2013-01-01
In long-term, large-scale forest management studies, documentation of pre-treatment differences among and variability within experimental units is critical for drawing the proper inferences from imposed treatments. We compared pre-treatment overstory and large shrub communities (diameters at breast height >1.5 cm) for the 9 research cores with the Hardwood Ecosystem...
Validation of GPU based TomoTherapy dose calculation engine.
Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond
2012-04-01
The graphics processing unit (GPU) based TomoTherapy convolution/superposition (C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. Before the commercial release of the GPU dose engine, its accuracy has to be validated. Thirty-eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified with absolute point dose measurements with an ion chamber and with film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantoms and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) < 1. The GPU dose engine also showed a similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.
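The gamma index used for the equivalency evaluation combines a dose-difference tolerance and a distance-to-agreement tolerance into one number per point. The sketch below evaluates it on a 1D dose profile by brute force. It is an illustration of the commonly used definition, not the validation code from the study; the global normalisation to the maximum reference dose, the 1D geometry, and the function names are assumptions.

```python
import math


def gamma_index(reference, evaluated, spacing_mm, dose_tol_fraction, dist_tol_mm):
    """Gamma value at every reference point of a 1D dose profile."""
    # Assumption: the dose tolerance is a fraction of the maximum reference dose.
    dose_tol = dose_tol_fraction * max(reference)
    gammas = []
    for i, d_ref in enumerate(reference):
        best = math.inf
        for j, d_eval in enumerate(evaluated):
            dose_term = (d_eval - d_ref) / dose_tol
            dist_term = (j - i) * spacing_mm / dist_tol_mm
            best = min(best, math.hypot(dose_term, dist_term))
        gammas.append(best)
    return gammas


def pass_rate(gammas):
    """Fraction of points whose gamma value does not exceed 1."""
    return sum(g <= 1.0 for g in gammas) / len(gammas)
```

A call such as `gamma_index(ref, test, 0.5, 0.01, 1.0)` corresponds to a Γ(1%, 1 mm) criterion on a profile sampled every 0.5 mm. A point passes when some evaluated point lies close enough in both dose and position, so a small spatial shift in a steep gradient is not penalised as heavily as a plain dose difference would suggest.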
GPU computing and applications
See, Simon
2015-01-01
This book presents a collection of state of the art research on GPU Computing and Application. The major part of this book is selected from the work presented at the 2013 Symposium on GPU Computing and Applications held in Nanyang Technological University, Singapore (Oct 9, 2013). Three major domains of GPU application are covered in the book including (1) Engineering design and simulation; (2) Biomedical Sciences; and (3) Interactive & Digital Media. The book also addresses the fundamental issues in GPU computing with a focus on big data processing. Researchers and developers in GPU Computing and Applications will benefit from this book. Training professionals and educators can also benefit from this book to learn the possible application of GPU technology in various areas.
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
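One common way to expose the parallelism of the Ising model is a checkerboard decomposition of the lattice. The sketch below shows a serial Metropolis sweep organised that way. It is offered as an illustration of the idea, assuming coupling J = 1, periodic boundaries, and an even lattice size; it is not claimed to be the scheme used in the work summarised above, and the function names are hypothetical.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of the 2D Ising model on an even-sized lattice."""
    n = len(spins)
    for colour in (0, 1):
        # All four neighbours of a site belong to the other colour, so the
        # updates of one colour are mutually independent and could be run
        # in parallel, one thread per site.
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != colour:
                    continue
                neighbours = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                              + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
                delta_e = 2.0 * spins[i][j] * neighbours  # energy cost of a flip
                if delta_e <= 0.0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = -spins[i][j]


def magnetisation(spins):
    """Absolute magnetisation per site."""
    n = len(spins)
    return abs(sum(map(sum, spins))) / (n * n)
```

Starting from a fully aligned lattice at an inverse temperature well above the critical value, the magnetisation stays close to 1, because a single spin flip costs a large amount of energy and is rarely accepted.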
Vasan, S N Swetadri; Ionita, Ciprian N; Titus, A H; Cartwright, A N; Bednarek, D R; Rudin, S
2012-02-23
We present the image processing upgrades implemented on a Graphics Processing Unit (GPU) in the Control, Acquisition, Processing, and Image Display System (CAPIDS) for the custom Micro-Angiographic Fluoroscope (MAF) detector. Most of the image processing currently implemented in the CAPIDS system is pixel independent; that is, the operation on each pixel is the same and the operation on one does not depend upon the result from the operation on the other, allowing the entire image to be processed in parallel. GPU hardware was developed for this kind of massive parallel processing implementation. Thus for an algorithm which has a high amount of parallelism, a GPU implementation is much faster than a CPU implementation. The image processing algorithm upgrades implemented on the CAPIDS system include flat field correction, temporal filtering, image subtraction, roadmap mask generation and display window and leveling. A comparison between the previous and the upgraded version of CAPIDS has been presented, to demonstrate how the improvement is achieved. By performing the image processing on a GPU, significant improvements (with respect to timing or frame rate) have been achieved, including stable operation of the system at 30 fps during a fluoroscopy run, a DSA run, a roadmap procedure and automatic image windowing and leveling during each frame.
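Two of the pixel-independent operations listed in this abstract, flat field correction and temporal filtering, can be written so that each output pixel depends only on the corresponding input pixels. The formulas below are common textbook forms chosen for illustration, not the CAPIDS implementation: the gain normalisation, the recursive filter weight, and the function names are assumptions, and the flat frame is assumed to exceed the dark frame at every pixel.

```python
def flat_field_correct(raw, dark, flat):
    """Per-pixel gain correction: (raw - dark) / (flat - dark) * mean gain."""
    gains = [f - d for f_row, d_row in zip(flat, dark)
             for f, d in zip(f_row, d_row)]
    mean_gain = sum(gains) / len(gains)
    return [
        [(r - d) / (f - d) * mean_gain for r, d, f in zip(r_row, d_row, f_row)]
        for r_row, d_row, f_row in zip(raw, dark, flat)
    ]


def temporal_filter(previous, current, weight=0.25):
    """Recursive temporal filter: blend the new frame into the running output."""
    return [
        [weight * c + (1.0 - weight) * p for p, c in zip(p_row, c_row)]
        for p_row, c_row in zip(previous, current)
    ]
```

Neither function reads a neighbouring pixel or a previously computed output pixel, so the inner expressions could be evaluated for all pixels at once. That independence is what the abstract refers to when it says the entire image can be processed in parallel.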
Application of GPU to computational multiphase fluid dynamics
International Nuclear Information System (INIS)
Nagatake, T; Kunugi, T
2010-01-01
The MARS (Multi-interfaces Advection and Reconstruction Solver) [1] is one of the surface volume tracking methods for multi-phase flows. Nowadays, the performance of the GPU (Graphics Processing Unit) is much higher than that of the CPU (Central Processing Unit). In this study, the GPU was applied to MARS in order to accelerate the computation of multi-phase flows (GPU-MARS), and the performance of GPU-MARS was discussed. For the interface tracking method applied to a one-directional advection problem, the computation on the GPU (single GTX280) was around 4 times faster than on the CPU (Xeon 5040, 4 threads parallelized). For the Poisson solver using the algorithm developed in this study, the GPU was around 30 times faster than the CPU. Finally, it is confirmed that the GPU greatly accelerated the fluid flow computation (GPU-MARS) compared to the CPU. However, it is also found that the double-precision computation of the GPU must perform with very high precision.
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.
Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto
2015-02-13
In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units (GPUs). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the
MIGS-GPU: Microarray Image Gridding and Segmentation on the GPU.
Katsigiannis, Stamos; Zacharia, Eleni; Maroulis, Dimitris
2017-05-01
Complementary DNA (cDNA) microarray is a powerful tool for simultaneously studying the expression level of thousands of genes. Nevertheless, the analysis of microarray images remains an arduous and challenging task due to the poor quality of the images that often suffer from noise, artifacts, and uneven background. In this study, the MIGS-GPU [Microarray Image Gridding and Segmentation on Graphics Processing Unit (GPU)] software for gridding and segmenting microarray images is presented. MIGS-GPU's computations are performed on the GPU by means of the compute unified device architecture (CUDA) in order to achieve fast performance and increase the utilization of available system resources. Evaluation on both real and synthetic cDNA microarray images showed that MIGS-GPU provides better performance than state-of-the-art alternatives, while the proposed GPU implementation achieves significantly lower computational times compared to the respective CPU approaches. Consequently, MIGS-GPU can be an advantageous and useful tool for biomedical laboratories, offering a user-friendly interface that requires minimum input in order to run.
Bédorf, J.; Gaburov, E.; Portegies Zwart, S.
2012-07-01
We present a gravitational hierarchical N-body code that is designed to run efficiently on Graphics Processing Units (GPUs). All parts of the algorithm are executed on the GPU, which eliminates the need for data transfer between the Central Processing Unit (CPU) and the GPU. Our tests indicate that the gravitational tree-code outperforms tuned CPU code for all parts of the algorithm and show an overall performance improvement of more than a factor of 20, resulting in a processing rate of more than 2.8 million particles per second.
Vlasov on GPU (VOG project)
Directory of Open Access Journals (Sweden)
Mehrenberger M.
2013-12-01
This work concerns the numerical simulation of the Vlasov-Poisson equation using semi-Lagrangian methods on Graphics Processing Units (GPU). To accomplish this goal, modifications to traditional methods had to be implemented. First and foremost, a reformulation of semi-Lagrangian methods is performed, which enables us to rewrite the governing equations as a circulant matrix operating on the vector of unknowns. This product calculation can be performed efficiently using FFT routines. Nowadays the GPU is no longer limited to single precision; however, single precision may still be preferred with respect to performance and available memory. So, in order to be able to deal with single precision, a δf type method is adopted which only needs refinement in specialized areas of phase space but not throughout. Thus, a GPU Vlasov-Poisson solver can indeed perform high precision simulations (since it uses very high order of reconstruction and a large number of grid points in phase space). We show results for more academic test cases and also for physically relevant phenomena such as the bump on tail instability and the simulation of Kinetic Electrostatic Electron Nonlinear (KEEN) waves.
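The remark that a circulant matrix product can be evaluated with FFT routines can be illustrated with a small sketch. It is not the authors' solver: the function names are hypothetical, and a naive O(n^2) discrete Fourier transform stands in for a real FFT library so that the example needs only the Python standard library. It relies on the standard identity that multiplying by a circulant matrix with first column c is a circular convolution, which becomes an element-wise product in Fourier space.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]
    # The inverse transform carries the 1/n normalisation.
    return [v / n for v in out] if inverse else out


def circulant_matvec_direct(c, x):
    """Multiply the circulant matrix with first column c by x, entry by entry."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_fourier(c, x):
    """Same product via the convolution theorem: IDFT(DFT(c) * DFT(x))."""
    c_hat, x_hat = dft(c), dft(x)
    y = dft([a * b for a, b in zip(c_hat, x_hat)], inverse=True)
    # Inputs are real, so the imaginary parts are only rounding noise.
    return [v.real for v in y]
```

With a true FFT in place of `dft`, the cost of the product drops from O(n^2) to O(n log n), which is the efficiency gain the abstract refers to.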
GPU Accelerated Vector Median Filter
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n^2 vectors has to be compared with the other n^2 - 1 vectors in terms of distance. General-purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has to the best of our knowledge never been done before. The performance of the GPU-accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x improvement in performance of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimizations of the GPU algorithm.
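The all-pairs comparison described above can be sketched as a plain sequential reference, which also shows why the filter is expensive. This is not the CUDA implementation from the paper: the function names, the use of Euclidean distance, and the border handling (windows are clipped at the image edge) are all assumptions made for illustration.

```python
import math


def vector_median(vectors):
    """Return the vector with the smallest summed distance to all others."""
    best, best_cost = None, math.inf
    for v in vectors:
        # One distance per pair: n^2 vectors, each compared with the rest.
        cost = sum(math.dist(v, w) for w in vectors)
        if cost < best_cost:
            best, best_cost = v, cost
    return best


def vector_median_filter(image, n=3):
    """Apply an n x n vector median filter to a 2-D grid of colour tuples.

    Assumption: near the border, the window is clipped to the pixels that
    lie inside the image.
    """
    height, width = len(image), len(image[0])
    r = n // 2
    out = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - r), min(height, y + r + 1))
                for i in range(max(0, x - r), min(width, x + r + 1))
            ]
            out[y][x] = vector_median(window)
    return out
```

Each output pixel depends only on its own input window, so the two outer loops are independent across pixels; that independence is what makes the filter a natural fit for one-thread-per-pixel GPU execution.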
Dokken, Tor
2007-01-01
The evolution from single core CPUs to heterogeneous computing. Activities in SINTEF ICT related to heterogeneous computing within: geometry processing, linear algebra, partial differential equations, visualization, image processing. GPU-multicore programming.
GPU: the biggest key processor for AI and parallel processing
Baji, Toru
2017-07-01
Two types of processors exist in the market. One is the conventional CPU and the other is the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavy parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs started to adopt general-purpose cores, it was noticed that this architecture can be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs started to be widely used in workstations and supercomputers. Recently, two key technologies have been highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massively parallel operations to train many-layer neural networks. With the CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with the P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization, and path planning processing, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, which is one of the biggest commercial devices requiring state-of-the-art fabrication technology, will be introduced. An overview of key GPU-demanding applications like the ones described above will also be introduced.
SIFT implementation based on GPU
Jiang, Chao; Geng, Ze-xun; Wei, Xiao-feng; Shen, Chen
2013-08-01
Image matching is a core research topic of digital photogrammetry and computer vision. The SIFT (Scale-Invariant Feature Transform) algorithm is a feature matching algorithm based on local invariant features, proposed by Lowe in 1999. SIFT features are invariant to image rotation and scaling, and partially invariant to changes in 3D camera viewpoint and illumination. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. The algorithm is therefore widely used in image matching and 3D reconstruction based on stereo images. Traditional implementations and optimizations of the SIFT algorithm generally target the CPU. Because of the large number of extracted features (even a few objects can yield a large number of SIFT features), the high dimensionality of the feature vector (usually a 128-dimensional SIFT feature vector), and the complexity of the SIFT algorithm, processing on the CPU is slow and can hardly fulfil real-time requirements. The Programmable Graphics Processing Unit (PGPU) is commonly used in current computer graphics as a dedicated device for image processing. Experience from recent years shows that a high-performance GPU can achieve 10 times the single-precision floating-point performance of a contemporary high-performance desktop CPU, while its memory bandwidth is up to five times that of a desktop platform of the same period. For the same computing power, the cost and power consumption of the GPU should be less than those of a CPU-based system. At the same time, owing to the parallel nature of graphics rendering and image processing, GPU-accelerated image processing has become an efficient solution for algorithms with real-time requirements. In this paper, we implemented the algorithm in the OpenGL shading language and compared it to the results obtained on the CPU
GPU accelerated implementation of NCI calculations using promolecular density.
Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric
2017-05-30
The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which drastically reduces the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
GPU-based High-Performance Computing for Radiation Therapy
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B.
2014-01-01
Recent developments in radiotherapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous number of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this article, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. PMID:24486639
GPU-based high-performance computing for radiation therapy
International Nuclear Information System (INIS)
Jia, Xun; Jiang, Steve B; Ziegenhein, Peter
2014-01-01
Recent developments in radiotherapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of study has been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. (topical review)
GPU-based high-performance computing for radiation therapy
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B.
2014-02-01
Recent developments in radiotherapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of study has been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented.
GPU-based high-performance computing for radiation therapy.
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B
2014-02-21
Recent developments in radiotherapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of study has been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented.
DEFF Research Database (Denmark)
Yang, Po; Dong, Feng; Codreanu, Valeriu
2018-01-01
to the lack of specialist GPU (Graphics processing units) programming skills, the explosion of GPU power has not been fully utilized in general SME applications by inexperienced users. Also, existing automatic CPU-to-GPU code translators are mainly designed for research purposes with poor user interface...... design and hard-to-use. Little attentions have been paid to the applicability, usability and learnability of these tools for normal users. In this paper, we present an online automated CPU-to-GPU source translation system, (GPSME) for inexperienced users to utilize GPU capability in accelerating general...
GPU Pro 5 advanced rendering techniques
Engel, Wolfgang
2014-01-01
In GPU Pro5: Advanced Rendering Techniques, section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Marius Bjorge have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book covers rendering, lighting, effects in image space, mobile devices, 3D engine design, and compute. It explores rasterization of liquids, ray tracing of art assets that would otherwise be used in a rasterized engine, physically based area lights, volumetric light
GPU Acceleration of DSP for Communication Receivers.
Gunther, Jake; Gunther, Hyrum; Moon, Todd
2017-09-01
Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data-rate communication receivers, including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on an NVIDIA GeForce GTX 1080 graphics card.
GPU Pro 4 advanced rendering techniques
Engel, Wolfgang
2013-01-01
GPU Pro4: Advanced Rendering Techniques presents ready-to-use ideas and procedures that can help solve many of your day-to-day graphics programming challenges. Focusing on interactive media and games, the book covers up-to-date methods producing real-time graphics. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Sebastien St-Laurent have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book begins with discussions on the abi
Parallel GPU implementation of PWR reactor burnup
International Nuclear Information System (INIS)
Heimlich, A.; Silva, F.C.; Martinez, A.S.
2016-01-01
Highlights: • Three GPU algorithms are used to evaluate the burn-up in a PWR reactor. • They exhibit speed improvement exceeding 200 times over the sequential solver. • The C++ container is expandable to accept new nuclide chains. - Abstract: This paper surveys three methods, implemented for multi-core CPU and graphics processing unit (GPU), to evaluate the fuel burn-up in a pressurized light water nuclear reactor (PWR) using the solutions of a large system of coupled ordinary differential equations. The reactor physics simulation of a PWR spends a long execution time on burn-up calculations, so performance improvement using the GPU can lead to better core design and thus an extended fuel life cycle. The results of this study exhibit speed improvement exceeding 200 times over the sequential solver, within 1% accuracy.
GPU based acceleration of first principles calculation
International Nuclear Information System (INIS)
Tomono, H; Tsumuraya, K; Aoki, M; Iitaka, T
2010-01-01
We present Graphics Processing Unit (GPU) accelerated simulations of first principles electronic structure calculations. The FFT, which is the most time-consuming part, is accelerated about 10 times. As a result, the total computation time of a first principles calculation is reduced to 15 percent of that on the CPU.
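The two figures quoted above (a roughly 10-fold FFT speedup and a total time of 15 percent) can be related through Amdahl's law. The sketch below is only a consistency check under an assumption the abstract does not state, namely that the FFT is the only accelerated part and everything else runs at its original speed; the function names are hypothetical.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated s-fold."""
    return 1.0 / ((1.0 - p) + p / s)


def fraction_needed(total_ratio, s):
    """Fraction p of the original runtime that must be accelerated s-fold
    for the total runtime to fall to total_ratio of the original.

    Solves (1 - p) + p / s = total_ratio for p.
    """
    return (1.0 - total_ratio) / (1.0 - 1.0 / s)


# Figures from the abstract: FFT about 10x faster, total time 15 percent.
p = fraction_needed(0.15, 10.0)      # about 0.944
overall = amdahl_speedup(p, 10.0)    # 1 / 0.15, about 6.7
```

Under that assumption, the FFT would have to account for about 94 percent of the original CPU runtime, and the overall speedup is about 6.7 times rather than 10, because the untouched remainder limits the gain.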
Energy Technology Data Exchange (ETDEWEB)
Choi, Sunghoon, E-mail: choi.sh@yonsei.ac.kr [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Lee, Seungwan [Department of Radiological Science, College of Medical Science, Konyang University, 158 Gwanjeodong-ro, Daejeon, 308-812 (Korea, Republic of); Lee, Haenghwa [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Lee, Donghoon; Choi, Seungyeon [Department of Radiation Convergence Engineering, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Shin, Jungwook [LISTEM Corporation, 94 Donghwagongdan-ro, Munmak-eup, Wonju (Korea, Republic of); Seo, Chang-Woo [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Kim, Hee-Joung, E-mail: hjk1@yonsei.ac.kr [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Department of Radiation Convergence Engineering, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of)
2017-03-11
Digital tomosynthesis offers the advantage of low radiation doses compared to conventional computed tomography (CT) by utilizing small numbers of projections (~80) acquired over a limited angular range. It produces 3D volumetric data, although there are artifacts due to incomplete sampling. Based upon these characteristics, we developed a prototype digital tomosynthesis R/F system for applications in chest imaging. Our prototype chest digital tomosynthesis (CDT) R/F system contains an X-ray tube with high power R/F pulse generator, flat-panel detector, R/F table, electromechanical radiographic subsystems including a precise motor controller, and a reconstruction server. For image reconstruction, users select between analytic and iterative reconstruction methods. Our reconstructed images of Catphan700 and LUNGMAN phantoms clearly and rapidly described the internal structures of phantoms using graphics processing unit (GPU) programming. Contrast-to-noise ratio (CNR) values of the CTP682 module of Catphan700 were higher in images using a simultaneous algebraic reconstruction technique (SART) than in those using filtered back-projection (FBP) for all materials by factors of 2.60, 3.78, 5.50, 2.30, 3.70, and 2.52 for air, lung foam, low density polyethylene (LDPE), Delrin® (acetal homopolymer resin), bone 50% (hydroxyapatite), and Teflon, respectively. Total elapsed times for producing 3D volume were 2.92 s and 86.29 s on average for FBP and SART (20 iterations), respectively. The times required for reconstruction were clinically feasible. Moreover, the total radiation dose from our system (5.68 mGy) was lower than that of conventional chest CT scan. Consequently, our prototype tomosynthesis R/F system represents an important advance in digital tomosynthesis applications.
Zhu, Jinhan; Chen, Lixin; Chen, Along; Luo, Guangwen; Deng, Xiaowu; Liu, Xiaowei
2015-04-11
To use a graphic processing unit (GPU) calculation engine to implement a fast 3D pre-treatment dosimetric verification procedure based on an electronic portal imaging device (EPID). The GPU algorithm includes the deconvolution and convolution method for the fluence-map calculations, the collapsed-cone convolution/superposition (CCCS) algorithm for the 3D dose calculations and the 3D gamma evaluation calculations. The results of the GPU-based CCCS algorithm were compared to those of Monte Carlo simulations. The planned and EPID-based reconstructed dose distributions in overridden-to-water phantoms and the original patients were compared for 6 MV and 10 MV photon beams in intensity-modulated radiation therapy (IMRT) treatment plans based on dose differences and gamma analysis. The total single-field dose computation time was less than 8 s, and the gamma evaluation for a 0.1-cm grid resolution was completed in approximately 1 s. The results of the GPU-based CCCS algorithm exhibited good agreement with those of the Monte Carlo simulations. The gamma analysis indicated good agreement between the planned and reconstructed dose distributions for the treatment plans. For the target volume, the differences in the mean dose were less than 1.8%, and the differences in the maximum dose were less than 2.5%. For the critical organs, minor differences were observed between the reconstructed and planned doses. The GPU calculation engine was used to boost the speed of 3D dose and gamma evaluation calculations, thus offering the possibility of true real-time 3D dosimetric verification.
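For readers unfamiliar with the gamma evaluation mentioned above, a one-dimensional sketch of the commonly used gamma index follows. It is not the authors' 3D GPU implementation. The function name, the global normalisation to the maximum reference dose, and the example criteria in the usage note are assumptions; the abstract states only the 0.1-cm grid resolution.

```python
import math


def gamma_1d(ref, evl, spacing_cm, dta_cm, dd_frac):
    """Gamma index at every reference point of a 1-D dose profile.

    ref, evl   : reference and evaluated doses on the same grid
    spacing_cm : grid spacing
    dta_cm     : distance-to-agreement criterion
    dd_frac    : dose-difference criterion as a fraction of max(ref)
                 (global normalisation, an assumption of this sketch)
    """
    dose_tol = dd_frac * max(ref)
    gammas = []
    for i, d_ref in enumerate(ref):
        best = math.inf
        # Exhaustive search over all evaluated points; each reference point
        # is independent, which is what makes the task GPU-friendly.
        for j, d_evl in enumerate(evl):
            dist = (i - j) * spacing_cm
            g2 = (dist / dta_cm) ** 2 + ((d_evl - d_ref) / dose_tol) ** 2
            best = min(best, g2)
        gammas.append(math.sqrt(best))
    return gammas
```

A point is conventionally counted as passing when its gamma value is at most 1. For example, `gamma_1d(planned, reconstructed, 0.1, 0.3, 0.03)` would apply hypothetical 3% / 0.3 cm criteria on a 0.1-cm grid.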
GPU is in great jeopardy: PUC report details GPU's deteriorating financial position
International Nuclear Information System (INIS)
Anon.
1980-01-01
The General Public Utilities Corporation (GPU) faces bankruptcy or reorganization without Federal financial help and rate relief for the cleanup at Three Mile Island, but neither the Administration nor the Pennsylvania Public Utilities Commission is inclined to help. Bankruptcy will leave GPU's customers without power and will leave Unit 2 contaminated, making it unlikely that the courts will permit dissolution. The Nuclear Regulatory Commission's permission to restore Unit 1 could make financial recovery possible. Its reluctance to do so and its use of the psychological stress factor can be interpreted as applying a double standard to Babcock and Wilcox reactors
GPU-accelerated voxelwise hepatic perfusion quantification.
Wang, H; Cao, Y
2012-09-07
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using compute unified device architecture-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, nonlinear least-squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using an NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626 400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameter differences of less than 10^-6. The method will be useful for generating liver perfusion images in clinical settings.
GPU-Accelerated Voxelwise Hepatic Perfusion Quantification
Wang, H; Cao, Y
2012-01-01
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using an NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626400 voxels in a patient’s liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameter differences of less than 10^-6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
GPU-accelerated micromagnetic simulations using cloud computing
International Nuclear Information System (INIS)
Jermain, C.L.; Rowlands, G.E.; Buhrman, R.A.; Ralph, D.C.
2016-01-01
Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.
Architecting the Finite Element Method Pipeline for the GPU.
Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
2014-02-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.
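The solver stage described in this abstract, a preconditioned conjugate gradient iteration on the assembled linear system, can be illustrated with a minimal CPU-side sketch. This is not the authors' implementation: the geometry-informed AMG preconditioner is swapped for a plain Jacobi (diagonal) preconditioner, the matrix is a small dense list of lists rather than a sparse GPU structure, and the function name `jacobi_pcg` is hypothetical. It assumes a symmetric positive-definite matrix, as arises from an elliptic PDE.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.

    Assumption: a Jacobi preconditioner stands in for the AMG
    preconditioner mentioned in the abstract.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]   # M^-1 for M = diag(A)
    x = [0.0] * n
    r = list(b)                                    # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]     # preconditioned residual
    p = list(z)                                    # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                         # keeps directions A-conjugate
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Small SPD example: the right-hand side is A applied to [1, 2, 3].
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
solution = jacobi_pcg(A, [6.0, 10.0, 8.0])
```

Each iteration is dominated by one matrix-vector product plus a few vector updates, which is the part a GPU implementation would parallelize.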
Efficient implementation of MrBayes on multi-GPU.
Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-06-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)^3), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)^3 Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processing unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)^3 (aMCMCMC), of MrBayes (MC)^3 on the compute unified device architecture (CUDA). By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)^3 achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)^3 is dramatically faster than all the previous (MC)^3 algorithms and scales well to large GPU clusters.
Efficient GPU implementation for Particle in Cell algorithm
Energy Technology Data Exchange (ETDEWEB)
Joseph, Rejith George [University of Florida; Ravunnikutty, Girish [University of Florida; Ranka, Sanjay [ORNL; Klasky, Scott A [ORNL
2011-01-01
The particle-in-cell (PIC) method is widely used in plasma physics to study the trajectories of charged particles under electromagnetic fields. The PIC algorithm is computationally intensive and its time requirements are proportional to the number of charged particles involved in the simulation. The focus of the paper is to parallelize the PIC algorithm on a Graphics Processing Unit (GPU). We present several performance tradeoffs related to the small shared memory and atomic operations on the GPU to achieve high performance.
GPU Accelerated Ultrasonic Tomography Using Propagation and Back Propagation Method
2015-09-28
tomographic imaging algorithm using Graphic Processing Units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to develop...our parallelized algorithm since the CUDA model allows the user to interact with the GPU resources more efficiently than traditional Shader methods...
Fully 3D GPU PET reconstruction
Energy Technology Data Exchange (ETDEWEB)
Herraiz, J.L., E-mail: joaquin@nuclear.fis.ucm.es [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Espana, S. [Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA (United States); Cal-Gonzalez, J. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Vaquero, J.J. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Desco, M. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Unidad de Medicina y Cirugia Experimental, Hospital General Universitario Gregorio Maranon, Madrid (Spain); Udias, J.M. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain)
2011-08-21
Fully 3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Units (GPUs) have been proposed for many years as potential accelerators for complex scientific problems, but only with the recent advances in GPU programmability have the best available reconstruction codes started to be implemented to run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.
Parallel generation of architecture on the GPU
Steinberger, Markus
2014-05-01
In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.
Wong, Un-Hong; Aoki, Takayuki; Wong, Hon-Cheng
2014-07-01
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speed up the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to the low efficiency of its fragmented data exchange. Our approach makes 3D decomposition practical on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared to a common 2D-decomposition MPI-only implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200^3 grid points using 216 GPUs.
Fast CGH computation using S-LUT on GPU.
Pan, Yuechao; Xu, Xuewu; Solanki, Sanjeev; Liang, Xinan; Tanjung, Ridwan Bin Adrian; Tan, Chiwei; Chong, Tow-Chong
2009-10-12
In computation of full-parallax computer-generated hologram (CGH), balance between speed and memory usage is always the core of algorithm development. To solve the speed problem of coherent ray trace (CRT) algorithm and memory problem of look-up table (LUT) algorithm without sacrificing reconstructed object quality, we develop a novel algorithm with split look-up tables (S-LUT) and implement it on graphics processing unit (GPU). Our results show that S-LUT on GPU has the fastest speed among all the algorithms investigated in this paper, while still maintaining low memory usage. We also demonstrate high quality objects reconstructed from CGHs computed with S-LUT on GPU. The GPU implementation of our new algorithm may enable real-time and interactive holographic 3D display in the future.
An efficient spectral crystal plasticity solver for GPU architectures
Malahe, Michael
2018-03-01
We present a spectral crystal plasticity (CP) solver for graphics processing unit (GPU) architectures that achieves a tenfold increase in efficiency over prior GPU solvers. The approach makes use of a database containing a spectral decomposition of CP simulations performed using a conventional iterative solver over a parameter space of crystal orientations and applied velocity gradients. The key improvements in efficiency come from reducing global memory transactions, exposing more instruction-level parallelism, reducing integer instructions and performing fast range reductions on trigonometric arguments. The scheme also makes more efficient use of memory than prior work, allowing for larger problems to be solved on a single GPU. We illustrate these improvements with a simulation of 390 million crystal grains on a consumer-grade GPU, which executes at a rate of 2.72 s per strain step.
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second.
Accelerated GPU based SPECT Monte Carlo simulations.
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-07
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99mTc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational
Accelerated GPU based SPECT Monte Carlo simulations
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99m Tc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency
Survey of using GPU CUDA programming model in medical image analysis
Directory of Open Access Journals (Sweden)
T. Kalaiselvi
2017-01-01
With technological development in the medical industry, the volume of data to be processed is expanding rapidly, and computation time also increases due to factors such as 3D and 4D treatment planning, the increasing sophistication of MRI pulse sequences and the growing complexity of algorithms. The graphics processing unit (GPU) addresses these problems through features such as high computation throughput, high memory bandwidth, support for floating-point arithmetic and low cost. Compute unified device architecture (CUDA) is a popular GPU programming model introduced by NVIDIA for parallel computing. This review paper briefly discusses the need for GPU CUDA computing in medical image analysis. The GPU performance of existing algorithms is analyzed and the computational gain is discussed. A few open issues, hardware configurations and optimization principles of existing methods are discussed. The survey concludes with a few optimization techniques for medical imaging algorithms on the GPU. Finally, the limitations and future scope of GPU programming are discussed.
Chase, Patrick; Vondran, Gary
2011-01-01
Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
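For readers unfamiliar with the technique named in this abstract, the following is a minimal scalar sketch of 3D tetrahedral interpolation in a cubic lookup table. It is a plain CPU illustration, not the tuned GPU kernel the authors describe; the function name, the nested-list table layout and the [0, 1] input range are assumptions made for the example. A color conversion would apply the same weights to a vector of output channels.

```python
def tetrahedral_interpolate(lut, r, g, b):
    """Interpolate lut at (r, g, b), each input clamped to [0, 1].

    Assumed layout: lut[i][j][k] is the value at grid node (i, j, k)
    of a cube with n >= 2 nodes per axis.
    """
    n = len(lut)

    def split(v):
        # Map to grid coordinates, then split into cell index and fraction.
        x = min(max(v, 0.0), 1.0) * (n - 1)
        i = min(int(x), n - 2)
        return i, x - i

    (i, fr), (j, fg), (k, fb) = split(r), split(g), split(b)

    # Sorting the fractions in decreasing order picks one of the six
    # tetrahedra that tile the cell; the walk below visits its 4 vertices.
    order = sorted(((fr, 0), (fg, 1), (fb, 2)), key=lambda t: -t[0])
    node = [i, j, k]
    prev = lut[i][j][k]
    value = prev
    for frac, axis in order:
        node[axis] += 1
        cur = lut[node[0]][node[1]][node[2]]
        value += frac * (cur - prev)   # add the edge difference, weighted
        prev = cur
    return value

# Example table sampling the affine function 2r + 3g - b + 1 on a 3x3x3 grid.
n = 3
lut = [[[2 * i / (n - 1) + 3 * j / (n - 1) - k / (n - 1) + 1
         for k in range(n)] for j in range(n)] for i in range(n)]
sample = tetrahedral_interpolate(lut, 0.3, 0.8, 0.55)
```

Only four table nodes are read per lookup, compared with eight for trilinear interpolation, which is one reason the method suits memory-bound hardware.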
Parallel GPU implementation of iterative PCA algorithms.
Andrecut, M
2009-11-01
Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets, the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).
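The idea in this abstract, restoring orthogonality in an iterative PCA by Gram-Schmidt orthogonalization, can be sketched as below. This is an illustrative reading, not the published GS-PCA algorithm or its CUBLAS implementation: it runs a NIPALS-style power iteration on a deflated residual matrix and re-orthogonalizes each loading vector against the ones already found. The function names, the fixed iteration count and the start vector are assumptions, and the data are assumed to be column-centered.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalise(v, basis):
    """Gram-Schmidt step: remove components along basis, then normalise."""
    for b in basis:
        proj = dot(v, b)
        v = [vi - proj * bi for vi, bi in zip(v, b)]
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

def pca_with_gram_schmidt(X, n_components, n_iter=200):
    """Return loading vectors of column-centered data X (list of rows)."""
    R = [list(row) for row in X]          # residual, deflated per component
    n_rows, n_features = len(X), len(X[0])
    loadings = []
    for _component in range(n_components):
        # Arbitrary start vector (assumption), made orthogonal to earlier loadings.
        p = orthonormalise([1.0 + 0.1 * j for j in range(n_features)], loadings)
        for _step in range(n_iter):
            t = [dot(row, p) for row in R]                       # scores t = R p
            p = [sum(t[i] * R[i][j] for i in range(n_rows))      # p = R^T t
                 for j in range(n_features)]
            p = orthonormalise(p, loadings)                      # keep orthogonality
        t = [dot(row, p) for row in R]
        R = [[R[i][j] - t[i] * p[j] for j in range(n_features)]  # deflate
             for i in range(n_rows)]
        loadings.append(p)
    return loadings

# Centered toy data whose principal axes are (1,1,0)/sqrt(2) and (1,-1,0)/sqrt(2).
X = [[2, 2, 0], [-2, -2, 0], [1, -1, 0], [-1, 1, 0], [0, 0, 0.5], [0, 0, -0.5]]
first, second = pca_with_gram_schmidt(X, 2)
```

Without the `orthonormalise` call inside the loop, rounding errors left over from deflation can leak earlier components back into later ones, which is the loss of orthogonality the abstract attributes to NIPALS-PCA.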
Advantages of GPU technology in DFT calculations of intercalated graphene
International Nuclear Information System (INIS)
Pešić, J; Gajić, R
2014-01-01
Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since QE version 5.0, GPU support has been implemented as a plug-in component for the standard QE packages, which allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU.
Arefan, D; Talebpour, A; Ahmadinejhad, N; Kamali Asl, A
2015-06-01
Digital Breast Tomosynthesis (DBT) is a technology that creates three dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is on the order of seconds, Tomosynthesis systems can be used to perform Tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for Tomosynthesis mammography systems using a Graphics Processing Unit (GPU). First, projections of Tomosynthesis mammography were simulated. To produce the Tomosynthesis projections, a 3D breast phantom was designed from empirical data; it is based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the Graphics Processing Unit (GPU). The image reconstruction time was measured for both versions (CPU and GPU).
Energy Technology Data Exchange (ETDEWEB)
Almeida, Adino Americo Heimlich
2009-07-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in two typical problems of the nuclear area: neutron transport simulation using the Monte Carlo method, and solution of the heat equation in a two-dimensional domain by the finite difference method. To achieve this, we develop parallel algorithms for GPU and CPU in the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad core processors, without precision loss. (author)
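The second problem mentioned in this abstract, the heat equation on a two-dimensional domain solved by finite differences, can be illustrated with a short sketch. The abstract does not say which scheme was used; the explicit forward-time, centered-space update below, the fixed (Dirichlet) boundary values and all function names are assumptions chosen for simplicity. Every interior point is updated independently from the previous time level, which is what makes this kind of stencil a natural fit for one GPU thread per grid point.

```python
def heat_step(u, r):
    """One explicit time step; r = alpha * dt / dx**2, boundary rows/cols fixed."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Five-point Laplacian stencil built from the previous time level.
            lap = u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1] - 4.0 * u[i][j]
            new[i][j] = u[i][j] + r * lap
    return new

def solve_heat(u0, alpha, dx, dt, steps):
    """Advance the grid u0 by the given number of steps."""
    r = alpha * dt / (dx * dx)
    if r > 0.25:
        # Stability limit of the explicit scheme in two dimensions.
        raise ValueError("time step too large: need alpha*dt/dx^2 <= 0.25")
    u = [row[:] for row in u0]
    for _ in range(steps):
        u = heat_step(u, r)
    return u

# 5x5 plate: boundary held at 1.0, interior starts at 0.0 and warms up.
plate = [[1.0 if i in (0, 4) or j in (0, 4) else 0.0 for j in range(5)]
         for i in range(5)]
final = solve_heat(plate, alpha=1.0, dx=1.0, dt=0.25, steps=500)
```

With the boundary held at a constant temperature, the interior converges toward that same value, which gives a simple sanity check on the update.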
ALICE HLT high speed tracking on GPU
Gorbunov, Sergey; Aamodt, Kenneth; Alt, Torsten; Appelshauser, Harald; Arend, Andreas; Bach, Matthias; Becker, Bruce; Bottger, Stefan; Breitner, Timo; Busching, Henner; Chattopadhyay, Sukalyan; Cleymans, Jean; Cicalo, Corrado; Das, Indranil; Djuvsland, Oystein; Engel, Heiko; Erdal, Hege Austrheim; Fearick, Roger; Haaland, Oystein Senneset; Hille, Per Thomas; Kalcher, Sebastian; Kanaki, Kalliopi; Kebschull, Udo Wolfgang; Kisel, Ivan; Kretz, Matthias; Lara, Camillo; Lindal, Sven; Lindenstruth, Volker; Masoodi, Arshad Ahmad; Ovrebekk, Gaute; Panse, Ralf; Peschek, Jorg; Ploskon, Mateusz; Pocheptsov, Timur; Ram, Dinesh; Rascanu, Theodor; Richter, Matthias; Rohrich, Dieter; Ronchetti, Federico; Skaali, Bernhard; Smorholm, Olav; Stokkevag, Camilla; Steinbeck, Timm Morten; Szostak, Artur; Thader, Jochen; Tveter, Trine; Ullaland, Kjetil; Vilakazi, Zeblon; Weis, Robert; Yin, Zhong-Bao; Zelnicek, Pierre
2011-01-01
The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding to an input data stream of 30 GB/s. In order to fulfill the time requirements, a fast on-line tracker has been developed. The algorithm combines a Cellular Automaton method being used for a fast pattern recognition and the Kalman Filter method for fitting of found trajectories and for the final track selection. The tracker was adapted to run on Graphics Processing Units (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The implementation of the algorithm had to be adjusted at many points to allow for an efficient usage of the graphics cards. In particular, achieving a good overall workload for many processor cores, efficient transfer to and from the GPU, as well as optimized utilization of the different memories the GPU offers turned out to be cri...
Numerical simulation of lava flow using a GPU SPH model
Directory of Open Access Journals (Sweden)
Eugenio Rustico
2011-12-01
A smoothed particle hydrodynamics (SPH) method for lava-flow modeling was implemented on a graphical processing unit (GPU) using the compute unified device architecture (CUDA) developed by NVIDIA. This resulted in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with free-surface, non-Newtonian fluids, and with phase change. The entire SPH code has three main components, neighbor list construction, force computation, and integration of the equation of motion, and it is computed on the GPU, fully exploiting the computational power. The simulation speed achieved is one to two orders of magnitude faster than the equivalent central processing unit (CPU) code. This GPU implementation of SPH allows high resolution SPH modeling in hours and days, rather than in weeks and months, on inexpensive and readily available hardware.
GPU accelerated manifold correction method for spinning compact binaries
Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying
2018-04-01
The graphics processing unit (GPU) acceleration of the manifold correction algorithm based on the compute unified device architecture (CUDA) technology is designed to simulate the dynamic evolution of the Post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and the efficiency of parallel computation on GPU have been confirmed by various numerical experiments. The numerical comparisons show that the accuracy of the manifold correction method executed on the GPU agrees well with that of the code executed on the central processing unit (CPU) alone. The acceleration achieved on the GPU can be increased considerably through the use of shared memory and register optimization techniques without additional hardware costs, giving a speedup of nearly 13 times compared with the code executed on the CPU for a phase space scan (including 314 × 314 orbits). In addition, the GPU-accelerated manifold correction method is used to numerically study how dynamics are affected by the spin-induced quadrupole-monopole interaction for a black hole binary system.
High Performance GPU-Based Fourier Volume Rendering
Directory of Open Access Journals (Sweden)
Marwan Abdellah
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N^2 log N) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are O(N^3) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
High Performance GPU-Based Fourier Volume Rendering.
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N^2 log N) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are O(N^3) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
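The Fourier projection-slice theorem that this abstract relies on can be checked numerically in a reduced setting. The sketch below is a 2D analogue, not the 3D FVR pipeline: it shows that the zero-frequency row of the 2D spectrum of an image, transformed back, equals the attenuation-only projection obtained by summing along one axis. It uses a naive O(N^2) DFT from the standard library rather than an FFT, and all function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform of a sequence (inverse is scaled by 1/n)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def dft2(img):
    """2D DFT: transform each row, then each column. Result is F[k_row][k_col]."""
    n_rows, n_cols = len(img), len(img[0])
    rows = [dft(row) for row in img]
    cols = [dft([rows[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    return [[cols[c][k] for c in range(n_cols)] for k in range(n_rows)]

def projection_via_slice(img):
    """Project along the row axis by extracting the central spectral slice."""
    spectrum = dft2(img)
    central_slice = spectrum[0]          # zero frequency along the row axis
    return [v.real for v in dft(central_slice, inverse=True)]

def projection_direct(img):
    """Reference: sum the image along the row axis in the spatial domain."""
    return [sum(col) for col in zip(*img)]

image = [[1, 2, 0, 1],
         [0, 3, 1, 0],
         [2, 0, 0, 1],
         [1, 1, 4, 0]]
from_slice = projection_via_slice(image)
direct = projection_direct(image)
```

In FVR the expensive forward transform of the volume is done once, and each new view only needs a slice extraction and a lower-dimensional inverse transform, which is where the complexity advantage quoted in the abstract comes from.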
Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU
Directory of Open Access Journals (Sweden)
Zhou Zhe
2017-04-01
According to previous reports, information could be leaked from GPU memory; however, the security implications of such a threat were mostly overlooked, because only limited information could be indirectly extracted through side-channel attacks. In this paper, we propose a novel algorithm for recovering raw data directly from the GPU memory residues of many popular applications such as Google Chrome and Adobe PDF reader. Our algorithm enables harvesting highly sensitive information including credit card numbers and email contents from GPU memory residues. Evaluation results also indicate that nearly all GPU-accelerated applications are vulnerable to such attacks, and adversaries can launch attacks without requiring any special privileges both on traditional multi-user operating systems, and emerging cloud computing scenarios.
Incompressible SPH (ISPH) with fast Poisson solver on a GPU
Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.
2018-05-01
This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU) capable of simulating several millions of particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed with this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as 4 separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dambreaks (2-D and 3-D) which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dambreak simulation, GPU speed-ups of up to 10-18 times and 1.1-4.5 times are demonstrated compared to single-threaded and 16-threaded CPU run times, respectively.
Development of parallel GPU based algorithms for problems in nuclear area
International Nuclear Information System (INIS)
Almeida, Adino Americo Heimlich
2009-01-01
Graphics Processing Units (GPUs) are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general-purpose computing, their application has been extended to fields outside the scope of computer graphics. The main objective of this work is to evaluate the impact of using GPUs in two typical problems of the nuclear area: neutron transport simulation using the Monte Carlo method, and solution of the heat equation in a two-dimensional domain by the finite difference method. To achieve this, we develop parallel algorithms for GPU and CPU for the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad-core processors, without precision loss. (author)
A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.
Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang
2018-01-01
The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns to design efficient algorithms in computer science, which has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers develop their own efficient GPU implementations with less effort.
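For readers unfamiliar with QuickHull, the divide-and-conquer structure can be sketched sequentially as follows. This is a plain recursive CPU version written for illustration, with hypothetical function names; it is not the GPU implementation from the paper and does not reproduce the Tzeng and Owens paradigm. It only shows the split step (find the farthest point, recurse on the two outer subsets) that a GPU version would have to parallelize.

```python
def cross(o, a, b):
    """Twice the signed area of triangle o-a-b; positive if b lies left of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _hull_side(points, a, b):
    """Hull vertices strictly left of the directed line a->b, ordered from a to b."""
    left = [p for p in points if cross(a, b, p) > 0]
    if not left:
        return []
    # Divide: the point farthest from the line a->b is guaranteed to be on the hull.
    farthest = max(left, key=lambda p: cross(a, b, p))
    # Conquer: only points outside the triangle (a, farthest, b) can still matter.
    return _hull_side(left, a, farthest) + [farthest] + _hull_side(left, farthest, b)

def quickhull(points):
    """Convex hull in clockwise order, starting at the leftmost point."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]
    return [a] + _hull_side(pts, a, b) + [b] + _hull_side(pts, b, a)

# Square corners plus one interior point and one point on an edge.
sample = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = quickhull(sample)
```

The two recursive calls in `_hull_side` work on disjoint point subsets, which is the independence that a parallel D&C scheme can exploit.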
Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems
Directory of Open Access Journals (Sweden)
Yong Chen
2017-01-01
Static state security analysis (SSSA) is one of the most important computations to check whether a power system is in a normal and secure operating state. It is a challenge to satisfy real-time requirements with CPU-based concurrent methods due to the intensive computations. A sensitivity analysis-based method using a graphics processing unit (GPU) is proposed for power systems, which can reduce calculation time by 40% compared to the execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, a multifrontal method for sparse LU factorization is explored on GPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations can achieve a significant reduction in computation time.
GPU-based high performance Monte Carlo simulation in neutron transport
International Nuclear Information System (INIS)
Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A.
2009-01-01
Graphics Processing Units (GPUs) are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general-purpose computing, their application has been extended to fields outside the scope of computer graphics. The main objective of this work is to evaluate the impact of using GPUs in neutron transport simulation by the Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
GPU-based high performance Monte Carlo simulation in neutron transport
Energy Technology Data Exchange (ETDEWEB)
Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil). Lab. de Inteligencia Artificial Aplicada], e-mail: cmnap@ien.gov.br
2009-07-01
Graphics Processing Units (GPUs) are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general-purpose computing, their application has been extended to fields outside the scope of computer graphics. The main objective of this work is to evaluate the impact of using GPUs in neutron transport simulation by the Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
How General-Purpose can a GPU be?
Directory of Open Access Journals (Sweden)
Philip Machanick
2015-12-01
The use of graphics processing units (GPUs) in general-purpose computation (GPGPU) is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw from a range of single instruction multiple datastream (SIMD) architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD instruction sets has been applicable to a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. This paper proposes a reconceptualization of the GPU as a multicore design with minimal exotic modes of parallelism so as to make GPGPU truly general.
Accelerating Pseudo-Random Number Generator for MCNP on GPU
Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu
2010-09-01
Pseudo-random number generators (PRNGs) are used intensively in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) must have a long period, high quality, flexible jump-ahead, and sufficient speed. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPU) using the CUDA programming model. Results show that a speedup of 3.80 to 8.10 times is achieved compared with 4- to 6-core CPUs, and more than 679.18 million double-precision random numbers can be generated per second on the GPU.
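The "flexible jump" requirement mentioned above is what lets many parallel threads draw from non-overlapping parts of one random sequence. A minimal sketch of the idea follows, assuming a linear congruential generator; the abstract does not state which generator or constants the authors use, so the multiplier, increment, and modulus below are illustrative only, and the function names are hypothetical.

```python
# Illustrative 64-bit LCG constants; not taken from the paper or from MCNP.
MODULUS = 2 ** 64
MULTIPLIER = 6364136223846793005
INCREMENT = 1442695040888963407

def lcg_next(state):
    """One ordinary step of the generator."""
    return (MULTIPLIER * state + INCREMENT) % MODULUS

def lcg_jump(state, steps):
    """Advance `state` by `steps` draws in O(log steps) by composing affine maps."""
    acc_mul, acc_add = 1, 0                    # accumulated map, starts as identity
    cur_mul, cur_add = MULTIPLIER, INCREMENT   # map for 1, 2, 4, 8, ... steps
    while steps > 0:
        if steps & 1:
            # Compose the accumulated map with the current power-of-two map.
            acc_mul = (acc_mul * cur_mul) % MODULUS
            acc_add = (acc_add * cur_mul + cur_add) % MODULUS
        # Square the current map: applying it twice doubles the step count.
        cur_add = (cur_add * (cur_mul + 1)) % MODULUS
        cur_mul = (cur_mul * cur_mul) % MODULUS
        steps >>= 1
    return (acc_mul * state + acc_add) % MODULUS

def stream_seeds(seed, stream_count, stride):
    """Starting states for `stream_count` streams spaced `stride` draws apart."""
    return [lcg_jump(seed, i * stride) for i in range(stream_count)]

seeds = stream_seeds(12345, stream_count=4, stride=1000)
```

With this layout, stream i can be handed to thread i, and no two threads reuse the same numbers as long as each draws fewer than `stride` values.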
GPU-BSM: A GPU-Based Tool to Map Bisulfite-Treated Reads
Manconi, Andrea; Orro, Alessandro; Manca, Emanuele; Armano, Giuliano; Milanesi, Luciano
2014-01-01
Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads. PMID:24842718
GPU-BSM: a GPU-based tool to map bisulfite-treated reads.
Directory of Open Access Journals (Sweden)
Andrea Manconi
Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.
GPU's for event reconstruction in the FairRoot framework
International Nuclear Information System (INIS)
Al-Turany, M; Uhlig, F; Karabowicz, R
2010-01-01
FairRoot is the simulation and analysis framework used by the CBM and PANDA experiments at FAIR/GSI. The use of graphics processor units (GPUs) for event reconstruction in FairRoot will be presented. The fact that the CUDA (Nvidia's Compute Unified Device Architecture) development tools work alongside the conventional C/C++ compiler makes it possible to mix GPU code with general-purpose code for the host CPU; on this basis, some of the reconstruction tasks can be sent to the graphics cards. Moreover, tasks that run on the GPUs can also run in emulation mode on the host CPU, which has the advantage that the same code is used on both CPU and GPU.
Accelerating electron tomography reconstruction algorithm ICON with GPU.
Chen, Yu; Wang, Zihao; Zhang, Jingrong; Li, Lun; Wan, Xiaohua; Sun, Fei; Zhang, Fa
2017-01-01
Electron tomography (ET) plays an important role in studying in situ cell ultrastructure in three-dimensional space. Due to limited tilt angles, ET reconstruction always suffers from the "missing wedge" problem. With a validation procedure, iterative compressed-sensing optimized NUFFT reconstruction (ICON) demonstrates its power in the restoration of validated missing information for low SNR biological ET dataset. However, the huge computational demand has become a major problem for the application of ICON. In this work, we analyzed the framework of ICON and classified the operations of major steps of ICON reconstruction into three types. Accordingly, we designed parallel strategies and implemented them on graphics processing units (GPU) to generate a parallel program ICON-GPU. With high accuracy, ICON-GPU has a great acceleration compared to its CPU version, up to 83.7×, greatly relieving ICON's dependence on computing resource.
gPGA: GPU Accelerated Population Genetics Analyses.
Directory of Open Access Journals (Sweden)
Chunbao Zhou
The isolation with migration (IM) model is important for studies in population genetics and phylogeography. The IM program applies the IM model to genetic data drawn from a pair of closely related populations or species based on Markov chain Monte Carlo (MCMC) simulations of gene genealogies. However, the computational burden of the IM program has placed limits on its application. With its strong computational power, the Graphics Processing Unit (GPU) has been widely used in many fields. In this article, we present an effective implementation of the IM program on one GPU based on the Compute Unified Device Architecture (CUDA), which we call gPGA. Compared with the IM program, gPGA can achieve up to 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA.
Medical image processing on the GPU - past, present and future.
Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M
2013-12-01
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges. Copyright © 2013 Elsevier B.V. All rights reserved.
Accelerated ray tracing for radiotherapy dose calculations on a GPU
de Greef, M.; Crezee, J.; van Eijk, J. C.; Pool, R.; Bel, A.
2009-01-01
PURPOSE: The graphical processing unit (GPU) on modern graphics cards offers the possibility of accelerating arithmetically intensive tasks. By splitting the work into a large number of independent jobs, order-of-magnitude speedups are reported. In this article, the possible speedup of PLATO's ray
GPU-based discrete element rigid body transport
CSIR Research Space (South Africa)
Govender, Nicolin
2013-08-01
...granular flows, is described. The code under development, BLAZE-DEM, is designed for Graphics Processor Units to take advantage of the speed and low cost driven by the virtual environment and animation industry. It uses a spatial hash and GPU parallelism...
GPU-Boosted Camera-Only Indoor Localization
DEFF Research Database (Denmark)
Özkil, Ali Gürcan; Fan, Zhun; Kristensen, Jens Klæstrup
relies on local image features detection, description and matching; by parallelizing these computationally intensive tasks on the graphical processing unit (GPU), it is possible to do online localization using a Topometric Appearance Map. The method is developed as an integral part of a mobile service...
Application of GPU to Multi-interfaces Advection and Reconstruction Solver (MARS)
International Nuclear Information System (INIS)
Nagatake, Taku; Takase, Kazuyuki; Kunugi, Tomoaki
2010-01-01
In nuclear engineering, high-performance computer systems are necessary to perform large-scale computations. Recently, the Graphics Processing Unit (GPU) has been developed as a rendering system to reduce the load on the Central Processing Unit (CPU). In graphics processing, high-performance computing is needed to render high-quality 3D objects in video games; the GPU therefore consists of many processing units and has a wide memory bandwidth. In this study, the Multi-interfaces Advection and Reconstruction Solver (MARS), which is one of the interface volume tracking methods for multi-phase flows, was run on a GPU. Multi-phase flow computation is very important for nuclear reactors and other engineering fields. MARS consists of two computing parts: the interface tracking part and the fluid motion computing part. In the interface tracking part, the GPU (GTX280) was 6 times faster than the CPU (Dual-Xeon 5040), and in the fluid motion computing part the Poisson solver on the GPU (GTX285) was 22 times faster than on the CPU (Core i7). For the dam-breaking problem, the GPU-MARS result differed slightly from the experimental result. Because GPU-MARS was developed on a single-precision GPU, accumulated round-off error may be the cause. (author)
GPU Computing to Improve Game Engine Performance
Directory of Open Access Journals (Sweden)
Abu Asaduzzaman
2014-07-01
Although the graphics processing unit (GPU) was originally designed to accelerate the image creation for output to display, today’s general purpose GPU (GPGPU) computing offers unprecedented performance by offloading computing-intensive portions of the application to the GPGPU, while running the remainder of the code on the central processing unit (CPU). The highly parallel structure of a many-core GPGPU can process large blocks of data faster using multithreaded concurrent processing. A game engine has many “components” and multithreading can be used to implement their parallelism. However, effective implementation of multithreading in a multicore processor has challenges, such as data and task parallelism. In this paper, we investigate the impact of using a GPGPU with a CPU to design high-performance game engines. First, we implement a separable convolution filter (heavily used in image processing) with the GPGPU. Then, we implement a multiobject interactive game console in an eight-core workstation using a multithreaded asynchronous model (MAM), a multithreaded synchronous model (MSM), and an MSM with data parallelism (MSMDP). According to the experimental results, speedup of about 61x and 5x is achieved due to GPGPU and MSMDP implementation, respectively. Therefore, GPGPU-assisted parallel computing has the potential to improve multithreaded game engine performance.
Evolution of GPU nuclear's training program
International Nuclear Information System (INIS)
Long, R.L.; Coe, R.P.
1987-01-01
GPU Nuclear Corporation (GPUN) manages the operators of Three Mile Island Unit 1 and Oyster Creek Nuclear Generating Stations and the recovery activities at the Three Mile Island Unit 2 plant. From the time it was formed in January 1980 GPUN emphasized the use of behavioral learning objectives as the basis for all its training programs. This paper describes the evolution to a formalized performance based Training System Development (TSD) Process. The Training and Education Department staff increased from 10 in 1979 to the current 120 dedicated professionals, with a corresponding increase in facilities and acquisition of sophisticated Basic Principles Training Simulators and a Three Mile Island Unit 1 control Room Replica Simulator. The impact of these developments and achievement of full INPO accreditation are discussed and related to plant performance improvements
GPU-Accelerated Adjoint Algorithmic Differentiation.
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
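The "tape" described in this abstract can be illustrated with a very small reverse-mode differentiation sketch. It is a scalar, CPU-only toy with hypothetical class and function names, not the authors' AAD software, and it omits the vector and matrix intrinsics that the paper uses to keep the tape small. Each arithmetic operation appends its local partial derivatives to the tape, and a reverse sweep then accumulates the adjoints.

```python
import math

class Tape:
    """Records, for each computed value, its inputs and local partial derivatives."""
    def __init__(self):
        self.entries = []  # entries[i] = list of (input_index, partial)

    def new_variable(self, value, parents=()):
        self.entries.append(list(parents))
        return Var(self, len(self.entries) - 1, value)

class Var:
    """A value together with its position on the tape."""
    def __init__(self, tape, index, value):
        self.tape, self.index, self.value = tape, index, value

    def __add__(self, other):
        return self.tape.new_variable(
            self.value + other.value,
            [(self.index, 1.0), (other.index, 1.0)])

    def __mul__(self, other):
        return self.tape.new_variable(
            self.value * other.value,
            [(self.index, other.value), (other.index, self.value)])

def sin(x):
    """Taped sine: the local partial derivative is cos(x)."""
    return x.tape.new_variable(math.sin(x.value), [(x.index, math.cos(x.value))])

def gradient(tape, output):
    """Reverse sweep: propagate adjoints from the output back to every input."""
    adjoints = [0.0] * len(tape.entries)
    adjoints[output.index] = 1.0
    for index in range(output.index, -1, -1):
        for parent, partial in tape.entries[index]:
            adjoints[parent] += partial * adjoints[index]
    return adjoints

tape = Tape()
x = tape.new_variable(2.0)
y = tape.new_variable(3.0)
cost = x * y + sin(x)  # f(x, y) = x*y + sin(x)
adjoints = gradient(tape, cost)
df_dx, df_dy = adjoints[x.index], adjoints[y.index]
```

Because every scalar operation adds a tape entry, the tape grows with the operation count; recording one entry per vector or matrix operation instead, as the abstract describes, is what reduces memory use.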
GPU-accelerated adjoint algorithmic differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
Parallel implementation of DNA sequences matching algorithms using PWM on GPU architecture.
Sharma, Rahul; Gupta, Nitin; Narang, Vipin; Mittal, Ankush
2011-01-01
Positional Weight Matrices (PWMs) are widely used in the representation and detection of Transcription Factor Binding Sites (TFBSs) on DNA. We implement an online PWM search algorithm on a parallel architecture. Large PWM data sets can be processed in parallel on Graphics Processing Unit (GPU) systems, which helps to match sequences at a faster rate. Our method makes extensive use of the highly multithreaded architecture and shared memory of a multi-core GPU. Efficient use of shared memory is required to optimise parallel reduction in CUDA. Our optimised method has a speedup of 230-280x over a linear implementation on a GeForce GTX 280 GPU.
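A PWM search scores every window of a DNA sequence against the matrix and reports windows above a threshold. The following sequential sketch uses a hypothetical 3-column matrix and a made-up sequence and threshold; it is not the authors' CUDA code. It is included only to show why the problem parallelizes well: each window's score depends on that window alone, so one GPU thread per window is a natural mapping, and the per-window sum is the kind of quantity a shared-memory reduction can compute.

```python
def pwm_scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        # Score of one window: sum of the matrix entries for its bases.
        score = sum(pwm[offset][sequence[start + offset]] for offset in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical 3-column scoring matrix that favours the motif "ACG".
pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -2},
    {"A": -1, "C": 2, "G": -2, "T": -1},
    {"A": -2, "C": -1, "G": 2, "T": -1},
]

hits = pwm_scan("TTACGACGT", pwm, threshold=6)
```

Here the motif occurs at offsets 2 and 5 of the made-up sequence, and both windows reach the maximum score of 6.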
Parallel Computer System for 3D Visualization Stereo on GPU
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on Graphics Processing Units (GPUs) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of the intersections of traced rays with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. A generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of the size and configuration of the Compute Unified Device Architecture (CUDA) computational network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as an optimized configuration of the computing CUDA network.
GPU-completeness: theory and implications
Lin, I.-Jong
2011-01-01
This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithm as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate this algorithmic space and the imaging algorithms space. This concept is based upon our experience in the print production area where GPUs (Graphic Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe
Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card
Jiang, Jinpeng; Zhu, Peimin
2018-05-01
Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for the high-performance computing (HPC). To reduce the long computation time, we design and implement the GPU-based 2D elastic FWI (EFWI) in time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of relatively small global memory on GPU, the boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion increases the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain the accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling, and about 12 times speedup in gradient calculation, compared with the eight-core CPU implementations optimized by OpenMP. The test results from the GPU implementations are verified to have enough accuracy by comparing the results obtained from the CPU implementations.
GPU-accelerated Gibbs ensemble Monte Carlo simulations of Lennard-Jonesium
Mick, Jason; Hailat, Eyad; Russo, Vincent; Rushaidat, Kamel; Schwiebert, Loren; Potoff, Jeffrey
2013-12-01
This work describes an implementation of canonical and Gibbs ensemble Monte Carlo simulations on graphics processing units (GPUs). The pair-wise energy calculations, which consume the majority of the computational effort, are parallelized using the energetic decomposition algorithm. While energetic decomposition is relatively inefficient for traditional CPU-bound codes, the algorithm is ideally suited to the architecture of the GPU. The performance of the CPU and GPU codes is assessed for a variety of CPU and GPU combinations for systems containing between 512 and 131,072 particles. For a system of 131,072 particles, the GPU-enabled canonical and Gibbs ensemble codes were 10.3 and 29.1 times faster (GTX 480 GPU vs. i5-2500K CPU), respectively, than an optimized serial CPU-bound code. Due to overhead from memory transfers from system RAM to the GPU, the CPU code was slightly faster than the GPU code for simulations containing fewer than 600 particles. The critical temperature Tc∗=1.312(2) and density ρc∗=0.316(3) were determined for the tail corrected Lennard-Jones potential from simulations of 10,000 particle systems, and found to be in exact agreement with prior mixed field finite-size scaling calculations [J.J. Potoff, A.Z. Panagiotopoulos, J. Chem. Phys. 109 (1998) 10914].
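As a rough illustration of the pair-wise energy calculation that dominates such simulations, the following Python sketch sums the 12-6 Lennard-Jones potential over all unique particle pairs in reduced units (epsilon = sigma = 1). It is a serial reference under stated assumptions, not the authors' energetic decomposition code: the function names are hypothetical, and periodic boundaries, cutoffs, and tail corrections are deliberately left out. On a GPU, the pair terms would typically be spread across threads and then reduced to a single sum.

```python
import itertools


def lj_pair_energy(r2, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones energy for a squared separation r2."""
    sr6 = (sigma * sigma / r2) ** 3          # (sigma / r)^6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_lj_energy(positions):
    """Sum the pair energy over every unique pair (no cutoff, no periodic box)."""
    energy = 0.0
    for (x1, y1, z1), (x2, y2, z2) in itertools.combinations(positions, 2):
        r2 = (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        energy += lj_pair_energy(r2)         # each term is independent of the others
    return energy
```

Because every pair term is independent, the loop body maps naturally onto one GPU thread per pair (or per block of pairs), which is the property the abstract relies on.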
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.
Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan
2012-01-01
Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperative execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
2012-01-01
Background: Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. Results: In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperative execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Conclusions: Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
Accelerating the XGBoost algorithm using GPU computing
Directory of Open Access Journals (Sweden)
Rory Mitchell
2017-07-01
We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU) and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4 core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.
Travel Software using GPU Hardware
Szalwinski, Chris M; Dimov, Veliko Atanasov; CERN. Geneva. ATS Department
2015-01-01
Travel is the main multi-particle tracking code being used at CERN for the beam dynamics calculations through hadron and ion linear accelerators. It uses two routines for the calculation of space charge forces, namely, rings of charges and point-to-point. This report presents the studies to improve the performance of Travel using GPU hardware. The studies showed that the performance of Travel with the point-to-point simulations of space-charge effects can be sped up at least 72 times using current GPU hardware. Simple recompilation of the source code using an Intel compiler can improve performance at least 4 times without GPU support. The limited memory of the GPU is the bottleneck. Two algorithms were investigated on this point: repeated computation and tiling. The repeated computation algorithm is simpler and is the currently recommended solution. The tiling algorithm was more complicated and degraded performance. Both build and test instructions for the parallelized version of the software are inclu...
Considerations for GPU SEE Testing
Wyrwas, Edward J.
2017-01-01
This presentation will discuss the considerations an engineer should take to perform Single Event Effects (SEE) testing on GPU devices. Notable topics will include setup complexity, architecture insight which permits cross platform normalization, acquiring a reasonable detail of information from the test suite, and a few lessons learned from preliminary testing.
A survey of GPU-based medical image computing techniques
Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming
2012-01-01
Medical imaging currently plays a crucial role across the full range of clinical applications, from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets to process in practical clinical applications. With the rapidly improving performance of graphics processors, improved programming support, and excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical image applications. The major purpose of this survey is to provide a comprehensive reference source for newcomers and researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely, segmentation, registration and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine. PMID:23256080
Wu, Xin; Koslowski, Axel; Thiel, Walter
2012-07-10
In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
Implementation of Multipattern String Matching Accelerated with GPU for Intrusion Detection System
Nehemia, Rangga; Lim, Charles; Galinium, Maulahikmah; Rinaldi Widianto, Ahmad
2017-04-01
As Internet-related security threats continue to increase in volume and sophistication, existing Intrusion Detection Systems (IDS) are being challenged to keep pace with current Internet development. Multi Pattern String Matching algorithms accelerated with a Graphics Processing Unit (GPU) are being utilized to improve the packet scanning performance of the IDS. This paper implements a Multi Pattern String Matching algorithm, also called Parallel Failureless Aho-Corasick, accelerated with a GPU to improve the performance of an IDS. The OpenCL library is used to allow the IDS to support various GPUs, including popular GPUs such as those from NVIDIA and AMD, which were used in our research. The experimental results show that applying Multi Pattern String Matching on a GPU-accelerated platform provides a speedup of up to 141% in terms of throughput compared to the previous research.
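The failureless idea can be sketched in a few lines of Python. This is a serial illustration under stated assumptions, not the paper's OpenCL implementation: the names `build_trie` and `pfac_scan` are hypothetical, and the outer loop over start positions stands in for the GPU threads, each of which would walk a trie without failure links and stop at the first mismatch.

```python
def build_trie(patterns):
    """Build a plain trie (no failure links) as parallel lists of nodes."""
    children = [{}]        # children[state][char] -> next state
    accepts = [None]       # accepts[state] -> pattern that ends at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in children[state]:
                children.append({})
                accepts.append(None)
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        accepts[state] = pattern
    return children, accepts


def pfac_scan(text, patterns):
    """Return (start, pattern) matches; every start position is independent."""
    children, accepts = build_trie(patterns)
    matches = []
    for start in range(len(text)):          # one GPU thread per start position
        state = 0
        for ch in text[start:]:
            state = children[state].get(ch)
            if state is None:               # no failure transition: this walk ends
                break
            if accepts[state] is not None:
                matches.append((start, accepts[state]))
    return matches
```

For example, `pfac_scan("ushers", ["he", "she", "his", "hers"])` reports `she` starting at index 1 and both `he` and `hers` starting at index 2. Dropping the failure links makes each walk short and free of shared state, which is what makes the per-position work easy to parallelize.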
Directory of Open Access Journals (Sweden)
G Boroni
2017-03-01
The Lattice Boltzmann Method (LBM) has shown great potential in fluid simulations, but performance issues and difficulties in managing complex boundary conditions have hindered a wider application. The advent of Graphics Processing Unit (GPU) computing offered a possible solution for the performance issue, and methods like the Immersed Boundary (IB) algorithm proved to be a flexible solution to boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation on GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data, showing good agreement and improved computational time. Substantial reductions in computation time were achieved: the GPU implementation ran about two orders of magnitude faster than the same model on a CPU.
GPU based numerical simulation of core shooting process
Directory of Open Access Journals (Sweden)
Yi-zhong Zhang
2017-11-01
Core shooting process is the most widely used technique to make sand cores and it plays an important role in the quality of sand cores. Although numerical simulation can hopefully optimize the core shooting process, research on numerical simulation of the core shooting process is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and achieved good agreements with in-situ experiments. To match the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve the calculation efficiency. The parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform can significantly decrease computing time by multi-threaded GPU. In this work, the program accelerated by CUDA parallelization method was developed and the accuracy of the calculations was ensured by comparing with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm were discussed. The simulation result of a sand core test-piece indicated the improvement of the calculation efficiency by GPU. The developed program has also been validated by in-situ experiments with a transparent core-box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation result was still quite consistent with experimental data. The GPU parallelization method can successfully solve the problem of low computational efficiency of the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.
GPU-based prompt gamma ray imaging from boron neutron capture therapy
International Nuclear Information System (INIS)
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum
2015-01-01
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations
GPU-based prompt gamma ray imaging from boron neutron capture therapy.
Yoon, Do-Kun; Jung, Joo-Young; Jo Hong, Key; Sil Lee, Keum; Suk Suh, Tae
2015-01-01
The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
Frame-to-frame coherent GPU splatting
Grau Carrion, Sergi; Tost Pardell, Daniela
2008-01-01
Recently, several improvements of the classical splatting volume rendering technique have been proposed, specifically, GPU implementations that considerably speed it up. However, splatting time-varying data, even using the GPU, is still slow and not very suitable for interactive data exploration through time. In this paper, we propose a GPU-based view-aligned splatting algorithm that exploits frame-to-frame coherence to render time-varying volume datasets. Our method run-length enc...
Directory of Open Access Journals (Sweden)
Christley Scott
2010-08-01
Background: Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results: We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions: We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a
Quantifying NUMA and Contention Effects in Multi-GPU Systems
Energy Technology Data Exchange (ETDEWEB)
Spafford, Kyle L [ORNL; Meredith, Jeremy S [ORNL; Vetter, Jeffrey S [ORNL
2011-01-01
As system architects strive for increased density and power efficiency, the traditional compute node is being augmented with an increasing number of graphics processing units (GPUs). The integration of multiple GPUs per node introduces complex performance phenomena including non-uniform memory access (NUMA) and contention for shared system resources. Utilizing the Keeneland system, this paper quantifies these effects and presents some guidance on programming strategies to maximize performance in multi-GPU environments.
Proton Testing of nVidia GTX 1050 GPU
Wyrwas, E. J.
2017-01-01
Single-Event Effects (SEE) testing was conducted on the nVidia GTX 1050 Graphics Processor Unit (GPU), herein referred to as the device under test (DUT). Testing was conducted at Massachusetts General Hospital's (MGH) Francis H. Burr Proton Therapy Center on April 9th, 2017 using 200-MeV protons. This testing trip was intended to provide a baseline assessment of the radiation susceptibility of the DUT, as no previous testing had been conducted on this component.
A GPU Accelerated Spring Mass System for Surgical Simulation
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
There is a growing demand for surgical simulators to do fast and precise calculations of tissue deformation to simulate increasingly complex morphology in real-time. Unfortunately, even fast spring-mass based systems have slow convergence rates for large models. This paper presents a method to accelerate computation of a spring-mass system in order to simulate a complex organ such as the heart. This acceleration is achieved by taking advantage of modern graphics processing units (GPU).
Basket Option Pricing Using GP-GPU Hardware Acceleration
Douglas, Craig C.
2010-08-01
We introduce a basket option pricing problem arising in financial mathematics. We discretize the problem based on the alternating direction implicit (ADI) method, and parallel cyclic reduction is applied to solve the set of tridiagonal matrices generated by the ADI method. To reduce the computational time of the problem, a general purpose graphics processing unit (GP-GPU) environment is considered. Numerical results confirm the convergence and efficiency of the proposed method. © 2010 IEEE.
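For readers unfamiliar with parallel cyclic reduction, a minimal serial Python sketch of the technique for one tridiagonal system may help. It is an illustration under stated assumptions, not the authors' GP-GPU code: the function name is hypothetical, no pivoting is performed (the pivots are assumed to stay nonzero, as for diagonally dominant systems), and the inner loop over rows stands in for the GPU threads that would update every row simultaneously.

```python
def parallel_cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[-1] = 0."""
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):                  # rows are independent: one thread each
            lo, hi = i - stride, i + stride
            nb[i], nd[i] = b[i], d[i]
            if lo >= 0:                     # eliminate the coupling to x[lo]
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:                      # eliminate the coupling to x[hi]
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2                         # couplings now reach twice as far
    # Once the stride covers the system, every row is decoupled.
    return [d[i] / b[i] for i in range(n)]
```

Each pass doubles the distance between coupled unknowns, so the system decouples after about log2(n) passes. In an ADI scheme, many such systems (one per grid line) can be solved concurrently, which adds a second level of parallelism.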
GPU PRO 3 Advanced rendering techniques
Engel, Wolfgang
2012-01-01
GPU Pro3, the third volume in the GPU Pro book series, offers practical tips and techniques for creating real-time graphics that are useful to beginners and seasoned game and graphics programmers alike. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Wessam Bahnassi, and Sebastien St-Laurent have once again brought together a high-quality collection of cutting-edge techniques for advanced GPU programming. With contributions by more than 50 experts, GPU Pro3: Advanced Rendering Techniques covers battle-tested tips and tricks for creating interesting geometry, realistic sha
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU
Directory of Open Access Journals (Sweden)
Arefan D
2015-06-01
Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is on the order of seconds, Tomosynthesis systems can be used to perform Tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for Tomosynthesis mammography systems using a Graphics Processing Unit (GPU). First, projections of Tomosynthesis mammography were simulated. To produce the Tomosynthesis projections, a 3D breast phantom was designed from empirical data, based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the Graphics Processing Unit (GPU). The image reconstruction time was measured for both versions (CPU and GPU).
Hennessey, Susan Marie; Friend, Julie; Elander, Richard T; Tucker, III, Melvin P
2013-05-21
A method is provided for producing an improved pretreated biomass product for use in saccharification followed by fermentation to produce a target chemical that includes removal of saccharification and/or fermentation inhibitors from the pretreated biomass product. Specifically, the pretreated biomass product derived from using the present method has fewer inhibitors of saccharification and/or fermentation without a loss in sugar content.
Fast distributed large-pixel-count hologram computation using a GPU cluster.
Pan, Yuechao; Xu, Xuewu; Liang, Xinan
2013-09-10
Large-pixel-count holograms are one essential part for big size holographic three-dimensional (3D) display, but the generation of such holograms is computationally demanding. In order to address this issue, we have built a graphics processing unit (GPU) cluster with 32.5 Tflop/s computing power and implemented distributed hologram computation on it with speed improvement techniques, such as shared memory on GPU, GPU level adaptive load balancing, and node level load distribution. Using these speed improvement techniques on the GPU cluster, we have achieved 71.4 times computation speed increase for 186M-pixel holograms. Furthermore, we have used the approaches of diffraction limits and subdivision of holograms to overcome the GPU memory limit in computing large-pixel-count holograms. 745M-pixel and 1.80G-pixel holograms were computed in 343 and 3326 s, respectively, for more than 2 million object points with RGB colors. Color 3D objects with 1.02M points were successfully reconstructed from 186M-pixel hologram computed in 8.82 s with all the above three speed improvement techniques. It is shown that distributed hologram computation using a GPU cluster is a promising approach to increase the computation speed of large-pixel-count holograms for large size holographic display.
A real-time spike sorting method based on the embedded GPU.
Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng
2017-07-01
Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. The graphics processing unit (GPU), with its high parallel computing capability, might provide an alternative solution for the increasing real-time computational demands of spike sorting. This study reports a method of real-time spike sorting using the compute unified device architecture (CUDA), implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized for the thread and memory model of the GPU. Our results showed that the GPU-based classifier on TK1 is 37.92 times faster than the MATLAB-based classifier on a PC, while the two achieved the same accuracy. The high-performance computing features of the embedded GPU demonstrated in our study suggest that the embedded GPU provides a promising platform for real-time neural signal processing.
GPU-accelerated Monte Carlo convolution/superposition implementation for dose calculation.
Zhou, Bo; Yu, Cedric X; Chen, Danny Z; Hu, X Sharon
2010-11-01
Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors' GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article.
GPU Based Software Correlators - Perspectives for VLBI2010
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm.
Tian, Zhen; Peng, Fei; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B
2015-06-01
Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU's relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is then used to validate the
The experience of GPU calculations at Lunarc
Sjöström, Anders; Lindemann, Jonas; Church, Ross
2011-09-01
To meet the ever-increasing demand for computational speed and the use of ever larger datasets, multi-GPU installations look very tempting. Lunarc and the Theoretical Astrophysics group at Lund Observatory collaborate on a pilot project to evaluate and utilize multi-GPU architectures for scientific calculations. Starting with a small workshop in 2009, continued investigations eventually led to the procurement of the GPU resource Timaeus, which is a four-node, eight-GPU cluster with two Nvidia M2050 GPU cards per node. The resource is housed within the larger cluster Platon and shares disk, network and system resources with that cluster. The inauguration of Timaeus coincided with the meeting "Computational Physics with GPUs" in November 2010, hosted by the Theoretical Astrophysics group at Lund Observatory. The meeting consisted of a two-day workshop on GPU computing and a two-day science meeting on using GPUs as a tool for computational physics research, with a particular focus on astrophysics and computational biology. Today Timaeus is used by research groups from Lund, Stockholm and Luleå in fields ranging from astrophysics to molecular chemistry. We are investigating the use of GPUs with commercial software packages and user-supplied MPI-enabled codes. Looking ahead, Lunarc will be installing a new cluster during the summer of 2011 which will have a small number of GPU-enabled nodes that will enable us to continue working with the combination of parallel codes and GPU computing. It is clear that the combination of GPUs and CPUs is becoming an important part of high performance computing. Here we describe what has been done at Lunarc regarding GPU computations, and how we will continue to investigate new and upcoming multi-GPU servers and how they can be utilized in our environment.
Lossless data compression for improving the performance of a GPU-based beamformer.
Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
2015-04-01
The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming. However, a real-time GPU-based beamformer requires a high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g., Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of the original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables parallel compression and decompression of data. By this means, the data transfer requirement of the hardware interface and the transmission time of CPU-to-GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved threefold, as compared with transferring the original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigates the transmission bandwidth requirement to transfer data from the hardware front end to the software system but also reduces the transmission time for CPU-to-GPU data transfer. © The Author(s) 2014.
Mori, Shinichiro; Kobayashi, Masanao; Kumagai, Motoki; Minohara, Shinichi
2009-01-01
To provide faster calculation of digitally reconstructed radiographs (DRRs) in patient-positioning verification, we developed and evaluated a graphic processing unit (GPU)-based DRR software application and compared it with a central processing unit (CPU)-based application. The evaluation metrics were calculation speed and image quality for various slice thicknesses. The results showed that the GPU-based DRR computation was an average of 50 times faster than the CPU-based methodology, whereas the image quality was very similar. This excellent performance may increase the accuracy of patient positioning and improve the patient treatment throughput time.
Fast, parallel implementation of particle filtering on the GPU architecture
Gelencsér-Horváth, Anna; Tornai, Gábor János; Horváth, András; Cserey, György
2013-12-01
In this paper, we introduce a modified cellular particle filter (CPF) which we mapped onto a graphics processing unit (GPU) architecture. We developed this filter adaptation using a state-of-the-art CPF technique. Mapping this filter realization onto a highly parallel architecture entailed a shift in the logical representation of the particles. In this process, the original two-dimensional organization is reordered as a one-dimensional ring topology. We performed proof-of-concept measurements on two models with an NVIDIA Fermi architecture GPU. This design achieved a 411-μs kernel time per state and a 77-ms global running time for all states for 16,384 particles with a neighbourhood size of 256 on a sequence of 24 states for a bearing-only tracking model. For a commonly used benchmark model at the same configuration, we achieved a 266-μs kernel time per state and a 124-ms global running time for all 100 states. Kernel time includes random number generation on the GPU with curand. These results attest to the effective and fast use of the particle filter in high-dimensional, real-time applications.
SAR wind retrieval: from Singlecore to Multicore and GPU computing
Myasoedov, Alexander; Monzikova, Anna
The large spatial coverage and high resolution of spaceborne synthetic aperture radars (SAR) offer a unique opportunity to derive mesoscale wind fields over the ocean surface, providing high resolution wind fields near the shore. On the other hand, due to the large size of SAR images, their processing can be burdensome when dealing with operational tasks or doing long-period statistical analysis. Algorithms for satellite image processing often offer many possibilities for parallelism (e.g., pixel-by-pixel processing), which makes them good candidates for execution on high-performance parallel computing hardware such as multicore CPUs and modern graphics processing units (GPUs). In this study we implement different SAR wind speed retrieval algorithms (e.g., CMOD4, CMOD5) for single-core and multicore systems, including GPUs. For this purpose both serial and parallelized versions of the CMOD algorithms were written in Matlab, Python, CPython and PyOpenCL. We apply these algorithms to an Envisat ASAR image and compare the results obtained with different versions of the algorithms executed on both an Intel CPU and a Tesla GPU. As a result of our experiments we not only show a speedup of up to 400 times for the GPU compared to the CPU but also try to give some advice on how much time and effort were spent writing the same algorithm in different programming languages. We hope that our experience will help other scientists to get the full benefit of GPU/multicore computing.
Implementation of GPU-accelerated back projection for EPR imaging.
Qiao, Zhiwei; Redler, Gage; Epel, Boris; Qian, Yuhua; Halpern, Howard
2015-01-01
Electron paramagnetic resonance (EPR) imaging (EPRI) is a robust method for measuring in vivo oxygen concentration (pO2). For 3D pulse EPRI, a commonly used reconstruction algorithm is the filtered backprojection (FBP) algorithm, in which the backprojection process is computationally intensive and may be time consuming when implemented on a CPU. A multistage implementation of the backprojection can be used for acceleration; however, it is not flexible (it requires an equal linear angle projection distribution) and may still be time consuming. In this work, single-stage backprojection is implemented on a graphics processing unit (GPU) having 1152 cores to accelerate the process. The GPU implementation results in acceleration by over a factor of 200 overall and by over a factor of 3500 if only the computing time is considered. Some important experiences regarding the implementation of GPU-accelerated backprojection for EPRI are summarized. The resulting accelerated image reconstruction is useful for real-time image reconstruction monitoring and other time-sensitive applications.
Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T
2011-11-21
We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the Compute Unified Device Architecture (CUDA) platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC was 12.30-16.00 times faster than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning.
Risk management at GPU Nuclear
International Nuclear Information System (INIS)
Long, R.L.
1991-01-01
This paper reports on GPU Nuclear. Among other goals, it established the independence of key safety functions as highlighted by the lessons learned from the accident. In particular, an independent Nuclear Assurance Division was established which included Quality Assurance, Training and Education, Emergency Preparedness, and Nuclear Safety Assessment. The latter consisted of corporate and site independent-safety-review groups. As the GPU Nuclear organization matured, a mid-1987 reorganization created an even more focused Planning and Nuclear Safety Division bringing together Nuclear Safety Assessment with Licensing and Regulatory Affairs and Risk Management. The Risk Management Group (RMG), which began its work in fall 1987, was formed to develop a framework for proactive identification, evaluation, and cost-effective reduction and management of risks of all types. The RMG set out to learn as much as possible about risks and their management in nuclear and other high-technology industries. This began with a thorough literature search. It progressed to interviews with individuals and organizations which have demonstrated innovative ideas, experience, and reputations for safe and reliable operation.
Enhancing professionalism at GPU nuclear
International Nuclear Information System (INIS)
Coe, R.P.; Landy, F.J.
1991-01-01
Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing Professionalism at its Oyster Creek and Three Mile Island Nuclear Generating Stations. The program was also to include its Corporate Headquarters in Parsippany, New Jersey. The overall program was to take several directions which included on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for SROs and expanded teamwork and leadership training for control room crews. The largest portion of this initiative was the development and delivery of professionalism training to the nearly two thousand people at both sites. Three primary philosophies guided the development of the program. Employees as Experts: First, GPU Nuclear employees were considered to be the most valuable source of information for designing a Professionalism program because it is these individuals who are sensitive to the issues encountered in the workplace. Realism: The second philosophy guiding this effort was that the program must be grounded in real life challenges that employees face and must address. Active Learning: The third guiding philosophy was that, in order to have any real impact on the way employees think about professionalism, the program must utilize active rather than passive learning techniques
Memory-Scalable GPU Spatial Hierarchy Construction.
Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D
2011-04-01
Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.
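The memory-bounding idea behind a partial breadth-first order can be illustrated with a toy example. The sketch below is not the authors' algorithm: it builds a median-split binary tree over sorted 1-D points, and the names `build_pbfs` and `max_active` are hypothetical. It assumes that each iteration expands at most `max_active` pending nodes (standing in for the nodes a GPU would process in parallel) and defers the rest, so the per-iteration working set stays capped while every node is still built.

```python
from collections import deque

def build_pbfs(points, leaf_size=2, max_active=4):
    """Toy partial-BFS build over sorted 1-D points (illustrative only).

    Returns (leaves, peak_active): leaf ranges as (lo, hi) index pairs
    and the largest number of nodes expanded in a single iteration.
    """
    pts = sorted(points)
    pending = deque([(0, len(pts))])   # nodes waiting to be split
    leaves = []
    peak_active = 0
    while pending:
        # Take at most max_active nodes; a full BFS would take them all.
        batch = [pending.pop() for _ in range(min(max_active, len(pending)))]
        peak_active = max(peak_active, len(batch))
        for lo, hi in batch:
            if hi - lo <= leaf_size:
                leaves.append((lo, hi))      # small enough: emit a leaf
            else:
                mid = (lo + hi) // 2         # median split
                pending.append((lo, mid))
                pending.append((mid, hi))
    return sorted(leaves), peak_active

leaves, peak = build_pbfs(range(64), leaf_size=2, max_active=4)
```

Raising `max_active` moves the build toward a full BFS (more parallel work per iteration, larger working set); lowering it trades parallelism for a smaller working set.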
Airborne SAR Real-time Imaging Algorithm Design and Implementation with CUDA on NVIDIA GPU
Directory of Open Access Journals (Sweden)
Meng Da-di
2013-12-01
Synthetic Aperture Radar (SAR) image processing requires a huge amount of computation. Traditionally, this task runs on a workstation or server based on a Central Processing Unit (CPU) and is rather time-consuming, hence real-time processing of SAR data is impossible. Based on Compute Unified Device Architecture (CUDA) technology, a new SAR imaging algorithm running on an NVIDIA Graphic Processing Unit (GPU) is proposed. The new proposal makes it possible for data processing and CPU/GPU data exchange to execute concurrently, especially when the SAR data size exceeds the total GPU global memory size. Multi-GPU operation is supported by the new proposal, and all computational resources are fully exploited. Experiments on an NVIDIA K20C and an Intel E5645 show that the proposed solution accelerates SAR data processing by tens of times. Consequently, a GPU-based SAR processing system with the proposed solution embedded is much more power-saving and portable, which qualifies it as a real-time SAR data processing system. Experiments show that the K20C with the new solution can process 36 Mega points of SAR data per second in real time.
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung
2015-01-01
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
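The two-stage flow described in this abstract (a cheap CPU pre-filter, then full matching only for suspicious packets) can be sketched as follows. This is an illustration under stated assumptions, not the HPMA algorithm itself: the pre-filter rule (looking for the first two bytes of any signature), the function names `prefilter`, `full_match` and `inspect`, and the sample signatures are all hypothetical, and plain substring search stands in for the GPU matching stage. Signatures are assumed to be at least two bytes long.

```python
def prefilter(payload, prefixes):
    """Cheap first stage: flag a packet if any 2-byte signature prefix occurs."""
    return any(payload[i:i + 2] in prefixes for i in range(len(payload) - 1))

def full_match(payload, signatures):
    """Expensive second stage (stand-in for GPU matching): exact search."""
    return [s for s in signatures if s in payload]

def inspect(packets, signatures):
    """Return (suspicious packets, {packet: matched signatures})."""
    prefixes = {s[:2] for s in signatures}
    # Stage 1: every packet goes through the pre-filter.
    suspicious = [p for p in packets if prefilter(p, prefixes)]
    # Stage 2: only suspicious packets get the full signature search.
    alerts = {}
    for p in suspicious:
        matches = full_match(p, signatures)
        if matches:
            alerts[p] = matches
    return suspicious, alerts

signatures = [b"evil", b"attack"]
packets = [b"hello world", b"an evil plan", b"every day", b"attack at dawn"]
suspicious, alerts = inspect(packets, signatures)
```

In this sketch the pre-filter can produce false positives (`b"every day"` shares a prefix with `b"evil"`) but no false negatives, since any payload containing a signature also contains its first two bytes.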
Cheng, Chun-Pei; Lan, Kuo-Lun; Liu, Wen-Chun; Chang, Ting-Tsung; Tseng, Vincent S
2016-12-01
Hepatitis B viral (HBV) infection is strongly associated with an increased risk of liver diseases like cirrhosis or hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions occurring in HBV genomic DNA are highly associated with the activity of HBV via the interplay between aberrant viral protein release and the human immune system. Finding deletions in HBV whole-genome sequences is thus a very important issue, though mining such big and complex biological data poses underlying challenges. Although some next generation sequencing (NGS) tools have recently been designed for identifying structural variations such as insertions or deletions, their validity is generally tied to the study of human sequences. This design may not be suitable for viruses due to species differences. We propose a graphics processing unit (GPU)-based data mining method called DeF-GPU to efficiently and precisely identify HBV deletions from large NGS data, which generally contain millions of reads. To fit the single instruction, multiple data model, sequencing reads are treated as the multiple data and the deletion finding procedure is treated as the single instruction. We use Compute Unified Device Architecture (CUDA) to parallelize the procedures, and further validate DeF-GPU on five synthetic datasets and one real dataset. Our results suggest that DeF-GPU outperforms the existing commonly used method Pindel and is able to exactly identify the deletions of our ground truth in a few seconds. The source code and other related materials are available at https://sourceforge.net/projects/defgpu/. Copyright © 2016 Elsevier Inc. All rights reserved.
SU-E-T-493: Accelerated Monte Carlo Methods for Photon Dosimetry Using a Dual-GPU System and CUDA.
Liu, T; Ding, A; Xu, X
2012-06-01
To develop a Graphics Processing Unit (GPU) based Monte Carlo (MC) code that accelerates dose calculations on a dual-GPU system. We simulated a clinical case of prostate cancer treatment. A voxelized abdomen phantom derived from 120 CT slices was used, containing 218×126×60 voxels, and a GE LightSpeed 16-MDCT scanner was modeled. A CPU version of the MC code was first developed in C++ and tested on an Intel Xeon X5660 2.8 GHz CPU, then it was translated into a GPU version using CUDA C 4.1 and run on a dual Tesla M2090 GPU system. The code featured automatic assignment of simulation tasks to multiple GPUs, as well as accurate calculation of energy- and material-dependent cross-sections. Double-precision floating point format was used for accuracy. Doses to the rectum, prostate, bladder and femoral heads were calculated. When running on a single GPU, the MC GPU code was found to be 19 times faster than the CPU code and 42 times faster than MCNPX. These speedup factors were doubled on the dual-GPU system. The dose result was benchmarked against MCNPX and a maximum difference of 1% was observed when the relative error was kept below 0.1%. A GPU-based MC code was developed for dose calculations using detailed patient and CT scanner models. Efficiency and accuracy were both guaranteed in this code. Scalability of the code was confirmed on the dual-GPU system. © 2012 American Association of Physicists in Medicine.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on a Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
Richmond, Paul; Buesing, Lars; Giugliano, Michele; Vasilaki, Eleni
2011-01-01
High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a “non-democratic” mechanism), achieve mediocre learning results at best. In absence of recurrent connections, where all neurons “vote” independently (“democratic”) for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated. PMID:21572529
Plain Polynomial Arithmetic on GPU
International Nuclear Information System (INIS)
Haque, Sardar Anisul; Maza, Marc Moreno
2012-01-01
As for serial code on CPUs, parallel code on GPUs for dense polynomial arithmetic relies on a combination of asymptotically fast and plain algorithms. Those are employed for data of large and small size, respectively. Parallelizing both types of algorithms is required in order to achieve peak performances. In this paper, we show that the plain dense polynomial multiplication can be efficiently parallelized on GPUs. Remarkably, it outperforms (highly optimized) FFT-based multiplication up to degree 2^12, while on CPU the same threshold is usually at 2^6. We also report on a GPU implementation of the Euclidean Algorithm which is both work-efficient and runs in linear time for input polynomials up to degree 2^18, thus showing the performance of the GCD algorithm based on systolic arrays.
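The "plain" (schoolbook) multiplication referred to in this abstract can be illustrated with a minimal Python sketch. The function names are hypothetical, and the per-coefficient decomposition in `output_coefficient` is only an assumption about how the work could be split across independent threads; it is not the parallelization scheme of the paper.

```python
def plain_multiply(a, b):
    """Schoolbook product of two dense polynomials.

    Polynomials are coefficient lists, lowest degree first.
    """
    if not a or not b:
        return []
    result = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            result[i + j] += ai * bj
    return result


def output_coefficient(a, b, k):
    """Coefficient k of a*b, computed independently of all others.

    Assumption: each output coefficient could be assigned to its own thread.
    """
    lo = max(0, k - len(b) + 1)
    hi = min(k, len(a) - 1)
    return sum(a[i] * b[k - i] for i in range(lo, hi + 1))


# (1 + 2x + 3x^2) * (4 + 5x)
product = plain_multiply([1, 2, 3], [4, 5])
per_thread = [output_coefficient([1, 2, 3], [4, 5], k) for k in range(4)]
```

Both routes perform the same quadratic number of coefficient products; the second form simply makes the independence of the output coefficients explicit.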
CPU and GPU (Cuda) Template Matching Comparison
Evaldas Borcovas; Gintautas Daunys
2014-01-01
Image processing, computer vision or other complicated optical information processing algorithms require large resources. It is often desired to execute algorithms in real time. It is hard to fulfill such requirements with a single CPU. The CUDA technology proposed by NVidia enables the programmer to use the GPU resources in the computer. The current research was carried out with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I), an NVidia GeForce GT320M CUDA-compatible graphics card (GPU I) ...
Fast polynomial multiplication on a GPU
International Nuclear Information System (INIS)
Maza, Marc Moreno; Pan Wei
2010-01-01
We present CUDA implementations of Fast Fourier Transforms over finite fields. This allows us to develop GPU support for dense univariate polynomial multiplication, leading to speedup factors in the range 21 - 37 with respect to the best serial C-code available to us, for our largest input data sets. Since dense univariate polynomial multiplication is a core routine in symbolic computation, this is a promising result for the integration of GPU support into computer algebra systems.
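As a serial illustration of FFT-based multiplication over a finite field, the sketch below uses a number-theoretic transform in Python. The prime 998244353 (= 119 * 2^23 + 1) with primitive root 3 is a common textbook choice and an assumption here; the abstract does not state which fields or transform variants the CUDA code uses.

```python
P = 998244353  # prime of the form 119 * 2**23 + 1 (assumed for illustration)
G = 3          # a primitive root modulo P


def ntt(values, invert=False):
    """Iterative radix-2 transform over Z/PZ; len(values) must be a power of two."""
    n = len(values)
    a = list(values)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes of doubling length.
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)
        if invert:
            w_len = pow(w_len, P - 2, P)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % P
                a[k] = (u + v) % P
                a[k + half] = (u - v) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a


def fft_multiply(a, b):
    """Product of two coefficient lists (lowest degree first), modulo P."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:
        size <<= 1
    fa = ntt(a + [0] * (size - len(a)))
    fb = ntt(b + [0] * (size - len(b)))
    pointwise = [x * y % P for x, y in zip(fa, fb)]
    return ntt(pointwise, invert=True)[:out_len]


fft_product = fft_multiply([1, 2, 3], [4, 5])
```

The butterflies within one pass are independent of each other, which is the property a GPU implementation would exploit; results are exact as long as the true coefficients are smaller than P.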
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
International Nuclear Information System (INIS)
Gong Chunye; Liu Jie; Chi Lihua; Huang Haowei; Fang Jingyue; Gong Zhenghu
2011-01-01
The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now offers great capability for scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (S_n) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, the simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
AVIST: A GPU-Centric Design for Visual Exploration of Large Multidimensional Datasets
Directory of Open Access Journals (Sweden)
Peng Mi
2016-10-01
This paper presents the Animated VISualization Tool (AVIST), an exploration-oriented data visualization tool that enables rapidly exploring and filtering large time series multidimensional datasets. AVIST highlights interactive data exploration by revealing fine data details. This is achieved through the use of animation and cross-filtering interactions. To support interactive exploration of big data, AVIST features a GPU (Graphics Processing Unit)-centric design. Two key aspects are emphasized on the GPU-centric design: (1) both data management and computation are implemented on the GPU to leverage its parallel computing capability and fast memory bandwidth; (2) a GPU-based directed acyclic graph is proposed to characterize data transformations triggered by users’ demands. Moreover, we implement AVIST based on the Model-View-Controller (MVC) architecture. In the implementation, we consider two aspects: (1) user interaction is highlighted to slice big data into small data; and (2) data transformation is based on parallel computing. Two case studies demonstrate how AVIST can help analysts identify abnormal behaviors and infer new hypotheses by exploring big datasets. Finally, we summarize lessons learned about GPU-based solutions in interactive information visualization with big data.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.
Mei, Gang; Xu, Nengxiong; Xu, Liangliang
2016-01-01
This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on a modern Graphics Processing Unit (GPU). The presented algorithm improves on our previous GPU-accelerated AIDW algorithm by adopting a fast k-nearest neighbors (kNN) search. AIDW needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; the desired prediction value of the interpolated point is then obtained by weighted interpolation using that power parameter. In this work, we develop a fast kNN search approach based on a space-partitioning data structure, the even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of two stages: kNN search and weighted interpolation. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
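The two stages named in this abstract, grid-based kNN search followed by weighted interpolation, can be sketched serially in 2D with Python. All function names are hypothetical, and the power parameter is passed in as a fixed value: the adaptive rule that AIDW uses to choose it is not given in the abstract and is deliberately left out.

```python
import math
from collections import defaultdict


def build_grid(points, cell):
    """Bucket (x, y, value) points into square cells of side `cell`."""
    grid = defaultdict(list)
    for idx, (x, y, _) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(idx)
    return grid


def knn(points, grid, cell, qx, qy, k):
    """k nearest neighbours of (qx, qy) as sorted (distance, index) pairs."""
    k = min(k, len(points))
    if k == 0:
        return []
    cx, cy = math.floor(qx / cell), math.floor(qy / cell)
    found, ring = [], 0
    while True:
        # Visit only the cells at Chebyshev distance `ring` from the query cell.
        for ix in range(cx - ring, cx + ring + 1):
            for iy in range(cy - ring, cy + ring + 1):
                if max(abs(ix - cx), abs(iy - cy)) == ring:
                    for idx in grid.get((ix, iy), ()):
                        px, py, _ = points[idx]
                        found.append((math.hypot(px - qx, py - qy), idx))
        found.sort()
        # Unvisited points are at least ring * cell away, so stop once safe.
        if len(found) >= k and found[k - 1][0] <= ring * cell:
            return found[:k]
        ring += 1


def idw(points, neighbours, power):
    """Inverse-distance-weighted value; `power` is fixed here (assumption)."""
    num = den = 0.0
    for dist, idx in neighbours:
        if dist == 0.0:
            return points[idx][2]
        weight = dist ** -power
        num += weight * points[idx][2]
        den += weight
    return num / den


samples = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 5.0), (5.0, 5.0, 100.0)]
grid = build_grid(samples, cell=1.0)
nearest = knn(samples, grid, 1.0, 0.5, 0.0, k=2)
estimate = idw(samples, nearest, power=2.0)
```

Each interpolated point is handled independently of the others, so on a GPU one thread per query point is a natural (assumed) mapping.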
Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing.
Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C; Gao, Wen
2018-05-01
The compact descriptors for visual search (CDVS) standard from the ISO/IEC moving pictures experts group has succeeded in enabling interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of a CDVS encoder hinders its wide deployment in industry for large-scale visual search. In this paper, we revisit the merits of the low-complexity design of CDVS core techniques and present a very fast CDVS encoder that leverages the massively parallel execution resources of the graphics processing unit (GPU). We shift the computation-intensive and parallel-friendly modules to state-of-the-art GPU platforms, in which the thread block allocation and the memory access mechanism are jointly optimized to eliminate performance loss. In addition, operations with heavy data dependence are allocated to the CPU to relieve the GPU of extra, unnecessary computation burden. Furthermore, we demonstrate that the proposed fast CDVS encoder works well with convolutional neural network approaches, which makes it possible to leverage the advantages of GPU platforms harmoniously and yields significant performance improvements. Comprehensive experimental results over benchmarks show that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU
Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha
2018-03-01
Since the Graphics Processing Unit (GPU) offers strong floating-point computation and memory bandwidth for data parallelism, it has been widely used in general-purpose computing areas such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of the compute unified device architecture (CUDA), which reduces the complexity of compiling programs, brings great opportunities to CFD. There are three different modes for the parallel solution of the NS equations: a parallel solver based on CPU, a parallel solver based on GPU, and a heterogeneous parallel solver based on collaborating CPU and GPU. GPUs are relatively rich in compute capacity but poor in memory capacity, while the opposite holds for CPUs. To make full use of both GPUs and CPUs, a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experimental results, which demonstrates that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow; it decreases for turbulent flow but can still reach more than 20. Moreover, the speedup increases as the grid size becomes larger.
GAPD: a GPU-accelerated atom-based polychromatic diffraction simulation code.
E, J C; Wang, L; Chen, S; Zhang, Y Y; Luo, S N
2018-03-01
GAPD, a graphics-processing-unit (GPU)-accelerated atom-based polychromatic diffraction simulation code for direct, kinematics-based, simulations of X-ray/electron diffraction of large-scale atomic systems with mono-/polychromatic beams and arbitrary plane detector geometries, is presented. This code implements GPU parallel computation via both real- and reciprocal-space decompositions. With GAPD, direct simulations are performed of the reciprocal lattice node of ultralarge systems (∼5 billion atoms) and diffraction patterns of single-crystal and polycrystalline configurations with mono- and polychromatic X-ray beams (including synchrotron undulator sources), and validation, benchmark and application cases are presented.
High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU
Directory of Open Access Journals (Sweden)
STOJANOVIC, N.
2014-11-01
In this paper, the high-performance processing of massive geospatial data on many-core GPU (Graphic Processing Unit) is presented. We use the CUDA (Compute Unified Device Architecture) programming framework to implement parallel processing of common Geographic Information Systems (GIS) algorithms, such as viewshed analysis and map-matching. Experimental evaluation indicates the improvement in performance with respect to CPU-based solutions and shows feasibility of using GPU and CUDA for parallel implementation of GIS algorithms over large-scale geospatial datasets.
Real-Time GPU Implementation of Transverse Oscillation Vector Velocity Flow Imaging
DEFF Research Database (Denmark)
Bradway, David; Pihl, Michael Johannes; Krebs, Andreas
2014-01-01
Rapid estimation of blood velocity and visualization of complex flow patterns are important for clinical use of diagnostic ultrasound. This paper presents real-time processing for two-dimensional (2-D) vector flow imaging which utilizes an off-the-shelf graphics processing unit (GPU). In this work...... vector flow acquisition takes 2.3 milliseconds on an Advanced Micro Devices Radeon HD 7850 GPU card. The detected velocities are accurate to within the precision limit of the output format of the display routine. Because this tool was developed as a module external to the scanner’s built...
GPU Lossless Hyperspectral Data Compression System
Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.
2014-01-01
Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA®. The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO-42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly-parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.
International Nuclear Information System (INIS)
Ha, Woo Seok; Kim, Soo Mee; Park, Min Jae; Lee, Dong Soo; Lee, Jae Sung
2009-01-01
The maximum likelihood-expectation maximization (ML-EM) is a statistical reconstruction algorithm derived from a probabilistic model of the emission and detection processes. Although the ML-EM has many advantages in accuracy and utility, its use is limited by the computational burden of iterative processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on GPU (graphic processing unit) for the ML-EM algorithm. Using a Geforce 9800 GTX+ graphics card and NVIDIA's CUDA (compute unified device architecture) technology, the projection and backprojection in the ML-EM algorithm were parallelized. The time delays on computations for projection, errors between measured and estimated data, and backprojection in an iteration were measured. Total time included the latency in data transmission between RAM and GPU memory. The total computation times of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 sec, respectively. In this case, the computing speed was improved about 15 times on GPU. When the number of iterations increased to 1024, the CPU- and GPU-based computing took 18 min and 8 sec in total, respectively. The improvement was about 135 times and was caused by delay on CPU-based computing after certain iterations. On the other hand, the GPU-based computation provided very small variation on time delay per iteration due to use of shared memory. The GPU-based parallel computation for ML-EM significantly improved the computing speed and stability. The developed GPU-based ML-EM algorithm could be easily modified for some other imaging geometries.
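The projection, ratio and backprojection steps mentioned in this abstract correspond to the standard ML-EM update, shown here as a small serial Python sketch. The dense system matrix, the uniform initial image and all names are assumptions for illustration; the geometry and data layout of the study are not described in the abstract.

```python
def mlem(system, measured, iterations):
    """Standard ML-EM update for a dense system matrix.

    system[i][j]: contribution of pixel j to detector bin i (assumed layout).
    measured[i]:  measured counts in detector bin i.
    """
    n_bins, n_pix = len(system), len(system[0])
    # Sensitivity image: column sums of the system matrix.
    sensitivity = [sum(system[i][j] for i in range(n_bins)) for j in range(n_pix)]
    image = [1.0] * n_pix  # uniform initial estimate
    for _ in range(iterations):
        # Forward projection of the current estimate.
        projection = [sum(system[i][j] * image[j] for j in range(n_pix))
                      for i in range(n_bins)]
        # Ratio between measured and estimated data.
        ratio = [m / p if p > 0 else 0.0 for m, p in zip(measured, projection)]
        # Backprojection of the ratio, then multiplicative update.
        back = [sum(system[i][j] * ratio[i] for i in range(n_bins))
                for j in range(n_pix)]
        image = [image[j] * back[j] / sensitivity[j] for j in range(n_pix)]
    return image


toy_system = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_measured = [2.0, 3.0, 5.0]  # noise-free data for the image [2, 3]
reconstruction = mlem(toy_system, toy_measured, iterations=50)
```

The forward projection is independent per detector bin and the backprojection is independent per pixel, which is what makes both steps candidates for parallel execution.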
Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank
2011-07-01
A widely accepted method to quantify differences in dose distributions is the gamma (γ) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation using a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases have been evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes including the option of using texture or global memory. A GPU over CPU speed-up of 66 ± 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 ± 15 using the GPU was obtained. A thread-block size of 16 x 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.
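For readers unfamiliar with the gamma evaluation, the following is a minimal 1D Python sketch of the commonly used definition: for each reference point, the minimum over evaluated points of the combined, tolerance-normalized distance and dose difference. The function name, the brute-force search, and the tolerance values in the example are assumptions; the abstract does not state the criteria or search strategy of the 3D GPU implementation.

```python
import math


def gamma_1d(reference, evaluated, spacing, dose_tol, dist_tol):
    """Gamma value for every reference point on a regular 1D grid.

    spacing:  grid spacing (same length unit as dist_tol).
    dose_tol: dose-difference criterion (same unit as the dose values).
    dist_tol: distance-to-agreement criterion.
    """
    gammas = []
    for i, d_ref in enumerate(reference):
        best = math.inf
        for j, d_evl in enumerate(evaluated):
            distance = (j - i) * spacing
            value = math.sqrt((distance / dist_tol) ** 2
                              + ((d_evl - d_ref) / dose_tol) ** 2)
            best = min(best, value)
        gammas.append(best)
    return gammas


# Hypothetical example: flat profiles offset by half of the dose tolerance.
flat_reference = [1.0, 1.0, 1.0]
flat_evaluated = [1.015, 1.015, 1.015]
gammas = gamma_1d(flat_reference, flat_evaluated,
                  spacing=1.0, dose_tol=0.03, dist_tol=3.0)
```

A point is usually counted as passing when its gamma value does not exceed 1. Because each reference point is processed independently, one thread per reference voxel is a plausible (assumed) GPU mapping.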
GPU-Based Techniques for Global Illumination Effects
Szirmay-Kalos, László; Sbert, Mateu
2008-01-01
This book presents techniques to render photo-realistic images by programming the Graphics Processing Unit (GPU). We discuss effects such as mirror reflections, refractions, caustics, diffuse or glossy indirect illumination, radiosity, single or multiple scattering in participating media, tone reproduction, glow, and depth of field. This book targets game developers, graphics programmers, and also students with some basic understanding of computer graphics algorithms, rendering APIs like Direct3D or OpenGL, and shader programming. In order to make this book self-contained, the most important c
Ramses-GPU: Second order MUSCL-Handcock finite volume fluid solver
Kestener, Pierre
2017-10-01
RamsesGPU is a reimplementation of RAMSES (ascl:1011.007) which drops the adaptive mesh refinement (AMR) features to optimize 3D uniform grid algorithms for modern graphics processor units (GPU) to provide an efficient software package for astrophysics applications that do not need AMR features but do require a very large number of integration time steps. RamsesGPU provides a very efficient C++/CUDA/MPI software implementation of a second order MUSCL-Handcock finite volume fluid solver for compressible hydrodynamics as well as a magnetohydrodynamics solver based on the constraint transport technique. Other useful modules include static gravity, dissipative terms (viscosity, resistivity), and a forcing source term for turbulence studies, and special care was taken to enhance parallel input/output performance by using state-of-the-art libraries such as HDF5 and parallel-netcdf.
Protein-protein docking on hardware accelerators: comparison of GPU and MIC architectures
2015-01-01
Background The hardware accelerators will provide solutions to computationally complex problems in bioinformatics fields. However, the effect of acceleration depends on the nature of the application, thus selection of an appropriate accelerator requires some consideration. Results In the present study, we compared the effects of acceleration using graphics processing unit (GPU) and many integrated core (MIC) on the speed of fast Fourier transform (FFT)-based protein-protein docking calculation. The GPU implementation performed the protein-protein docking calculations approximately five times faster than the MIC offload mode implementation. The MIC native mode implementation has the advantage in the implementation costs. However, the performance was worse with larger protein pairs because of memory limitations. Conclusion The results suggest that GPU is more suitable than MIC for accelerating FFT-based protein-protein docking applications. PMID:25707855
An Approach in Radiation Therapy Treatment Planning: A Fast, GPU-Based Monte Carlo Method.
Karbalaee, Mojtaba; Shahbazi-Gahrouei, Daryoush; Tavakoli, Mohammad B
2017-01-01
An accurate and fast radiation dose calculation is essential for successful radiotherapy. The aim of this study was to implement a new graphic processing unit (GPU) based radiation therapy treatment planning for accurate and fast dose calculation in radiotherapy centers. A program was written for parallel running based on GPU. The code validation was performed by EGSnrc/DOSXYZnrc. Moreover, a semi-automatic, rotary, asymmetric phantom was designed and produced using bone-, lung-, and soft-tissue-equivalent materials. All measurements were performed using a Mapcheck dosimeter. The accuracy of the code was validated using the experimental data, which was obtained from the anthropomorphic phantom as the gold standard. The findings showed that, compared with those of DOSXYZnrc in the virtual phantom and for most of the voxels (>95%), GPU-based Monte Carlo method in dose calculation may be useful in routine radiation therapy centers as the core and main component of a treatment planning verification system.
Energy Technology Data Exchange (ETDEWEB)
Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL
2015-01-01
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
Interior Point Methods on GPU with application to Model Predictive Control
DEFF Research Database (Denmark)
Gade-Nielsen, Nicolai Fog
The goal of this thesis is to investigate the application of interior point methods to solve dynamical optimization problems, using a graphical processing unit (GPU) with a focus on problems arising in Model Predictive Control (MPC). Multi-core processors have been available for over ten years now......, and demonstrate that our implementation can reduce the solution time substantially. There are multiple software packages available for solving optimization problems with interior point methods, such as GLPK, IPOPT, MOSEK and many more. However, none of these support the GPU yet. With this thesis, we include a new...... software package called GPUOPT, available under the non-restrictive MIT license. GPUOPT includes a primal-dual interior-point method, which supports both the CPU and the GPU. It is implemented as multiple components, where the matrix operations and solver for the Newton directions are separated...
Compact multimode fiber beam-shaping system based on GPU accelerated digital holography.
Plöschner, Martin; Čižmár, Tomáš
2015-01-15
Real-time, on-demand, beam shaping at the end of the multimode fiber has recently been made possible by exploiting the computational power of rapidly evolving graphics processing unit (GPU) technology [Opt. Express 22, 2933 (2014)]. However, the current state-of-the-art system requires the presence of an acousto-optic deflector (AOD) to produce images at the end of the fiber without interference effects between neighboring output points. Here, we present a system free from the AOD complexity where we achieve the removal of the undesired interference effects computationally using GPU implemented Gerchberg-Saxton and Yang-Gu algorithms. The GPU implementation is two orders of magnitude faster than the CPU implementation which allows video-rate image control at the distal end of the fiber virtually free of interference effects.
GPU Acceleration for Density Functional Theory with Slater-Type Orbitals
van Schoot, H.; Visscher, L.; Walker, R.C.; Gotz, A.W.
2016-01-01
This chapter describes the graphics processing unit (GPU) acceleration of density functional theory (DFT) calculations with Slater-type orbitals (STOs) as developed for the Amsterdam Density Functional (ADF) program. This implementation is focused on accelerating the numerical integration step in
GPU acceleration of the stochastic grid bundling method for early-exercise options
A. Leitao Rodriguez (Álvaro); C.W. Oosterlee (Cornelis)
2015-01-01
In this work, a parallel graphics processing unit (GPU) version of the Monte Carlo stochastic grid bundling method (SGBM) for pricing multi-dimensional early-exercise options is presented. To extend the method’s applicability, the problem dimensions and the number of bundles will be
BLAZE-DEM: A GPU based Polyhedral DEM particle transport code
CSIR Research Space (South Africa)
Govender, Nicolin
2013-05-01
This paper introduces the BLAZE-DEM code that is based on the Discrete Element Method (DEM) and specifically targeted for Graphical Processing Unit (GPU) platforms. BLAZE-DEM uses actual polyhedral particle representations as opposed to multi...
GPU accelerated population annealing algorithm
Barash, Lev Yu.; Weigel, Martin; Borovský, Michal; Janke, Wolfhard; Shchur, Lev N.
2017-11-01
steps and multi-histogram reweighting. Additional comments: Code repository at https://github.com/LevBarash/PAising. The system size and size of the population of replicas are limited depending on the memory of the GPU device used. For the default parameter values used in the sample programs, L = 64, θ = 100, β0 = 0, βf = 1, Δβ = 0.005, R = 20 000, a typical run time on an NVIDIA Tesla K80 GPU is 151 seconds for the single spin coded (SSC) and 17 seconds for the multi-spin coded (MSC) program (see Section 2 for a description of these parameters).
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. Graphics processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of the parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and an identical algorithm designed for a single CPU (MCPEMCPU) were developed using MATLAB on a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. The speedup factor was used to assess the relative benefit of the parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of the parallelized MCPEM algorithm developed in this study holds great promise in serving as the core for the next generation of modeling software for population PK/PD analysis.
Xu, Daguang; Huang, Yong; Kang, Jin U
2014-06-16
We implemented the graphics processing unit (GPU) accelerated compressive sensing (CS) non-uniform in k-space spectral domain optical coherence tomography (SD OCT). Kaiser-Bessel (KB) function and Gaussian function are used independently as the convolution kernel in the gridding-based non-uniform fast Fourier transform (NUFFT) algorithm with different oversampling ratios and kernel widths. Our implementation is compared with the GPU-accelerated modified non-uniform discrete Fourier transform (MNUDFT) matrix-based CS SD OCT and the GPU-accelerated fast Fourier transform (FFT)-based CS SD OCT. It was found that our implementation has comparable performance to the GPU-accelerated MNUDFT-based CS SD OCT in terms of image quality while providing more than 5 times speed enhancement. When compared to the GPU-accelerated FFT-based CS SD OCT, it shows smaller background noise and fewer side lobes while eliminating the need for the cumbersome k-space grid filling and the k-linear calibration procedure. Finally, we demonstrated that by using a conventional desktop computer architecture having three GPUs, real-time B-mode imaging can be obtained in excess of 30 fps for the GPU-accelerated NUFFT-based CS SD OCT with frame size 2048 (axial) × 1,000 (lateral).
Xiao, Kai; Chen, Danny Z; Hu, X Sharon; Zhou, Bo
2012-12-01
The three-dimensional digital differential analyzer (3D-DDA) algorithm is a widely used ray traversal method, which is also at the core of many convolution/superposition (C/S) dose calculation approaches. However, porting existing C/S dose calculation methods onto graphics processing units (GPUs) has brought challenges to retaining the efficiency of this algorithm. In particular, a straightforward implementation of the original 3D-DDA algorithm incurs substantial branch divergence, which conflicts with the GPU programming model and leads to suboptimal performance. In this paper, an efficient GPU implementation of the 3D-DDA algorithm is proposed, which effectively reduces such branch divergence and improves performance of the C/S dose calculation programs running on GPU. The main idea of the proposed method is to convert a number of conditional statements in the original 3D-DDA algorithm into a set of simple operations (e.g., arithmetic, comparison, and logic) which are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU-based collapsed cone convolution/superposition (CCCS) dose calculation program. The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D-DDA ray traversal implementation runs 1.42 to 2.67× faster than the one based on the original 3D-DDA implementation, without losing any accuracy. The results show that the proposed method can effectively reduce branch divergence in the original 3D-DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D-DDA algorithm, various applications can benefit from this implementation method.
FastGCN: a GPU accelerated tool for fast gene co-expression networks.
Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
2015-01-01
Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
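The correlation stage of the pipeline described in this abstract can be sketched serially in Python. Only the pairwise Pearson step is shown; the entropy filter, normalization, False Discovery Rate control and module identification are omitted because the abstract gives no details for them. Function names and the packed upper-triangle storage are assumptions, the latter loosely motivated by the mention of a compressed coefficient matrix.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_pairs(expression):
    """Correlations for all gene pairs (i < j), stored as a packed upper triangle.

    expression[g] is the profile of gene g across individuals. Every pair is
    independent, so one pair per thread is a plausible (assumed) GPU mapping.
    """
    n_genes = len(expression)
    return {(i, j): pearson(expression[i], expression[j])
            for i in range(n_genes) for j in range(i + 1, n_genes)}


# Hypothetical toy data: 3 genes measured in 4 individuals.
toy_expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with gene 0
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with gene 0
]
toy_correlations = correlation_pairs(toy_expression)
```

Storing only the pairs with i < j halves the memory needed for the symmetric matrix; genes with constant expression would give a zero variance and must be filtered out before this step.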
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
General-purpose graphics processing units (GPGPUs) have been used to improve computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to handle 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the shader design and the load sharing between the vertex and fragment shaders were optimized. The beamformed data were captured from a tissue-mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). Real-time performance was evaluated by measuring frame rates while varying the range of signal processing blocks. The OpenGL ES 2.0 implementation was verified by analyzing the PSNR against a MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method shows no significant difference from the processing using MATLAB (i.e., PSNR [...]). With the mobile GPU implementation, frame rates of 57.6 Hz were achieved. The total execution time was 17.4 ms, which was shorter than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
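The PSNR comparison against a reference output can be sketched as below. The pixel values and the 8-bit peak of 255 are assumptions made for illustration; the abstract does not state the data format or the threshold that was used.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical 8-bit pixel rows: a reference output and a device output.
reference_row = [0, 64, 128, 255]
device_row = [0, 64, 129, 255]
print(psnr(reference_row, device_row))
```

A higher value means the device output is closer to the reference; identical outputs give an infinite PSNR.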
Molecular dynamics simulations through GPU video games technologies.
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the key techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. It has become a strategic technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations can require enormous amounts of computer memory and time, and their execution has therefore been a major problem. Despite the huge computational cost, molecular dynamics has been implemented on traditional computers with a central processing unit (CPU). Graphics processing unit (GPU) computing technology was first designed with the goal of improving video games, by rapidly creating and displaying images in a frame buffer for output to devices such as screens. The hybrid GPU-CPU implementation, combined with parallel computing, is a novel technology for performing a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations, including MD simulations. Herein, we describe the new methodologies developed initially for video games and how they are now applied in MD simulations.
Cucheb: A GPU implementation of the filtered Lanczos procedure
Aurentz, Jared L.; Kalantzis, Vassilis; Saad, Yousef
2017-11-01
This paper describes the software package Cucheb, a GPU implementation of the filtered Lanczos procedure for the solution of large sparse symmetric eigenvalue problems. The filtered Lanczos procedure uses a carefully chosen polynomial spectral transformation to accelerate convergence of the Lanczos method when computing eigenvalues within a desired interval. This method has proven particularly effective for eigenvalue problems that arise in electronic structure calculations and density functional theory. We compare our implementation against an equivalent CPU implementation and show that using the GPU can reduce the computation time by more than a factor of 10. Program summary: Program title: Cucheb; Program files DOI: http://dx.doi.org/10.17632/rjr9tzchmh.1; Licensing provisions: MIT; Programming language: CUDA C/C++; Nature of problem: Electronic structure calculations require the computation of all eigenvalue-eigenvector pairs of a symmetric matrix that lie inside a user-defined real interval; Solution method: To compute all the eigenvalues within a given interval, a polynomial spectral transformation is constructed that maps the desired eigenvalues of the original matrix to the exterior of the spectrum of the transformed matrix. The Lanczos method is then used to compute the desired eigenvectors of the transformed matrix, which are then used to recover the desired eigenvalues of the original matrix. The bulk of the operations are executed in parallel using a graphics processing unit (GPU); Runtime: Variable, depending on the number of eigenvalues sought and the size and sparsity of the matrix; Additional comments: Cucheb is compatible with CUDA Toolkit v7.0 or greater.
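A polynomial spectral filter of the kind described can be sketched as follows. The abstract does not specify how the polynomial is constructed, so this example makes two assumptions: the spectrum has already been scaled into [-1, 1], and the filter is a truncated Chebyshev expansion of the indicator function of the target interval [a, b]. The function names are hypothetical, and the matrix is supplied only through a matrix-vector product.

```python
import math


def indicator_chebyshev_coeffs(a, b, degree):
    """Chebyshev coefficients of the indicator of [a, b], a subinterval of [-1, 1]."""
    theta_a, theta_b = math.acos(a), math.acos(b)  # theta_a > theta_b
    coeffs = [(theta_a - theta_b) / math.pi]
    for k in range(1, degree + 1):
        coeffs.append(2.0 * (math.sin(k * theta_a) - math.sin(k * theta_b)) / (k * math.pi))
    return coeffs


def apply_filter(matvec, v, coeffs):
    """Compute p(A) v using the recurrence T_{k+1}(A) v = 2 A T_k(A) v - T_{k-1}(A) v."""
    t_prev = list(v)
    result = [coeffs[0] * x for x in t_prev]
    if len(coeffs) == 1:
        return result
    t_curr = matvec(v)
    result = [r + coeffs[1] * x for r, x in zip(result, t_curr)]
    for c in coeffs[2:]:
        t_next = [2.0 * y - x for y, x in zip(matvec(t_curr), t_prev)]
        result = [r + c * x for r, x in zip(result, t_next)]
        t_prev, t_curr = t_curr, t_next
    return result


# Diagonal test matrix with eigenvalues -0.8, 0.0 and 0.8; only 0.0 lies in [-0.2, 0.2].
eigenvalues = [-0.8, 0.0, 0.8]
filtered = apply_filter(lambda v: [lam * x for lam, x in zip(eigenvalues, v)],
                        [1.0, 1.0, 1.0],
                        indicator_chebyshev_coeffs(-0.2, 0.2, 60))
print(filtered)
```

The component belonging to the wanted eigenvalue stays close to 1 while the others are damped toward 0, so a subsequent Lanczos iteration on the filtered operator converges first to the eigenvectors in the interval. Only matrix-vector products are needed, and those are the operations a GPU executes in parallel.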
Solving Bivariate Polynomial Systems on a GPU
International Nuclear Information System (INIS)
Moreno Maza, Marc; Pan Wei
2012-01-01
We present a CUDA implementation of dense multivariate polynomial arithmetic based on Fast Fourier Transforms over finite fields. Our core routine computes on the device (GPU) the subresultant chain of two polynomials with respect to a given variable. This subresultant chain is encoded by values on a FFT grid and is manipulated from the host (CPU) in higher-level procedures. We have realized a bivariate polynomial system solver supported by our GPU code. Our experimental results (including detailed profiling information and benchmarks against a serial polynomial system solver implementing the same algorithm) demonstrate that our strategy is well suited for GPU implementation and provides large speedup factors with respect to pure CPU code.
Length-Bounded Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection
Directory of Open Access Journals (Sweden)
Yi-Shan Lin
2017-01-01
Since frequent communication between applications takes place in high-speed networks, deep packet inspection (DPI) plays an important role in network application awareness. The signature-based network intrusion detection system (NIDS) contains a DPI technique that examines the incoming packet payloads by employing a pattern matching algorithm that dominates the overall inspection performance. Existing studies focused on implementing efficient pattern matching algorithms by parallel programming on software platforms because of the advantages of lower cost and higher scalability. Either the central processing unit (CPU) or the graphic processing unit (GPU) was involved. Our studies focused on designing a pattern matching algorithm based on the cooperation between both CPU and GPU. In this paper, we present an enhanced design for our previous work, a length-bounded hybrid CPU/GPU pattern matching algorithm (LHPMA). In the preliminary experiment, the performance and comparison with the previous work are displayed, and the experimental results show that the LHPMA can achieve not only effective CPU/GPU cooperation but also higher throughput than the previous method.
Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.
Bexelius, Tobias; Sohlberg, Antti
2018-03-21
Statistical SPECT reconstruction can be very time-consuming, especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. The ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size, 3-bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstructions show great promise as an everyday clinical reconstruction tool.
GPU-accelerated Kernel Regression Reconstruction for Freehand 3D Ultrasound Imaging.
Wen, Tiexiang; Li, Ling; Zhu, Qingsong; Qin, Wenjian; Gu, Jia; Yang, Feng; Xie, Yaoqin
2017-07-01
Volume reconstruction method plays an important role in improving reconstructed volumetric image quality for freehand three-dimensional (3D) ultrasound imaging. By utilizing the capability of programmable graphics processing unit (GPU), we can achieve a real-time incremental volume reconstruction at a speed of 25-50 frames per second (fps). After incremental reconstruction and visualization, hole-filling is performed on GPU to fill remaining empty voxels. However, traditional pixel nearest neighbor-based hole-filling fails to reconstruct volume with high image quality. On the contrary, the kernel regression provides an accurate volume reconstruction method for 3D ultrasound imaging but with the cost of heavy computational complexity. In this paper, a GPU-based fast kernel regression method is proposed for high-quality volume after the incremental reconstruction of freehand ultrasound. The experimental results show that improved image quality for speckle reduction and details preservation can be obtained with the parameter setting of kernel window size of [Formula: see text] and kernel bandwidth of 1.0. The computational performance of the proposed GPU-based method can be over 200 times faster than that on central processing unit (CPU), and the volume with size of 50 million voxels in our experiment can be reconstructed within 10 seconds.
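The kernel regression hole-filling step can be illustrated in one dimension. This is a sketch under assumptions: a Nadaraya-Watson estimator with a Gaussian kernel is used, which may differ from the estimator in the paper, and the window `radius` is a hypothetical value because the kernel window size is not recoverable from the abstract. Only the bandwidth of 1.0 is taken from the text.

```python
import math


def fill_holes(values, bandwidth=1.0, radius=2):
    """Fill None entries with a Gaussian-weighted average of known neighbours."""
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        weighted_sum = weight_total = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            if values[j] is None:
                continue  # only measured voxels contribute
            weight = math.exp(-0.5 * ((j - i) / bandwidth) ** 2)
            weighted_sum += weight * values[j]
            weight_total += weight
        if weight_total > 0.0:
            filled[i] = weighted_sum / weight_total
    return filled


print(fill_holes([1.0, None, 3.0, None, None, 6.0]))
```

Each empty voxel is computed from the original measurements only, never from other filled values, so all holes can be processed independently, which suits one GPU thread per voxel.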
Directory of Open Access Journals (Sweden)
M. Huang
2015-09-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere, and its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core when two K40 GPUs are applied.
GPU based framework for geospatial analyses
Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus
2017-04-01
Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is just at the beginning. Being able to use a simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that every scientist wants to have. The need for high-speed computation in the geosciences has increased in the last 10 years, mostly due to the increase in available datasets. These datasets are becoming more and more detailed, and hence they require more space to store and more time to process. Distributed computation on multicore CPUs and GPUs plays an important role by processing small parts of these big datasets one by one. This way of computing speeds up the process, because instead of using just one process for each dataset, the user can use all the cores of a CPU or up to hundreds of cores of a GPU. The framework provides the end user with standalone tools for morphometry analyses at a multiscale level. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from the data collection, may be induced by the model, or may have many other sources. These uncertainties play important roles when a spatial delineation of the phenomena is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with the standalone tools proved to be a reliable tool for modelling complex natural phenomena. The framework is based on NVIDIA CUDA technology and is written in the C++ programming language. The source code will be available on GitHub (https://github.com/sandricionut/GeoRsGPU). Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric
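Monte Carlo uncertainty propagation for a morphometric quantity can be sketched as follows. The example is hypothetical and is not taken from the GeoRsGPU code: it assumes independent Gaussian elevation errors and propagates them through a two-point slope calculation. The elevations, spacing, and error magnitude are illustrative values.

```python
import math
import random
import statistics


def slope_degrees(z1, z2, spacing):
    """Slope angle between two elevation samples separated by `spacing`."""
    return math.degrees(math.atan((z2 - z1) / spacing))


def propagate_uncertainty(z1, z2, spacing, sigma, n_samples=20_000, seed=0):
    """Mean and standard deviation of slope under Gaussian elevation noise."""
    rng = random.Random(seed)
    samples = [
        slope_degrees(z1 + rng.gauss(0.0, sigma), z2 + rng.gauss(0.0, sigma), spacing)
        for _ in range(n_samples)
    ]
    return statistics.mean(samples), statistics.stdev(samples)


mean_slope, slope_sd = propagate_uncertainty(100.0, 110.0, 10.0, sigma=0.5)
print(mean_slope, slope_sd)
```

Every realization is independent of the others, so on a GPU the samples (or the raster cells) can be distributed across threads without synchronization until the final statistics are reduced.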
CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis
Directory of Open Access Journals (Sweden)
Federico Raimondo
2012-01-01
In recent years, Independent Component Analysis (ICA) has become a standard for identifying relevant dimensions of the data in neuroscience. ICA is a very reliable method for analyzing data, but it is computationally very costly. This cost makes the use of ICA for online data analysis, as required in brain-computer interfaces, almost completely prohibitive. We show that, at almost no cost (a fast video card), the speed of ICA can be increased about 25-fold. EEG data, which consist of many independent signals repeated across multiple channels, are very suitable for processing using the vector processors included in graphics units. We profiled the implementation of this algorithm and detected two main types of operations responsible for the processing bottleneck, taking almost 80% of computing time: vector-matrix and matrix-matrix multiplications. Simply replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by GPU manufacturers does not increase performance, owing to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase in performance for the ICA calculation.
Stacked-Bloch-wave electron diffraction simulations using GPU acceleration
International Nuclear Information System (INIS)
Pennington, Robert S.; Wang, Feng; Koch, Christoph T.
2014-01-01
In this paper, we discuss the advantages for Bloch-wave simulations performed using graphics processing units (GPUs), based on approximating the matrix exponential directly instead of performing a matrix diagonalization. Our direct matrix-exponential algorithm yields a functionally identical electron scattering matrix to that generated with matrix diagonalization. Using the matrix-exponential scaling-and-squaring method with a Padé approximation, direct GPU-based matrix-exponential double-precision calculations are up to 20× faster than CPU-based calculations and up to approximately 70× faster than matrix diagonalization. We compare precision and runtime of scaling and squaring methods with either the Padé approximation or a Taylor expansion. We also discuss the stacked-Bloch-wave method, and show that our stacked-Bloch-wave implementation yields the same electron scattering matrix as traditional Bloch-wave matrix diagonalization. - Highlights: • Bloch-wave and stacked-Bloch-wave calculations can be accelerated with GPUs. • Direct approximation of the matrix exponential can be faster than diagonalization. • GPU-based direct approximation can be ≈70× faster than CPU diagonalization. • Larger matrices benefit more from this approach than smaller ones. • Stacked-Bloch-wave scattering results are functionally identical to diagonalization
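The scaling-and-squaring idea can be sketched with the Taylor-expansion variant mentioned in the abstract. This is an illustrative CPU version with hypothetical function names, a fixed number of Taylor terms, and an assumed scaling threshold of 0.5. It is not the authors' GPU code and does not include the Padé variant.

```python
import math


def mat_mul(a, b):
    """Dense product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(a, terms=12):
    """exp(A) by scaling A down, summing a Taylor series, then squaring back."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = 0
    while norm / 2 ** squarings > 0.5:  # scale until the series converges fast
        squarings += 1
    scaled = [[x / 2 ** squarings for x in row] for row in a]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result, term = identity, identity
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):  # exp(A) = exp(A / 2^s) ^ (2^s)
        result = mat_mul(result, result)
    return result


# exp of a rotation generator is a rotation matrix.
theta = 1.0
rotation = expm_taylor([[0.0, theta], [-theta, 0.0]])
print(rotation)
```

The method consists almost entirely of dense matrix-matrix products, which is the reason given in the abstract for its suitability to GPUs compared with diagonalization. The same code works with complex entries, since only addition, multiplication, and `abs` are used.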
A review of GPU-based medical image reconstruction.
Després, Philippe; Jia, Xun
2017-10-01
Tomographic image reconstruction is a computationally demanding task, even more so when advanced models are used to describe a more complete and accurate picture of the image formation process. Such advanced modeling and reconstruction algorithms can lead to better images, often with less dose, but at the price of long calculation times that are hardly compatible with clinical workflows. Fortunately, reconstruction tasks can often be executed advantageously on Graphics Processing Units (GPUs), which are exploited as massively parallel computational engines. This review paper focuses on recent developments made in GPU-based medical image reconstruction, from a CT, PET, SPECT, MRI and US perspective. Strategies and approaches to get the most out of GPUs in image reconstruction are presented as well as innovative applications arising from an increased computing capacity. The future of GPU-based image reconstruction is also envisioned, based on current trends in high-performance computing. Copyright © 2017 Associazione Italiana di Fisica Medica. Published by Elsevier Ltd. All rights reserved.
Large Data Visualization on Distributed Memory Multi-GPU Clusters
Energy Technology Data Exchange (ETDEWEB)
Childs, Henry R.
2010-03-01
Data sets of immense size are regularly generated on large-scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data that are too large to be effectively visualized on standard workstations are now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory and, more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms that effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
Graph coarsening and clustering on the GPU
Fagginger Auer, B.O.; Bisseling, R.H.
2013-01-01
Agglomerative clustering is an effective greedy way to quickly generate graph clusterings of high modularity. In an effort to use the power offered by multi-core CPU and GPU hardware to solve the clustering problem, we introduce a fine-grained shared-memory parallel graph
Use of GPU Computing for Uncertainty Quantification in Computational Mechanics: A Case Study
Directory of Open Access Journals (Sweden)
Gaurav
2011-01-01
Graphics processing units (GPUs) are rapidly emerging as a more economical and highly competitive alternative to CPU-based parallel computing. As the degree of software control of GPUs has increased, many researchers have explored their use in non-gaming applications. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing alternatives in single-instruction multiple-data (SIMD) strategies. This study explores the use of GPUs for uncertainty quantification in computational mechanics. Five types of analysis procedures that are frequently utilized for uncertainty quantification of mechanical and dynamical systems have been considered and their GPU implementations have been developed. The numerical examples presented in this study show that considerable gains in computational efficiency can be obtained for these procedures. It is expected that the GPU implementations presented in this study will serve as initial bases for further developments in the use of GPUs in the field of uncertainty quantification and will (i) aid the understanding of the performance constraints on the relevant GPU kernels and (ii) provide some guidance regarding the computational and the data structures to be utilized in these novel GPU implementations.
GPU-based ultra-fast direct aperture optimization for online adaptive radiation therapy
Men, Chunhua; Jia, Xun; Jiang, Steve B.
2010-08-01
Online adaptive radiation therapy (ART) has great promise to significantly reduce normal tissue toxicity and/or improve tumor control through real-time treatment adaptations based on the current patient anatomy. However, the major technical obstacle for clinical realization of online ART, namely the inability to achieve real-time efficiency in treatment re-planning, has yet to be solved. To overcome this challenge, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) direct aperture optimization (DAO) algorithm on the graphics processing unit (GPU) based on our previous work on the CPU. We formulate the DAO problem as a large-scale convex programming problem, and use an exact method called the column generation approach to deal with its extremely large dimensionality on the GPU. Five 9-field prostate and five 5-field head-and-neck IMRT clinical cases with 5 × 5 mm² beamlet size and 2.5 × 2.5 × 2.5 mm³ voxel size were tested to evaluate our algorithm on the GPU. It takes only 0.7-3.8 s for our implementation to generate high-quality treatment plans on an NVIDIA Tesla C1060 GPU card. Our work has therefore solved a major problem in developing ultra-fast (re-)planning technologies for online ART.
DOPA: GPU-based protein alignment using database and memory access optimizations.
Hasan, Laiq; Kentie, Marijn; Al-Ars, Zaid
2011-07-28
Smith-Waterman (S-W) algorithm is an optimal sequence alignment method for biological databases, but its computational complexity makes it too slow for practical purposes. Heuristics based approximate methods like FASTA and BLAST provide faster solutions but at the cost of reduced accuracy. Also, the expanding volume and varying lengths of sequences necessitate performance efficient restructuring of these databases. Thus to come up with an accurate and fast solution, it is highly desired to speed up the S-W algorithm. This paper presents a high performance protein sequence alignment implementation for Graphics Processing Units (GPUs). The new implementation improves performance by optimizing the database organization and reducing the number of memory accesses to eliminate bandwidth bottlenecks. The implementation is called Database Optimized Protein Alignment (DOPA) and it achieves a performance of 21.4 Giga Cell Updates Per Second (GCUPS), which is 1.13 times better than the fastest GPU implementation to date. In the new GPU-based implementation for protein sequence alignment (DOPA), the database is organized in equal length sequence sets. This equally distributes the workload among all the threads on the GPU's multiprocessors. The result is an improved performance which is better than the fastest available GPU implementation.
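For reference, the Smith-Waterman recurrence and the GCUPS metric quoted above can be sketched as follows. The scoring scheme is an assumption chosen for brevity (simple match/mismatch scores with a linear gap penalty); protein searches normally use a substitution matrix and affine gaps, and none of DOPA's database or memory optimizations are shown. The function names are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(0,                             # start a new alignment
                             previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                             previous[j] + gap,              # gap in b
                             current[j - 1] + gap)           # gap in a
            best = max(best, current[j])
        previous = current
    return best


def gcups(query_length, database_residues, seconds):
    """Giga cell updates per second for one query against a database."""
    return query_length * database_residues / (seconds * 1e9)


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
print(gcups(query_length=500, database_residues=200_000_000, seconds=5.0))
```

The number of cells grows with the product of the query and database lengths, which is the quadratic cost noted in the abstract. Organizing the database into sets of equal-length sequences, as the abstract describes, gives every thread the same number of cells to fill.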
Peng, Bo; Wang, Yuqi; Hall, Timothy J; Jiang, Jingfeng
2017-04-01
Our primary objective of this paper was to extend a previously published 2-D coupled subsample tracking algorithm for 3-D speckle tracking in the framework of ultrasound breast strain elastography. In order to overcome heavy computational cost, we investigated the use of a graphic processing unit (GPU) to accelerate the 3-D coupled subsample speckle tracking method. The performance of the proposed GPU implementation was tested using a tissue-mimicking phantom and in vivo breast ultrasound data. The performance of this 3-D subsample tracking algorithm was compared with the conventional 3-D quadratic subsample estimation algorithm. On the basis of these evaluations, we concluded that the GPU implementation of this 3-D subsample estimation algorithm can provide high-quality strain data (i.e., high correlation between the predeformation and the motion-compensated postdeformation radio frequency echo data and high contrast-to-noise ratio strain images), as compared with the conventional 3-D quadratic subsample algorithm. Using the GPU implementation of the 3-D speckle tracking algorithm, volumetric strain data can be achieved relatively fast (approximately 20 s per volume [2.5 cm ×2.5 cm ×2.5 cm]).
Monte Carlo dose calculations for high-dose-rate brachytherapy using GPU-accelerated processing.
Tian, Z; Zhang, M; Hrycushko, B; Albuquerque, K; Jiang, S B; Jia, X
2016-01-01
Current clinical brachytherapy dose calculations are typically based on the American Association of Physicists in Medicine Task Group report 43 (TG-43) guidelines, which approximate patient geometry as an infinitely large water phantom. This ignores patient and applicator geometries and heterogeneities, causing dosimetric errors. Although Monte Carlo (MC) dose calculation is commonly recognized as the most accurate method, its associated long computational time is a major bottleneck for routine clinical applications. This article presents our recent developments of a fast MC dose calculation package for high-dose-rate (HDR) brachytherapy, gBMC, built on a graphics processing unit (GPU) platform. gBMC simulated photon transport in voxelized geometry, with physics in the ¹⁹²Ir HDR brachytherapy energy range considered. A phase-space file was used as a source model. GPU-based parallel computation was used to simultaneously transport multiple photons, one on each GPU thread. We validated gBMC by comparing the dose calculation results in water with those computed by TG-43. We also studied heterogeneous phantom cases and a patient case and compared gBMC results with Acuros BV results. Radial dose function in water calculated by gBMC showed [...] GPU-based MC dose calculation package, gBMC, for HDR brachytherapy make it attractive for clinical applications. Copyright © 2016 American Brachytherapy Society. Published by Elsevier Inc. All rights reserved.
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.
Zhang, Jing; Wang, Hao; Feng, Wu-Chun
2017-01-01
BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
Parallel computing in cluster of GPU applied to a problem of nuclear engineering
International Nuclear Information System (INIS)
Moraes, Sergio Ricardo S.; Heimlich, Adino; Resende, Pedro
2013-01-01
Cluster computing has been widely used as a low-cost alternative for parallel processing in scientific applications. With the use of the Message-Passing Interface (MPI) protocol, development became even more accessible and widespread in the scientific community. A more recent trend is the use of the Graphics Processing Unit (GPU), a powerful co-processor able to perform hundreds of instructions in parallel, reaching a capacity of hundreds of times the processing of a CPU. However, a standard PC does not, in general, allow more than two GPUs. Hence, this work proposes the development and evaluation of a hybrid low-cost parallel approach to the solution of a typical nuclear engineering problem. The idea is to use cluster parallelism technology (MPI) together with GPU programming techniques (CUDA - Compute Unified Device Architecture) to simulate neutron transport through a slab using the Monte Carlo method. Programs using MPI and CUDA technologies were developed on a cluster comprising four quad-core computers with 2 GPUs each. Experiments applying different configurations, from 1 to 8 GPUs, have been performed, and the results were compared with the sequential (non-parallel) version. A speedup of about 2,000 times has been observed when comparing the 8-GPU version with the sequential version. The results presented here are discussed and analyzed with the objective of outlining the gains and possible limitations of the proposed approach. (author)
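To make the slab problem concrete, the following is a minimal sequential sketch in Python, not the authors' MPI/CUDA code. The physics model is an assumption chosen for illustration: mono-energetic neutrons enter normally at one face, scatter isotropically, and are absorbed with a fixed probability per collision. The function name `slab_transmission` and all its parameters are hypothetical.

```python
import math
import random


def slab_transmission(n_particles, thickness, sigma_t, absorb_prob, seed=0):
    """Estimate the fraction of neutrons transmitted through a 1-D slab.

    Assumed model (not taken from the abstract): neutrons enter at x = 0
    travelling in the +x direction, scatter isotropically, and are
    absorbed with probability `absorb_prob` at each collision.
    """
    rng = random.Random(seed)
    transmitted = 0
    for _ in range(n_particles):
        x, mu = 0.0, 1.0  # position and direction cosine
        while True:
            # Sample the free-flight distance from an exponential law.
            step = -math.log(1.0 - rng.random()) / sigma_t
            x += mu * step
            if x >= thickness:
                transmitted += 1
                break
            if x < 0.0:
                break  # leaked back out of the entrance face
            if rng.random() < absorb_prob:
                break  # absorbed at this collision
            mu = 2.0 * rng.random() - 1.0  # isotropic scattering
    return transmitted / n_particles
```

Each history in the outer loop is independent of the others, which is the property that lets such a simulation be split across GPU threads and MPI ranks. For a pure absorber (`absorb_prob = 1.0`) the estimate should approach exp(-sigma_t * thickness), which gives a simple sanity check.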
Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil
2013-04-04
The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
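The quadratic cost mentioned above comes from filling one dynamic-programming cell per pair of residues. The following is a minimal Python sketch of the Smith-Waterman recurrence, not CUDASW++ code. The scoring scheme (fixed match/mismatch scores and a linear gap penalty) is an assumption chosen for brevity, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always zero
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

The two nested loops visit len(a) * len(b) cells, so the runtime grows with the product of the sequence lengths. Only two rows are kept in memory because the score alone, not the traceback, is computed here.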
Directory of Open Access Journals (Sweden)
Iqbal Al Amri
2012-01-01
Inversely planned intensity-modulated radiotherapy (IMRT) and stereotactic small field radiotherapy should be verified before treatment execution. A second verification is carried out for planned treatments in IMRT and 3D conformal radiotherapy (3D-CRT) using a monitor verification commercial dose calculation management software (DCMS). For the same reference point the ion-chamber measured doses are compared for IMRT plans. DCMS (Diamond) computes dose based on modified Clarkson integration, accounting for multi-leaf collimators (MLC) transmission and measured collimator scatter factors. DCMS was validated with treatment planning system (TPS) (Eclipse 6.5 Version, Varian, USA) separately. Treatment plans computed from TPS are exported to DCMS using DICOM interface. Doses are re-calculated at selected points for fields delivered to IMRT phantom (IBA Scanditronix Wellhofer) in high-energy linac (Clinac 2300 CD, Varian). Doses measured at central axis, for the same points using CC13 (0.13 cc) ion chamber with Dose 1 Electrometer (Scanditronix Wellhofer) are compared with calculated data on DCMS and TPS. The data of 53 IMRT patients with fields ranging from 5 to 9 are reported. The computed dose for selected monitor units (MU) by Diamond showed good agreement with planned doses by TPS. DCMS dose prediction matched well in 3D-CRT forward plans (0.8 ± 1.3%, n = 37) and in IMRT inverse plans (−0.1 ± 2.2%, n = 37). Ion chamber measurements agreed well with Eclipse planned doses (−2.1 ± 2.0%, n = 53) and re-calculated DCMS doses (−1.5 ± 2.6%, n = 37) in phantom. DCMS dose validation is in reasonable agreement with TPS. DCMS calculations corroborate well with ionometric measured doses in most of the treatment plans.
Al Amri, Iqbal; Ravichandran, Ramamoorthy; Sivakumar, Somangili Satyamoorthi; Binukumar, Johnson Pichi; Davis, Chirayathmanjiyil Antony; Al Rahbi, Zakia; Al Shukeili, Khalsa; Al Kindi, Fatima
2012-01-01
Inversely planned intensity-modulated radiotherapy (IMRT) and stereotactic small field radiotherapy should be verified before treatment execution. A second verification is carried out for planned treatments in IMRT and 3D conformal radiotherapy (3D-CRT) using a monitor verification commercial dose calculation management software (DCMS). For the same reference point the ion-chamber measured doses are compared for IMRT plans. DCMS (Diamond) computes dose based on modified Clarkson integration, accounting for multi-leaf collimators (MLC) transmission and measured collimator scatter factors. DCMS was validated with treatment planning system (TPS) (Eclipse 6.5 Version, Varian, USA) separately. Treatment plans computed from TPS are exported to DCMS using DICOM interface. Doses are re-calculated at selected points for fields delivered to IMRT phantom (IBA Scanditronix Wellhofer) in high-energy linac (Clinac 2300 CD, Varian). Doses measured at central axis, for the same points using CC13 (0.13 cc) ion chamber with Dose 1 Electrometer (Scanditronix Wellhofer) are compared with calculated data on DCMS and TPS. The data of 53 IMRT patients with fields ranging from 5 to 9 are reported. The computed dose for selected monitor units (MU) by Diamond showed good agreement with planned doses by TPS. DCMS dose prediction matched well in 3D-CRT forward plans (0.8 ± 1.3%, n = 37) and in IMRT inverse plans (–0.1 ± 2.2%, n = 37). Ion chamber measurements agreed well with Eclipse planned doses (–2.1 ± 2.0%, n = 53) and re-calculated DCMS doses (–1.5 ± 2.6%, n = 37) in phantom. DCMS dose validation is in reasonable agreement with TPS. DCMS calculations corroborate well with ionometric measured doses in most of the treatment plans. PMID:23293456
A Fast Poisson Solver with Periodic Boundary Conditions for GPU Clusters in Various Configurations
Rattermann, Dale Nicholas
Fast Poisson solvers using the Fast Fourier Transform on uniform grids are especially suited for parallel implementation, making them appropriate for portability on graphical processing unit (GPU) devices. The goal of the following work was to implement, test, and evaluate a fast Poisson solver for periodic boundary conditions for use on a variety of GPU configurations. The solver used in this research was FLASH, an immersed-boundary-based method, which is well suited for complex, time-dependent geometries, has robust adaptive mesh refinement/de-refinement capabilities to capture evolving flow structures, and has been successfully implemented on conventional, parallel supercomputers. However, these solvers are still computationally costly to employ, and the total solver time is dominated by the solution of the pressure Poisson equation using state-of-the-art multigrid methods. FLASH improves the performance of its multigrid solvers by integrating a parallel FFT solver on a uniform grid during a coarse level. This hybrid solver could then be theoretically improved by replacing the highly-parallelizable FFT solver with one that utilizes GPUs, and, thus, was the motivation for my research. In the present work, the CPU-utilizing parallel FFT solver (PFFT) used in the base version of FLASH for solving the Poisson equation on uniform grids has been modified to enable parallel execution on CUDA-enabled GPU devices. New algorithms have been implemented to replace the Poisson solver that decompose the computational domain and send each new block to a GPU for parallel computation. One-dimensional (1-D) decomposition of the computational domain minimizes the amount of network traffic involved in this bandwidth-intensive computation by limiting the amount of all-to-all communication required between processes. Advanced techniques have been incorporated and implemented in a GPU-centric code design, while allowing end users the flexibility of parameter control at runtime in
High performance MRI simulations of motion on multi-GPU systems.
Xanthis, Christos G; Venetis, Ioannis E; Aletras, Anthony H
2014-07-04
MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid spurious echoes formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer multi-GPU configuration. The incorporation
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-07
Monte Carlo (MC) particle transport simulation on a graphics-processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry. This limits the application scope of these packages. The purpose of this paper is to develop a module to model parametric geometry and integrate it in GPU-based MC simulations. In our module, each continuous region was defined by its bounding surfaces that were parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated into two previously developed GPU-based MC packages and was tested in two example problems: (1) low energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry. The averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data. When the data was stored in the GPU's shared memory, the highest computational speed was achieved. Incorporation of parameterized geometry yielded a computation time that was ~3 times that in the corresponding voxelized geometry. We also developed a strategy to use an auxiliary index array to reduce the frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time ranged from 1.75 to 2.03 times that of the voxelized geometry for coupled photon/electron transport depending on the voxel dimension of the auxiliary index array, and in 0
A GPU-Parallelized Eigen-Based Clutter Filter Framework for Ultrasound Color Flow Imaging.
Chee, Adrian J Y; Yiu, Billy Y S; Yu, Alfred C H
2017-01-01
Eigen-filters with attenuation response adapted to clutter statistics in color flow imaging (CFI) have shown improved flow detection sensitivity in the presence of tissue motion. Nevertheless, its practical adoption in clinical use is not straightforward due to the high computational cost for solving eigendecompositions. Here, we provide a pedagogical description of how a real-time computing framework for eigen-based clutter filtering can be developed through a single-instruction, multiple data (SIMD) computing approach that can be implemented on a graphical processing unit (GPU). Emphasis is placed on the single-ensemble-based eigen-filtering approach (Hankel singular value decomposition), since it is algorithmically compatible with GPU-based SIMD computing. The key algebraic principles and the corresponding SIMD algorithm are explained, and annotations on how such algorithm can be rationally implemented on the GPU are presented. Real-time efficacy of our framework was experimentally investigated on a single GPU device (GTX Titan X), and the computing throughput for varying scan depths and slow-time ensemble lengths was studied. Using our eigen-processing framework, real-time video-range throughput (24 frames/s) can be attained for CFI frames with full view in azimuth direction (128 scanlines), up to a scan depth of 5 cm ( λ pixel axial spacing) for slow-time ensemble length of 16 samples. The corresponding CFI image frames, with respect to the ones derived from non-adaptive polynomial regression clutter filtering, yielded enhanced flow detection sensitivity in vivo, as demonstrated in a carotid imaging case example. These findings indicate that the GPU-enabled eigen-based clutter filtering can improve CFI flow detection performance in real time.
Heterogeneous GPU&CPU Cluster for High Performance Computing in Cryptography
Directory of Open Access Journals (Sweden)
Michał Marks
2012-01-01
This paper addresses issues associated with distributed computing systems and the application of mixed GPU&CPU technology to data encryption and decryption algorithms. We describe a heterogeneous cluster, HGCC, formed by two types of nodes: Intel processor with NVIDIA graphics processing unit and AMD processor with AMD graphics processing unit (formerly ATI), and a novel software framework that hides the heterogeneity of our cluster and provides tools for solving complex scientific and engineering problems. Finally, we present the results of numerical experiments. The considered case study is concerned with parallel implementations of selected cryptanalysis algorithms. The main goal of the paper is to show the wide applicability of the GPU&CPU technology to large scale computation and data processing.
Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS.
Davis, Nicholas A; Pandey, Ahwan; McKinney, B A
2011-01-15
Bioinformatics researchers have a variety of programming languages and architectures at their disposal, and recent advances in graphics processing unit (GPU) computing have added a promising new option. However, many performance comparisons inflate the actual advantages of GPU technology. In this study, we carry out a realistic performance evaluation of SNPrank, a network centrality algorithm that ranks single nucleotide polymorphisms (SNPs) based on their importance in the context of a phenotype-specific interaction network. Our goal is to identify the best computational engine for the SNPrank web application and to provide a variety of well-tested implementations of SNPrank for Bioinformaticists to integrate into their research. Using SNP data from the Wellcome Trust Case Control Consortium genome-wide association study of Bipolar Disorder, we compare multiple SNPrank implementations, including Python, Matlab and Java as well as CPU versus GPU implementations. When compared with naïve, single-threaded CPU implementations, the GPU yields a large improvement in the execution time. However, with comparable effort, multi-threaded CPU implementations negate the apparent advantage of GPU implementations. The SNPrank code is open source and available at http://insilico.utulsa.edu/snprank.
Haptic Feedback for the GPU-based Surgical Simulator
DEFF Research Database (Denmark)
Sørensen, Thomas Sangild; Mosegaard, Jesper
2006-01-01
The GPU has proven to be a powerful processor to compute spring-mass based surgical simulations. It has not previously been shown however, how to effectively implement haptic interaction with a simulation running entirely on the GPU. This paper describes a method to calculate haptic feedback...... with limited performance cost. It allows easy balancing of the GPU workload between calculations of simulation, visualisation, and the haptic feedback....
GPU-computing in econophysics and statistical physics
Preis, T.
2011-03-01
A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction into the field of GPU computing and includes examples. In particular computationally expensive analyses employed in financial market context are coded on a graphics card architecture which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.
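As an illustration of why the Ising model ports well to a graphics card, the following is a minimal Python sketch of a Metropolis sweep. The checkerboard update order is an assumption made here, not something stated in the abstract, and the function names are hypothetical. The coupling is set to J = 1 with no external field, and the lattice side must be even for the two-colour split to work with periodic boundaries.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of an n x n Ising lattice (n even, periodic).

    Sites are updated colour by colour. Sites of the same colour share no
    bonds, so on a GPU each of them could be handled by its own thread.
    """
    n = len(spins)
    for colour in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != colour:
                    continue
                neighbours = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                              + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
                delta_e = 2.0 * spins[i][j] * neighbours  # energy cost of a flip
                if delta_e <= 0.0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = -spins[i][j]


def magnetisation(spins):
    """Mean spin per site."""
    n = len(spins)
    return sum(sum(row) for row in spins) / (n * n)
```

In this sequential sketch the two inner loops simply run in order. The point of the colouring is that every update within one colour reads only spins of the other colour, so those updates do not depend on each other.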
Explicit integration with GPU acceleration for large kinetic networks
International Nuclear Information System (INIS)
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike
2015-01-01
We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
Multi-GPU based acceleration of a list-mode DRAMA toward real-time OpenPET imaging
Energy Technology Data Exchange (ETDEWEB)
Kinouchi, Shoko [Chiba Univ. (Japan); National Institute of Radiological Sciences, Chiba (Japan); Yamaya, Taiga; Yoshida, Eiji; Tashima, Hideaki [National Institute of Radiological Sciences, Chiba (Japan); Kudo, Hiroyuki [Tsukuba Univ., Ibaraki (Japan); Suga, Mikio [Chiba Univ. (Japan)
2011-07-01
OpenPET, which has a physical gap between two detector rings, is our new PET geometry. In order to realize future radiation therapy guided by OpenPET, real-time imaging is required. Therefore we developed a list-mode image reconstruction method using general purpose graphic processing units (GPUs). For GPU implementation, the efficiency of acceleration depends on the implementation method which is required to avoid conditional statements. Therefore, in our previous study, we developed a new system model which was suited for the GPU implementation. In this paper, we implemented our image reconstruction method using 4 GPUs to get further acceleration. We applied the developed reconstruction method to a small OpenPET prototype. We obtained calculation times of total iteration using 4 GPUs that were 3.4 times faster than using a single GPU. Compared to using a single CPU, we achieved the reconstruction time speed-up of 142 times using 4 GPUs. (orig.)
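The two reported ratios can be combined with simple arithmetic. The short Python sketch below computes the parallel efficiency of the 4-GPU run and the single-GPU speedup it implies. The second figure rests on an assumption not stated in the abstract, namely that both ratios were measured on the same workload. The variable and function names are hypothetical.

```python
def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling achieved by `n_devices`."""
    return speedup / n_devices


multi_vs_single_gpu = 3.4   # 4 GPUs vs 1 GPU, from the abstract
multi_vs_cpu = 142.0        # 4 GPUs vs 1 CPU, from the abstract

efficiency = parallel_efficiency(multi_vs_single_gpu, 4)
# Implied single-GPU speedup over the CPU, assuming both ratios refer to
# the same workload (an assumption, not stated in the abstract).
single_gpu_vs_cpu = multi_vs_cpu / multi_vs_single_gpu
print(f"efficiency = {efficiency:.2f}, single GPU vs CPU = {single_gpu_vs_cpu:.1f}x")
```

Under that assumption, the 4-GPU configuration runs at about 85% of ideal linear scaling, and a single GPU would be roughly 42 times faster than the single CPU.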
Directory of Open Access Journals (Sweden)
Nicholas P. Bailey, Trond S. Ingebrigtsen, Jesper Schmidt Hansen, Arno A. Veldhorst, Lasse Bøhling, Claire A. Lemarchand, Andreas E. Olsen, Andreas K. Bacher, Lorenzo Costigliola, Ulf R. Pedersen, Heine Larsen, Jeppe C. Dyre, Thomas B. Schrøder
2017-12-01
RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPU's). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up to hundred thousand particles). It has a performance that is comparable to other GPU-MD codes at large system sizes and substantially better at smaller sizes. RUMD is open-source and consists of a library written in C++ and the CUDA extension to C, an easy-to-use Python interface, and a set of tools for set-up and post-simulation data analysis. The paper describes RUMD's main features, optimizations and performance benchmarks.
A GPU implementation of the Simulated Annealing Heuristic for the Quadratic Assignment Problem
Paul, Gerald
2012-01-01
The quadratic assignment problem (QAP) is one of the most difficult combinatorial optimization problems. An effective heuristic for obtaining approximate solutions to the QAP is simulated annealing (SA). Here we describe an SA implementation for the QAP which runs on a graphics processing unit (GPU). GPUs are composed of low cost commodity graphics chips which in combination provide a powerful platform for general purpose parallel computing. For SA runs with large numbers of iterations, we fi...
DEFF Research Database (Denmark)
Bailey, Nicholas; Ingebrigtsen, Trond; Hansen, Jesper Schmidt
2017-01-01
RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPU’s). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up...
Large Scale Simulations of the Euler Equations on GPU Clusters
Liebmann, Manfred
2010-08-01
The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one billion elements. We investigate communication protocols for the GPU cluster to compensate for the slow Gigabit Ethernet network between the GPU compute nodes and to maintain overall efficiency. A diesel engine intake-port and a nozzle, meshed in different resolutions, give good real world examples for the scalability tests on the GPU cluster. © 2010 IEEE.
GPU-based parallel clustered differential pulse code modulation
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, and atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, the resolution of hyperspectral images has increased greatly, and so has their data size. To reduce storage and transmission costs, lossless compression of hyperspectral images has become an important research topic. In recent years, a large number of algorithms have been proposed to reduce the redundancy between different spectra. Among them, the most classical and extensible algorithm is the Clustered Differential Pulse Code Modulation (C-DPCM) algorithm. This algorithm has three parts. First, it clusters all spectral lines and trains linear predictors for each band. Second, it uses these predictors to predict pixels and obtains the residual image by subtracting the predicted image from the original image. Finally, it encodes the residual image. However, calculating the predictors is time-consuming. To improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) with a GPU. Recently, general-purpose computing on GPUs has developed greatly. The capacity of GPUs improves rapidly as the number of processing units and storage control units increases. CUDA is a parallel computing platform and programming model created by NVIDIA. It gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to calculate the predictors in parallel. By adopting global memory, shared memory, and register memory, respectively, we finally obtain a decent speedup.
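The predict-and-subtract step can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the published algorithm: it assumes a single cluster and a one-coefficient predictor that estimates each band from the previous band only, and it omits the clustering and the final entropy coding. All function names are hypothetical.

```python
def fit_gain(prev_band, band):
    """Least-squares gain a minimising sum((band - a * prev_band) ** 2)."""
    denom = sum(p * p for p in prev_band)
    return sum(p * c for p, c in zip(prev_band, band)) / denom if denom else 0.0


def encode_band(prev_band, band, gain):
    """Integer residuals between a band and its rounded prediction."""
    return [c - round(gain * p) for p, c in zip(prev_band, band)]


def decode_band(prev_band, residuals, gain):
    """Invert encode_band exactly, given the same gain and previous band."""
    return [r + round(gain * p) for p, r in zip(prev_band, residuals)]


prev_band = [10, 20, 30, 40]   # made-up pixel values for illustration
band = [21, 39, 61, 80]
gain = fit_gain(prev_band, band)
residuals = encode_band(prev_band, band, gain)
restored = decode_band(prev_band, residuals, gain)
```

Because the prediction is rounded to an integer before subtraction, decoding recovers the original pixel values exactly, which is what makes the scheme lossless. The residuals are small when the predictor fits well, so they are cheaper to encode than the raw pixels. Fitting a predictor for each band is independent of the other bands, which is consistent with the abstract's idea of computing the predictors in parallel.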
Moving-Target Position Estimation Using GPU-Based Particle Filter for IoT Sensing Applications
Directory of Open Access Journals (Sweden)
Seongseop Kim
2017-11-01
A particle filter (PF) has been introduced for effective position estimation of moving targets for non-Gaussian and nonlinear systems. The time difference of arrival (TDOA) method using an acoustic sensor array has normally been used for estimation by concealing the location of a moving target, especially underwater. In this paper, we propose a GPU-based acceleration of target position estimation using a PF and propose an efficient system and software architecture. The proposed graphic processing unit (GPU)-based algorithm has more advantages in applying PF signal processing to a target system, which consists of large-scale Internet of Things (IoT)-driven sensors, because of the parallelization, which is scalable. For the TDOA measurement from the acoustic sensor array, we use the generalized cross correlation phase transform (GCC-PHAT) method to obtain the correlation coefficient of the signal using the Fast Fourier Transform (FFT), and we try to accelerate the calculations of GCC-PHAT based TDOA measurements using FFT with GPU compute unified device architecture (CUDA). The proposed approach utilizes a parallelization method in the target position estimation algorithm using GPU-based PF processing. In addition, it could efficiently estimate sudden movement change of the target using GPU-based parallel computing, which also can be used for multiple target tracking. It also provides scalability in extending the detection algorithm according to the increase of the number of sensors. Therefore, the proposed architecture can be applied in IoT sensing applications with a large number of sensors. The target estimation algorithm was verified using MATLAB and implemented using GPU CUDA. We implemented the proposed signal processing acceleration system using the target GPU to analyze it in terms of execution time. The execution time of the algorithm is reduced by 55% relative to the CPU standalone operation on the target embedded board, NVIDIA Jetson TX1. Also, to apply large
Accelerating Dense Linear Algebra on the GPU
DEFF Research Database (Denmark)
Sørensen, Hans Henrik Brandenborg
and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subroutines (BLAS) library and are therefore of great importance in many scientific applications. The target hardware is the most recent NVIDIA Tesla 20-series (Fermi...... architecture). Most of the techniques I discuss for accelerating dense linear algebra are applicable to memory-bound GPU algorithms in general....
Topping, T. Russell; French, James; Hancock, Monte F., Jr.
2010-04-01
Working with the Naval Research Laboratory, Celestech has implemented advanced non-linear hyperspectral image (HSI) processing algorithms optimized for Graphics Processing Units (GPU). These algorithms have demonstrated performance improvements of nearly 2 orders of magnitude over optimal CPU-based implementations. The paper briefly covers the architecture of the NVIDIA GPU to provide a basis for discussing GPU optimization challenges and strategies. The paper then covers optimization approaches employed to extract performance from the GPU implementation of Dr. Bachmann's algorithms including memory utilization and process thread optimization considerations. The paper goes on to discuss strategies for deploying GPU-enabled servers into enterprise service oriented architectures. Also discussed are Celestech's on-going work in the area of middleware frameworks to provide an optimized multi-GPU utilization and scheduling approach that supports both multiple GPUs in a single computer as well as across multiple computers. This paper is a complementary work to the paper submitted by Dr. Charles Bachmann entitled "A Scalable Approach to Modeling Nonlinear Structure in Hyperspectral Imagery and Other High-Dimensional Data Using Manifold Coordinate Representations". Dr. Bachmann's paper covers the algorithmic and theoretical basis for the HSI processing approach.
High energy electromagnetic particle transportation on the GPU
Energy Technology Data Exchange (ETDEWEB)
Canal, P. [Fermilab; Elvira, D. [Fermilab; Jun, S. Y. [Fermilab; Kowalkowski, J. [Fermilab; Paterno, M. [Fermilab; Apostolakis, J. [CERN
2014-01-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems
Energy Technology Data Exchange (ETDEWEB)
Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol
2013-04-09
A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J. Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing units (GPUs), is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics
Bard, Christopher; Dorelli, John C.
2013-01-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approximately 126 for a 1024 × 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
Accelerating Biomedical Signal Processing Using GPU: A Case Study of Snore Sound Feature Extraction.
Guo, Jian; Qian, Kun; Zhang, Gongxuan; Xu, Huijie; Schuller, Björn
2017-12-01
The advent of 'Big Data' and 'Deep Learning' offers both a great challenge and a huge opportunity for personalised health-care. In machine learning-based biomedical data analysis, feature extraction is a key step for 'feeding' the subsequent classifiers. With the increasing volume of biomedical data, extracting features from these 'big' data is an intensive and time-consuming task. In this case study, we employ a Graphics Processing Unit (GPU) via Python to extract features from a large corpus of snore sound data. Those features can subsequently be imported into many well-known deep learning training frameworks without any format processing. The snore sound data were collected from several hospitals (20 subjects, with 770-990 MB per subject, 17.20 GB in total). Experimental results show that our GPU-based processing speeds up the feature extraction phase by up to seven times compared to the previous CPU system.
GPU based real-time instrument tracking with three dimensional ultrasound.
Novotny, Paul M; Stoll, Jeffrey A; Vasilyev, Nikolay V; del Nido, Pedro J; Dupont, Pierre E; Howe, Robert D
2006-01-01
Real-time 3D ultrasound can enable new image-guided surgical procedures, but high data rates prohibit the use of traditional tracking techniques. We present a new method based on the modified Radon transform that identifies the axis of instrument shafts as bright patterns in planar projections. Instrument rotation and tip location are then determined using fiducial markers. These techniques are amenable to rapid execution on the current generation of personal computer graphics processor units (GPU). Our GPU implementation detected a surgical instrument in 31 ms, sufficient for real-time tracking at the 26 volumes per second rate of the ultrasound machine. A water tank experiment found instrument tip position errors of less than 0.2 mm, and an in vivo study tracked an instrument inside a beating porcine heart. The tracking results showed good correspondence to the actual movements of the instrument.
Zhmurov, A; Dima, R I; Kholodov, Y; Barsegov, V
2010-11-01
Theoretical exploration of fundamental biological processes involving the forced unraveling of multimeric proteins, the sliding motion in protein fibers and the mechanical deformation of biomolecular assemblies under physiological force loads is challenging even for distributed computing systems. Using a Cα-based coarse-grained self organized polymer (SOP) model, we implemented the Langevin simulations of proteins on graphics processing units (SOP-GPU program). We assessed the computational performance of an end-to-end application of the program, where all the steps of the algorithm are running on a GPU, by profiling the simulation time and memory usage for a number of test systems. The ∼90-fold computational speedup on a GPU, compared with an optimized central processing unit program, enabled us to follow the dynamics in the centisecond timescale, and to obtain the force-extension profiles using experimental pulling speeds (v_f = 1-10 μm/s) employed in atomic force microscopy and in optical tweezers-based dynamic force spectroscopy. We found that the mechanical molecular response critically depends on the conditions of force application and that the kinetics and pathways for unfolding change drastically even upon a modest 10-fold increase in v_f. This implies that, to resolve accurately the free energy landscape and to relate the results of single-molecule experiments in vitro and in silico, molecular simulations should be carried out under the experimentally relevant force loads. This can be accomplished in reasonable wall-clock time for biomolecules of size as large as 10^5 residues using the SOP-GPU package. © 2010 Wiley-Liss, Inc.
GPU-based RFA simulation for minimally invasive cancer treatment of liver tumours.
Mariappan, Panchatcharam; Weir, Phil; Flanagan, Ronan; Voglreiter, Philip; Alhonnoro, Tuomas; Pollari, Mika; Moche, Michael; Busse, Harald; Futterer, Jurgen; Portugaller, Horst Rupert; Sequeiros, Roberto Blanco; Kolesnik, Marina
2017-01-01
Radiofrequency ablation (RFA) is one of the most popular and well-standardized minimally invasive cancer treatments (MICT) for liver tumours, employed where surgical resection has been contraindicated. Less-experienced interventional radiologists (IRs) require an appropriate planning tool for the treatment to help avoid incomplete treatment and so reduce the tumour recurrence risk. Although a few tools are available to predict the ablation lesion geometry, the process is computationally expensive. Also, in our implementation, a few patient-specific parameters are used to improve the accuracy of the lesion prediction. Advanced heterogeneous computing using personal computers, incorporating the graphics processing unit (GPU) and the central processing unit (CPU), is proposed to predict the ablation lesion geometry. The most recent GPU technology is used to accelerate the finite element approximation of Pennes' bioheat equation and a three-state cell model. Patient-specific input parameters are used in the bioheat model to improve accuracy of the predicted lesion. A fast GPU-based RFA solver is developed to predict the lesion by doing most of the computational tasks on the GPU, while reserving the CPU for concurrent tasks such as lesion extraction based on the heat deposition at each finite element node. The solver takes less than 3 min for a treatment duration of 26 min. When the model receives patient-specific input parameters, the deviation between real and predicted lesion is below 3 mm. A multi-centre retrospective study indicates that the fast RFA solver is capable of providing the IR with the predicted lesion in the short time period before the intervention begins when the patient has been clinically prepared for the treatment.
GPU Pro advanced rendering techniques
Engel, Wolfgang
2010-01-01
This book covers essential tools and techniques for programming the graphics processing unit. Brought to you by Wolfgang Engel and the same team of editors who made the ShaderX series a success, this volume covers advanced rendering techniques, engine design, GPGPU techniques, related mathematical techniques, and game postmortems. A special emphasis is placed on handheld programming to account for the increased importance of graphics on mobile devices, especially the iPhone and iPod touch. Example programs and source code can be downloaded from the book's CRC Press web page.
Strategies for regular segmented reductions on GPU
DEFF Research Database (Denmark)
Larsen, Rasmus Wriedt; Henriksen, Troels
2017-01-01
We present and evaluate an implementation technique for regular segmented reductions on GPUs. Existing techniques tend to be either consistent in performance but relatively inefficient in absolute terms, or optimised for specific workloads and thereby exhibiting bad performance for certain input...... is in the context of the Futhark compiler, the implementation technique is applicable to any library or language that has a need for segmented reductions. We evaluate the technique on four microbenchmarks, two of which we also compare to implementations in the CUB library for GPU programming, as well as on two...
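For readers unfamiliar with the operation named in this abstract, a regular segmented reduction reduces each equally sized segment of a flat array independently. The following is a minimal sequential reference sketch, not the Futhark or CUB implementation; the function name `regular_segmented_reduce` and its parameters are hypothetical and chosen only for illustration.

```python
from functools import reduce
import operator


def regular_segmented_reduce(values, segment_size, op=operator.add, neutral=0):
    """Reduce each consecutive block of `segment_size` elements with `op`.

    "Regular" means every segment has the same length, so the input
    length must be a multiple of `segment_size`.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if len(values) % segment_size != 0:
        raise ValueError("len(values) must be a multiple of segment_size")
    return [
        # Each segment is reduced independently of the others, which is
        # what makes the operation a natural fit for parallel hardware.
        reduce(op, values[start:start + segment_size], neutral)
        for start in range(0, len(values), segment_size)
    ]
```

On a GPU the segments (and, for long segments, pieces of each segment) would be assigned to different threads or thread groups; this sketch only fixes the expected result.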
Better Faster Noise with the GPU
DEFF Research Database (Denmark)
Wyvill, Geoff; Frisvad, Jeppe Revall
Filtered noise [Perlin 1985] has, for twenty years, been a fundamental tool for creating functional texture and it has many other applications; for example, animating water waves or the motion of grass waving in the wind. Perlin noise suffers from a number of defects and there have been many atte...... attempts to create better or faster noise but Perlin’s ‘Gradient Noise’ has consistently proved to be the best compromise between speed and quality. Our objective was to create a better noise cheaply by use of the GPU....
International Nuclear Information System (INIS)
Huang Bormin; Mielikainen, Jarno; Oh, Hyunjong; Allen Huang, Hung-Lung
2011-01-01
Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is well suited to GPU implementation, taking advantage of the hardware's efficiency and parallelism, because radiances of many channels can be calculated in parallel on GPUs. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI) launched in 2006 onboard the first European meteorological polar-orbiting satellite, METOP-A. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module for computing the regression predictors takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs with a total of 960 compute cores, delivering nearly 4 TFlops theoretical peak performance. By massively parallelizing the second and third modules, we reached 364x
Pizette, Patrick; Govender, Nicolin; Wilke, Daniel N.; Abriak, Nor-Edine
2017-06-01
The use of the Discrete Element Method (DEM) for industrial civil engineering applications is currently limited due to the computational demands when large numbers of particles are considered. The graphics processing unit (GPU) with its highly parallelized hardware architecture shows potential to enable solution of civil engineering problems using discrete granular approaches. We demonstrate in this study the practical utility of a validated GPU-enabled DEM modeling environment to simulate industrial scale granular problems. As an illustration, the flow discharge of storage silos using 8 and 17 million particles is considered. DEM simulations have been performed to investigate the influence of particle size (equivalent size for the 20/40-mesh gravel) and induced shear stress for two hopper shapes. The preliminary results indicate that the shape of the hopper significantly influences the discharge rates for the same material. Specifically, this work shows that GPU-enabled DEM modeling environments can model industrial scale problems on a single portable computer within a day for 30 seconds of process time.
Directory of Open Access Journals (Sweden)
Pizette Patrick
2017-01-01
The use of the Discrete Element Method (DEM) for industrial civil engineering applications is currently limited due to the computational demands when large numbers of particles are considered. The graphics processing unit (GPU) with its highly parallelized hardware architecture shows potential to enable solution of civil engineering problems using discrete granular approaches. We demonstrate in this study the practical utility of a validated GPU-enabled DEM modeling environment to simulate industrial scale granular problems. As an illustration, the flow discharge of storage silos using 8 and 17 million particles is considered. DEM simulations have been performed to investigate the influence of particle size (equivalent size for the 20/40-mesh gravel) and induced shear stress for two hopper shapes. The preliminary results indicate that the shape of the hopper significantly influences the discharge rates for the same material. Specifically, this work shows that GPU-enabled DEM modeling environments can model industrial scale problems on a single portable computer within a day for 30 seconds of process time.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
Lyakh, Dmitry I.
2015-04-01
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
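The baseline mentioned in this abstract, a naive scattering transpose with no memory access optimization, can be sketched as follows. This is a plain sequential illustration under the assumption of row-major storage in a flat list; it is not code from TAL-SH, and the name `transpose_scatter` and the convention that output dimension `i` equals input dimension `perm[i]` are choices made here for the example.

```python
def transpose_scatter(data, shape, perm):
    """Permute the dimensions of a dense row-major tensor held in a flat list.

    Output dimension i corresponds to input dimension perm[i].
    Returns (flat_output, output_shape).
    """
    rank = len(shape)
    if sorted(perm) != list(range(rank)):
        raise ValueError("perm must be a permutation of range(len(shape))")
    size = 1
    for extent in shape:
        size *= extent
    if len(data) != size:
        raise ValueError("data length does not match shape")

    out_shape = [shape[p] for p in perm]
    # Row-major strides of the output tensor.
    out_strides = [1] * rank
    for i in range(rank - 2, -1, -1):
        out_strides[i] = out_strides[i + 1] * out_shape[i + 1]

    out = [None] * size
    for lin in range(size):
        # Decode the input multi-index from the linear offset.
        idx = [0] * rank
        rem = lin
        for d in range(rank - 1, -1, -1):
            idx[d] = rem % shape[d]
            rem //= shape[d]
        # Scatter: reads are contiguous, writes land at strided offsets.
        dest = sum(idx[perm[i]] * out_strides[i] for i in range(rank))
        out[dest] = data[lin]
    return out, out_shape
```

The strided writes in the inner loop are the access pattern that cache blocking on CPUs and shared-memory staging on GPUs are meant to avoid.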
Li, Pengcheng; Liu, Celong; Li, Xianpeng; He, Honghui; Ma, Hui
2016-09-20
In earlier studies, we developed scattering models and the corresponding CPU-based Monte Carlo simulation programs to study the behavior of polarized photons as they propagate through complex biological tissues. Studying the simulation results involves many degrees of freedom, which created a demand for massive simulation tasks. In this paper, we report a parallel implementation of the simulation program based on the compute unified device architecture running on a graphics processing unit (GPU). Different schemes for sphere-only simulations and sphere-cylinder mixture simulations were developed. Diverse optimizing methods were employed to achieve the best acceleration. The final-version GPU program is hundreds of times faster than the CPU version. The dependence of the performance on input parameters and precision was also studied. It is shown that using single precision in the GPU simulations results in very limited losses in accuracy. Consumer-level graphics cards, even those in laptop computers, are more cost-effective than scientific graphics cards for single-precision computation.
A cache-friendly sampling strategy for texture-based volume rendering on GPU
Directory of Open Access Journals (Sweden)
Junpeng Wang
2017-06-01
Texture-based volume rendering is a memory-intensive algorithm. Its performance relies heavily on the performance of the texture cache. However, most existing texture-based volume rendering methods blindly map computational resources to texture memory and result in incoherent memory access patterns, causing low cache hit rates in certain cases. The distance between samples taken by threads of an atomic scheduling unit (e.g. a warp of 32 threads in CUDA) of the GPU is a crucial factor that affects the texture cache performance. Based on this fact, we present a new sampling strategy, called Warp Marching, for the ray-casting algorithm of texture-based volume rendering. The effects of different sample organizations and different thread-pixel mappings in the ray-casting algorithm are thoroughly analyzed. Also, a pipelined color blending approach is introduced and the power of warp-level GPU operations is leveraged to improve the efficiency of parallel executions on the GPU. In addition, the rendering performance of Warp Marching is view-independent, and it outperforms existing empty space skipping techniques in scenarios that need to render large dynamic volumes in a low resolution image. Through a series of micro-benchmarking and real-life data experiments, we rigorously analyze our sampling strategies and demonstrate significant performance enhancements over existing sampling methods.
GPU-Based FFT Computation for Multi-Gigabit WirelessHD Baseband Processing
Directory of Open Access Journals (Sweden)
Nicholas Hinitt
2010-01-01
The next generation Graphics Processing Units (GPUs) are being considered for non-graphics applications. Millimeter wave (60 GHz) wireless networks that are capable of multi-gigabit per second (Gbps) transfer rates require a significant baseband throughput. In this work, we consider the baseband of WirelessHD, a 60 GHz communications system, which can provide a data rate of up to 3.8 Gbps over a short range wireless link. Thus, we explore the feasibility of achieving gigabit baseband throughput using the GPUs. One of the most computationally intensive functions commonly used in baseband communications, the Fast Fourier Transform (FFT) algorithm, is implemented on an NVIDIA GPU using their general-purpose computing platform called the Compute Unified Device Architecture (CUDA). The paper first investigates the implementation of an FFT algorithm using the GPU hardware and exploiting the computational capability available. It then outlines the limitations discovered and the methods used to overcome these challenges. Finally, a new algorithm to compute the FFT is proposed, which reduces interprocessor communication. It is further optimized by improving memory access, enabling the processing rate to exceed 4 Gbps, achieving a processing time of a 512-point FFT in less than 200 ns using a two-GPU solution.
GPURFSCREEN: a GPU based virtual screening tool using random forest classifier.
Jayaraj, P B; Ajay, Mathias K; Nufail, M; Gopakumar, G; Jaleel, U C A
2016-01-01
In-silico methods are an integral part of the modern drug discovery paradigm. Virtual screening, an in-silico method, is used to refine data models and reduce the chemical space on which wet lab experiments need to be performed. Virtual screening of a ligand data model requires large scale computations, making it a highly time consuming task. This process can be sped up by implementing parallelized algorithms on a Graphics Processing Unit (GPU). Random Forest is a robust classification algorithm that can be employed in virtual screening. A ligand based virtual screening tool (GPURFSCREEN) that uses random forests on GPU systems has been proposed and evaluated in this paper. This tool produces optimized results at a lower execution time for large bioassay data sets. The quality of results produced by our tool on GPU is the same as that on a regular serial environment. Considering the magnitude of data to be screened, the parallelized virtual screening has a significantly lower running time at high throughput. The proposed parallel tool outperforms its serial counterpart by successfully screening billions of molecules in training and prediction phases.
3D brain tumor localization and parameter estimation using thermographic approach on GPU.
Bousselham, Abdelmajid; Bouattane, Omar; Youssfi, Mohamed; Raihani, Abdelhadi
2018-01-01
The aim of this paper is to present a GPU parallel algorithm for brain tumor detection that estimates tumor size and location from the surface temperature distribution obtained by thermography. The normal brain tissue is modeled as a rectangular cube containing a spherical tumor. The temperature distribution is calculated using the forward three-dimensional Pennes bioheat transfer equation, which is solved using a massively parallel Finite Difference Method (FDM) implemented on a Graphics Processing Unit (GPU). A Genetic Algorithm (GA) was used to solve the inverse problem and estimate the tumor size and location by minimizing an objective function that compares measured surface temperatures with those obtained by numerical simulation. The parallel implementation of the Finite Difference Method significantly reduces the time of the bioheat transfer computation and greatly accelerates the inverse identification of brain tumor thermophysical and geometrical properties. Experimental results show significant gains in the computational speed on GPU and achieve a speedup of around 41 compared to the CPU. An analysis of the estimation performance as a function of tumor size inside brain tissue is also presented. Copyright © 2017 Elsevier Ltd. All rights reserved.
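To make the forward model in this abstract concrete, the sketch below advances a one-dimensional form of the Pennes bioheat equation, rho*c*dT/dt = k*d2T/dx2 + w_b*rho_b*c_b*(T_a - T) + Q_m, by one explicit finite-difference step. It is an assumption-laden simplification: the paper solves the three-dimensional problem on a GPU, whereas this is a sequential 1-D illustration with fixed (Dirichlet) end points, and the function name `pennes_step_1d` and all parameter names are hypothetical.

```python
def pennes_step_1d(T, dt, dx, k, rho, c, w_b, rho_b, c_b, T_a, Q_m=0.0):
    """One explicit time step of the 1-D Pennes bioheat equation.

    T      : list of node temperatures; the two end nodes are held fixed
    k      : tissue thermal conductivity
    rho, c : tissue density and specific heat
    w_b    : blood perfusion rate; rho_b, c_b: blood density and specific heat
    T_a    : arterial blood temperature; Q_m: metabolic heat generation
    """
    perfusion = w_b * rho_b * c_b
    new_T = list(T)
    for i in range(1, len(T) - 1):
        # Central second difference for the conduction term.
        conduction = k * (T[i - 1] - 2.0 * T[i] + T[i + 1]) / (dx * dx)
        # Perfusion pulls tissue toward arterial temperature; Q_m adds heat.
        source = perfusion * (T_a - T[i]) + Q_m
        new_T[i] = T[i] + dt * (conduction + source) / (rho * c)
    return new_T
```

Each interior node depends only on its neighbours from the previous step, so all nodes can be updated independently, which is the property a GPU implementation exploits. The explicit scheme is only stable for sufficiently small `dt` relative to `dx`.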
A GPU Implementation of Local Search Operators for Symmetric Travelling Salesman Problem
Directory of Open Access Journals (Sweden)
Juraj Fosin
2013-06-01
The Travelling Salesman Problem (TSP) is one of the most studied combinatorial optimization problems and is significant in many practical transportation applications. The TSP is an NP-hard problem and requires large computation power to be solved by exact algorithms. In the past few years, fast development of general-purpose Graphics Processing Units (GPUs) has brought huge improvements in decreasing applications' execution time. In this paper, we implement 2-opt and 3-opt local search operators for solving the TSP on the GPU using CUDA. The novelty presented in this paper is a new parallel iterated local search approach with 2-opt and 3-opt operators for symmetric TSP, optimized for execution on GPUs. With our implementation, large TSP problems (up to 85,900 cities) can be solved using the GPU. We will show that our GPU implementation can be up to 20x faster without losing quality for all TSPlib problems as well as for our CRO TSP problem.
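As background for the 2-opt operator named in this abstract, here is a minimal sequential sketch for the symmetric Euclidean TSP. It is not the authors' CUDA implementation or their iterated local search; it only shows the move itself (remove two edges, reconnect by reversing the segment between them) applied until no improving move remains. The names `tour_length` and `two_opt` are hypothetical.

```python
import math


def tour_length(tour, coords):
    """Total length of the closed tour visiting coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
               for i in range(n))


def two_opt(tour, coords):
    """Apply improving 2-opt moves until the tour is 2-opt locally optimal."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city; nothing to swap
                a, b = coords[tour[i]], coords[tour[i + 1]]
                c, d = coords[tour[j]], coords[tour[(j + 1) % n]]
                # Gain from replacing edges (a,b),(c,d) with (a,c),(b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    # Reversing the segment between the edges reconnects the tour.
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour
```

The evaluation of `delta` for every pair `(i, j)` is independent of the others, which is the part a GPU version would typically distribute across threads.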
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Directory of Open Access Journals (Sweden)
Jinwei Wang
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design uses fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics
Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.
2017-05-01
We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing
Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.
2014-01-01
Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to owning large local GPU-based infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055
GPU-Based 3D Cone-Beam CT Image Reconstruction for Large Data Volume
Directory of Open Access Journals (Sweden)
Xing Zhao
2009-01-01
Currently, 3D cone-beam CT image reconstruction speed is still a severe limitation for clinical application. The computational power of modern graphics processing units (GPUs) has been harnessed to provide impressive acceleration of 3D volume image reconstruction. For extra-large data volumes exceeding the physical graphics memory of the GPU, a straightforward compromise is to divide the data volume into blocks. Different from the conventional Octree partition method, a new partition scheme is proposed in this paper. This method divides both projection data and reconstructed image volume into subsets according to geometric symmetries in circular cone-beam projection layout, and a fast reconstruction for large data volume can be implemented by packing the subsets of projection data into the RGBA channels of the GPU, performing the reconstruction chunk by chunk and combining the individual results in the end. The method is evaluated by reconstructing 3D images from computer-simulation data and real micro-CT data. Our results indicate that the GPU implementation can maintain original precision and speed up the reconstruction process by 110–120 times for circular cone-beam scan, as compared to traditional CPU implementation.
On-the-fly generation and rendering of infinite cities on the GPU
Steinberger, Markus
2014-05-01
In this paper, we present a new approach for shape-grammar-based generation and rendering of huge cities in real-time on the graphics processing unit (GPU). Traditional approaches rely on evaluating a shape grammar and storing the geometry produced as a preprocessing step. During rendering, the pregenerated data is then streamed to the GPU. By interweaving generation and rendering, we overcome the problems and limitations of streaming pregenerated data. Using our methods of visibility pruning and adaptive level of detail, we are able to dynamically generate only the geometry needed to render the current view in real-time directly on the GPU. We also present a robust and efficient way to dynamically update a scene's derivation tree and geometry, enabling us to exploit frame-to-frame coherence. Our combined generation and rendering is significantly faster than all previous work. For detailed scenes, we are capable of generating geometry more rapidly than even just copying pregenerated data from main memory, enabling us to render cities with thousands of buildings at up to 100 frames per second, even with the camera moving at supersonic speed. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.
GPU Optimizations for a Production Molecular Docking Code*
Landaverde, Raphael; Herbordt, Martin C.
2015-01-01
Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users. PMID:26594667
GPU Optimizations for a Production Molecular Docking Code.
Landaverde, Raphael; Herbordt, Martin C
2014-09-01
Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening
Maureira-Fredes, Cristián; Amaro-Seoane, Pau
2018-01-01
A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct summation of N gravitational forces is a complex problem with no analytical solution and can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method. With different numerical techniques and special-purpose hardware, it can be used to speed up the calculations. But these methods tend to be computationally slow and cumbersome to work with. We present a new graphics processing unit (GPU) direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY has high modularity, allowing users to readily introduce new physics; it exploits available computational resources and will be maintained by regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 as compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.
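To make the term "direct summation with softening" concrete, the following is a minimal sequential sketch of the O(N^2) force loop that such integrators parallelize. It assumes Plummer-type softening, a_i = G * sum_j m_j r_ij / (|r_ij|^2 + eps^2)^(3/2); the abstract does not state which softening form GRAVIDY uses, and the function name `softened_accelerations` and its parameters are hypothetical. The Hermite time integration and the relativistic corrections are not shown.

```python
def softened_accelerations(positions, masses, eps=1e-2, G=1.0):
    """Acceleration on every particle by direct O(N^2) summation.

    Assumes Plummer softening (an illustrative choice, not taken from the
    abstract): each pair term is divided by (r^2 + eps^2)^(3/2) instead of r^3,
    which keeps the force finite when two particles come very close.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            inv_r3 = (dx * dx + dy * dy + dz * dz + eps * eps) ** -1.5
            scale = G * masses[j] * inv_r3
            acc[i][0] += scale * dx
            acc[i][1] += scale * dy
            acc[i][2] += scale * dz
    return acc
```

On a GPU the outer loop over i is the natural unit of parallel work, since each particle's sum is independent of the others.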
Fast GPU-based spot extraction for energy-dispersive X-ray Laue diffraction
International Nuclear Information System (INIS)
Alghabi, F.; Schipper, U.; Kolb, A.; Send, S.; Abboud, A.; Pashniak, N.; Pietsch, U.
2014-01-01
This paper describes a novel method for fast online analysis of X-ray Laue spots taken by means of an energy-dispersive X-ray 2D detector. Current pnCCD detectors typically operate at some 100 Hz (up to a maximum of 400 Hz) and have a resolution of 384 × 384 pixels; future devices head for even higher pixel counts and frame rates. The proposed online data analysis is based on a computer utilizing multiple Graphics Processing Units (GPUs), which allow for fast and parallel data processing. Our multi-GPU based algorithm is compliant with the rules of stream-based data processing, for which GPUs are optimized. The paper's main contribution is therefore an alternative algorithm for the determination of spot positions and energies over the full sequence of pnCCD data frames. Furthermore, an improved background suppression algorithm is presented. The resulting system is able to process data at the maximum acquisition rate of 400 Hz. We present a detailed analysis of the spot positions and energies deduced from a prior (single-core) CPU-based and the novel GPU-based data processing, showing that the parallel computed results using the GPU implementation are at least of the same quality as prior CPU-based results. Furthermore, the GPU-based algorithm is able to speed up the data processing by a factor of 7 (in comparison to the single-core CPU-based algorithm), which effectively makes the detector system more suitable for online data processing.
GPU-accelerated depth map generation for X-ray simulations of complex CAD geometries
Grandin, Robert J.; Young, Gavin; Holland, Stephen D.; Krishnamurthy, Adarsh
2018-04-01
Interactive x-ray simulations of complex computer-aided design (CAD) models can provide valuable insights for better interpretation of the defect signatures such as porosity from x-ray CT images. Generating the depth map along a particular direction for the given CAD geometry is the most compute-intensive step in x-ray simulations. We have developed a GPU-accelerated method for real-time generation of depth maps of complex CAD geometries. We preprocess complex components designed using commercial CAD systems using a custom CAD module and convert them into a fine user-defined surface tessellation. Our CAD module can be used by different simulators as well as handle complex geometries, including those that arise from complex castings and composite structures. We then make use of a parallel algorithm that runs on a graphics processing unit (GPU) to convert the finely-tessellated CAD model to a voxelized representation. The voxelized representation can enable heterogeneous modeling of the volume enclosed by the CAD model by assigning heterogeneous material properties in specific regions. The depth maps are generated from this voxelized representation with the help of a GPU-accelerated ray-casting algorithm. The GPU-accelerated ray-casting method enables interactive (> 60 frames-per-second) generation of the depth maps of complex CAD geometries. This enables arbitrary rotation and slicing of the CAD model, leading to better interpretation of the x-ray images by the user. In addition, the depth maps can be used to aid directly in CT reconstruction algorithms.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using synthetic data. Experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screening tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
A GPU accelerated PDF transparency engine
Recker, John; Lin, I.-Jong; Tastl, Ingeborg
2011-01-01
As commercial printing presses become faster, cheaper and more efficient, so too must the Raster Image Processors (RIP) that prepare data for them to print. Digital press RIPs, however, have been challenged to on the one hand meet the ever increasing print performance of the latest digital presses, and on the other hand process increasingly complex documents with transparent layers and embedded ICC profiles. This paper explores the challenges encountered when implementing a GPU accelerated driver for the open source Ghostscript Adobe PostScript and PDF language interpreter targeted at accelerating PDF transparency for high speed commercial presses. It further describes our solution, including an image memory manager for tiling input and output images and documents, a PDF compatible multiple image layer blending engine, and a GPU accelerated ICC v4 compatible color transformation engine. The result, we believe, is the foundation for a scalable, efficient, distributed RIP system that can meet current and future RIP requirements for a wide range of commercial digital presses.
GPU Linear algebra extensions for GNU/Octave
International Nuclear Information System (INIS)
Bosi, L B; Mariotti, M; Santocchia, A
2012-01-01
Octave is one of the most widely used open source tools for numerical analysis and linear algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector-vector, vector-matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program GPU kernels, which are bound within the GNU/Octave framework. We report the project implementation design and some preliminary results about performance.
Parallelization and checkpointing of GPU applications through program transformation
Energy Technology Data Exchange (ETDEWEB)
Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for running on GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
High Speed 3D Tomography on CPU, GPU, and FPGA
Directory of Open Access Journals (Sweden)
Dominique Houzet
2009-02-01
Back-projection (BP) is a costly computational step in tomography image reconstruction such as positron emission tomography (PET). To reduce the computation time, this paper presents a pipelined, prefetch, and parallelized architecture for PET BP (3PA-PET). The key feature of this architecture is its original memory access strategy, masking the high latency of the external memory. Indeed, the pattern of the memory references to the data acquired hinders the processing unit. The memory access bottleneck is overcome by an efficient use of the intrinsic temporal and spatial locality of the BP algorithm. A loop reordering allows an efficient use of general purpose processor's caches, for software implementation, as well as the 3D predictive and adaptive cache (3D-AP cache), when considering hardware implementations. Parallel hardware pipelines are also efficient thanks to a hierarchical 3D-AP cache: each pipeline performs a memory reference in about one clock cycle to reach a computational throughput close to 100%. The 3PA-PET architecture is prototyped on a system on programmable chip (SoPC) to validate the system and to measure its expected performances. Time performances are compared with a desktop PC, a workstation, and a graphic processor unit (GPU).
High Speed 3D Tomography on CPU, GPU, and FPGA
Directory of Open Access Journals (Sweden)
GAC Nicolas
2008-01-01
Back-projection (BP) is a costly computational step in tomography image reconstruction such as positron emission tomography (PET). To reduce the computation time, this paper presents a pipelined, prefetch, and parallelized architecture for PET BP (3PA-PET). The key feature of this architecture is its original memory access strategy, masking the high latency of the external memory. Indeed, the pattern of the memory references to the data acquired hinders the processing unit. The memory access bottleneck is overcome by an efficient use of the intrinsic temporal and spatial locality of the BP algorithm. A loop reordering allows an efficient use of general purpose processor's caches, for software implementation, as well as the 3D predictive and adaptive cache (3D-AP cache), when considering hardware implementations. Parallel hardware pipelines are also efficient thanks to a hierarchical 3D-AP cache: each pipeline performs a memory reference in about one clock cycle to reach a computational throughput close to 100%. The 3PA-PET architecture is prototyped on a system on programmable chip (SoPC) to validate the system and to measure its expected performances. Time performances are compared with a desktop PC, a workstation, and a graphic processor unit (GPU).
Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S
2014-03-11
Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
An analytic linear accelerator source model for GPU-based Monte Carlo dose calculations
Tian, Zhen; Li, Yongbao; Folkerts, Michael; Shi, Feng; Jiang, Steve B.; Jia, Xun
2015-10-01
Recently, there has been a lot of research interest in developing fast Monte Carlo (MC) dose calculation methods on graphics processing unit (GPU) platforms. A good linear accelerator (linac) source model is critical for both accuracy and efficiency considerations. In principle, an analytical source model should be preferable for GPU-based MC dose engines to a phase-space file-based model, in that data loading and CPU-GPU data transfer can be avoided. In this paper, we presented an analytical field-independent source model specifically developed for GPU-based MC dose calculations, associated with a GPU-friendly sampling scheme. A key concept called phase-space-ring (PSR) was proposed. Each PSR contained a group of particles that were of the same type, close in energy and resided in a narrow ring on the phase-space plane located just above the upper jaws. The model parameterized the probability densities of particle location, direction and energy for each primary photon PSR, scattered photon PSR and electron PSR. Models of one 2D Gaussian distribution or multiple Gaussian components were employed to represent the particle direction distributions of these PSRs. A method was developed to analyze a reference phase-space file and derive corresponding model parameters. To efficiently use our model in MC dose calculations on GPU, we proposed a GPU-friendly sampling strategy, which ensured that the particles sampled and transported simultaneously are of the same type and close in energy to alleviate GPU thread divergences. To test the accuracy of our model, dose distributions of a set of open fields in a water phantom were calculated using our source model and compared to those calculated using the reference phase-space files. For the high dose gradient regions, the average distance-to-agreement (DTA) was within 1 mm and the maximum DTA within 2 mm. For relatively low dose gradient regions, the root-mean-square (RMS) dose difference was within 1.1% and the maximum
An analytic linear accelerator source model for GPU-based Monte Carlo dose calculations.
Tian, Zhen; Li, Yongbao; Folkerts, Michael; Shi, Feng; Jiang, Steve B; Jia, Xun
2015-10-21
Recently, there has been a lot of research interest in developing fast Monte Carlo (MC) dose calculation methods on graphics processing unit (GPU) platforms. A good linear accelerator (linac) source model is critical for both accuracy and efficiency considerations. In principle, an analytical source model should be preferable for GPU-based MC dose engines to a phase-space file-based model, in that data loading and CPU-GPU data transfer can be avoided. In this paper, we presented an analytical field-independent source model specifically developed for GPU-based MC dose calculations, associated with a GPU-friendly sampling scheme. A key concept called phase-space-ring (PSR) was proposed. Each PSR contained a group of particles that were of the same type, close in energy and resided in a narrow ring on the phase-space plane located just above the upper jaws. The model parameterized the probability densities of particle location, direction and energy for each primary photon PSR, scattered photon PSR and electron PSR. Models of one 2D Gaussian distribution or multiple Gaussian components were employed to represent the particle direction distributions of these PSRs. A method was developed to analyze a reference phase-space file and derive corresponding model parameters. To efficiently use our model in MC dose calculations on GPU, we proposed a GPU-friendly sampling strategy, which ensured that the particles sampled and transported simultaneously are of the same type and close in energy to alleviate GPU thread divergences. To test the accuracy of our model, dose distributions of a set of open fields in a water phantom were calculated using our source model and compared to those calculated using the reference phase-space files. For the high dose gradient regions, the average distance-to-agreement (DTA) was within 1 mm and the maximum DTA within 2 mm. For relatively low dose gradient regions, the root-mean-square (RMS) dose difference was within 1.1% and the maximum
Local alignment tool based on Hadoop framework and GPU architecture.
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biology data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results show that the proposed method can improve the performance of BLASTP on a single GPU, and that it can also achieve high availability and fault tolerance.
GPU credit reduced, tie to TMI-1 cheating discounted
International Nuclear Information System (INIS)
Utroska, D.
1981-01-01
The recent reduction of credit available to General Public Utilities (GPU) Nuclear may be linked to a cheating incident involving two reactor operators at the Three Mile Island-1 (TMI-1) reactor. The incident caused the Nuclear Regulatory Commission to reopen the managerial portion of the restart hearings and may delay the restart. The delay and the lower credit line will worsen GPU's financial position. Banks claim that misgivings about TMI-1 influence them more than the cheating, although GPU had been gradually improving its financial situation since the TMI-2 accident. The new agreement gives GPU $150 million in immediate credit, but lowers the interim ceiling from $292 million to $200 million. A spokesman from the Office of Management and Budget acknowledges that administration plans to limit the federal role to research and development softened under political pressure.
Synergia CUDA: GPU-accelerated accelerator modeling package
International Nuclear Information System (INIS)
Lu, Q; Amundson, J
2014-01-01
Synergia is a parallel, 3-dimensional space-charge particle-in-cell accelerator modeling code. We present our work porting the purely MPI-based version of the code to a hybrid of CPU and GPU computing kernels. The hybrid code uses the CUDA platform in the same framework as the pure MPI solution. We have implemented a lock-free collaborative charge-deposition algorithm for the GPU, as well as other optimizations, including local communication avoidance for GPUs, a customized FFT, and fine-tuned memory access patterns. On a small GPU cluster (up to 4 Tesla C1070 GPUs), our benchmarks exhibit both superior peak performance and better scaling than a CPU cluster with 16 nodes and 128 cores. We also compare the code performance on different GPU architectures, including C1070 Tesla and K20 Kepler.
Local Alignment Tool Based on Hadoop Framework and GPU Architecture
Directory of Open Access Journals (Sweden)
Che-Lun Hung
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biology data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results show that the proposed method can improve the performance of BLASTP on a single GPU, and that it can also achieve high availability and fault tolerance.
The performances of R GPU implementations of the GMRES method
Directory of Open Access Journals (Sweden)
Bogdan Oancea
2018-03-01
Although the performance of commodity computers has improved drastically with the introduction of multicore processors and GPU computing, the standard R distribution is still based on a single-threaded model of computation, using only a small fraction of the computational power now available in most desktops and laptops. Modern statistical software packages rely on high performance implementations of the linear algebra routines that are at the core of several important leading edge statistical methods. In this paper we present a GPU implementation of the GMRES iterative method for solving linear systems. We compare the performance of this implementation with a pure single-threaded version on the CPU. We also investigate the performance of our implementation using different GPU packages now available for R, such as gmatrix, gputools or gpuR, which are based on the CUDA or OpenCL frameworks.
GPU-based large-scale visualization
Hadwiger, Markus
2013-11-19
Recent advances in image and volume acquisition as well as computational advances in simulation have led to an explosion of the amount of data that must be visualized and analyzed. Modern techniques combine the parallel processing power of GPUs with out-of-core methods and data streaming to enable the interactive visualization of giga- and terabytes of image and volume data. A major enabler for interactivity is making both the computational and the visualization effort proportional to the amount of data that is actually visible on screen, decoupling it from the full data size. This leads to powerful display-aware multi-resolution techniques that enable the visualization of data of almost arbitrary size. The course consists of two major parts: An introductory part that progresses from fundamentals to modern techniques, and a more advanced part that discusses details of ray-guided volume rendering, novel data structures for display-aware visualization and processing, and the remote visualization of large online data collections. You will learn how to develop efficient GPU data structures and large-scale visualizations, implement out-of-core strategies and concepts such as virtual texturing that have only been employed recently, as well as how to use modern multi-resolution representations. These approaches reduce the GPU memory requirements of extremely large data to a working set size that fits into current GPUs. You will learn how to perform ray-casting of volume data of almost arbitrary size and how to render and process gigapixel images using scalable, display-aware techniques. We will describe custom virtual texturing architectures as well as recent hardware developments in this area. We will also describe client/server systems for distributed visualization, on-demand data processing and streaming, and remote visualization. We will describe implementations using OpenGL as well as CUDA, exploiting parallelism on GPUs combined with additional asynchronous
Real-Time Incompressible Fluid Simulation on the GPU
Directory of Open Access Journals (Sweden)
Xiao Nie
2015-01-01
We present a parallel framework for simulating incompressible fluids with predictive-corrective incompressible smoothed particle hydrodynamics (PCISPH) on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of state-of-the-art GPUs. In PCISPH-based simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this time-consuming step is introduced. Moreover, we discuss several optimization techniques, including using fast on-chip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of real-time fluid simulation is significantly improved, since our method enforces the incompressibility constraint, which is typically ignored for efficiency reasons in previous GPU-based SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluid in real time and results in a speed-up factor of up to 23 on a high-end NVIDIA GPU in comparison to a single-threaded CPU-based implementation.
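The sorting-based neighbor search mentioned above can be illustrated with a small sequential sketch. It assumes a uniform grid whose cell size equals the search radius h, with particles sorted by cell so that each cell occupies one contiguous range; this is a common approach for SPH codes in general, not necessarily the exact method of the paper, and the names `build_sorted_grid` and `neighbors` are hypothetical.

```python
def cell_of(p, h):
    """Integer grid cell containing point p for cell size h."""
    return (int(p[0] // h), int(p[1] // h), int(p[2] // h))

def build_sorted_grid(positions, h):
    """Sort particle indices by cell; return (order, ranges).

    ranges maps a cell to the half-open slice [start, stop) of `order`
    holding that cell's particles, so a cell lookup is one contiguous read.
    """
    order = sorted(range(len(positions)), key=lambda i: cell_of(positions[i], h))
    ranges = {}
    for slot, i in enumerate(order):
        c = cell_of(positions[i], h)
        start, _ = ranges.get(c, (slot, slot))
        ranges[c] = (start, slot + 1)
    return order, ranges

def neighbors(i, positions, h, order, ranges):
    """Indices of all particles within distance h of particle i."""
    px, py, pz = positions[i]
    cx, cy, cz = cell_of(positions[i], h)
    found = []
    # Only the 27 cells around particle i can contain neighbors within h.
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                rng = ranges.get((cx + dx, cy + dy, cz + dz))
                if rng is None:
                    continue
                for slot in range(rng[0], rng[1]):
                    j = order[slot]
                    qx, qy, qz = positions[j]
                    d2 = (qx - px) ** 2 + (qy - py) ** 2 + (qz - pz) ** 2
                    if j != i and d2 <= h * h:
                        found.append(j)
    return sorted(found)
```

Because the grid must be rebuilt whenever particles move, the sort is the step a GPU implementation would parallelize; the per-particle queries are independent of one another.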
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun
2015-10-07
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogeneous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.
Directory of Open Access Journals (Sweden)
Ahmed Shamsul Arefin
BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data are craving for extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Being very simple and straightforward, the performance of the kNN search degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool are available under GNU Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.
Arefin, Ahmed Shamsul; Riveros, Carlos; Berretta, Regina; Moscato, Pablo
2012-01-01
The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data demand extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limited internal memory of the device poses a new problem of scalability. Efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although the kNN search is simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA-enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with a CPU implementation on a well-known breast microarray study and its associated data sets. Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool are available under the GNU Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
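The partitioning idea in the abstract above can be illustrated with a minimal CPU-side sketch. It is not the GPU-FS-kNN implementation: the function name `knn_chunked`, the `chunk_size` parameter and the use of Euclidean distance are assumptions made for illustration. Reference points are visited one fixed-size chunk at a time, so only one chunk of distances is needed at once, and a bounded heap per query keeps the running top-k.

```python
import heapq
import math


def knn_chunked(points, k, chunk_size):
    """Brute-force kNN that visits the reference points chunk by chunk.

    Hypothetical helper for illustration only. Returns, for each query
    point, a list of (distance, index) pairs sorted by distance.
    """
    n = len(points)
    # One bounded max-heap per query; entries are (-distance, index).
    heaps = [[] for _ in range(n)]
    for start in range(0, n, chunk_size):
        chunk = points[start:start + chunk_size]  # the only block held "on device"
        for qi, q in enumerate(points):
            heap = heaps[qi]
            for offset, p in enumerate(chunk):
                pi = start + offset
                if pi == qi:
                    continue  # a point is not its own neighbour
                d = math.dist(q, p)
                if len(heap) < k:
                    heapq.heappush(heap, (-d, pi))
                elif d < -heap[0][0]:
                    # Closer than the current k-th neighbour: replace it.
                    heapq.heapreplace(heap, (-d, pi))
    return [sorted((-nd, i) for nd, i in heap) for heap in heaps]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
    for row in knn_chunked(pts, k=1, chunk_size=2):
        print(row)
```

Because the chunks are merged through the per-query heaps, the result does not depend on `chunk_size`; the chunk size only bounds how much data is processed at a time.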
GPU Acceleration of Particle-In-Cell Methods
Cowan, Benjamin; Averkin, Sergey; Cary, John; Leddy, Jarrod; Sides, Scott; Werner, Gregory
2017-10-01
Graphics processing units (GPUs) have become key components in many supercomputing systems, as they can provide more computations relative to their cost and power consumption than conventional processors. However, to take full advantage of this capability, they require a strict programming model which involves single-instruction multiple-data execution as well as significant constraints on memory access. To bring the full power of GPUs to bear on plasma physics problems, we must adapt the computational methods to this new programming model. We have developed a GPU implementation of the particle-in-cell (PIC) method, one of the mainstays of plasma physics simulation. This framework is highly general and enables advanced PIC features such as high order particles and absorbing boundary conditions. The main elements of the PIC loop, including field interpolation and particle deposition, are designed to optimize memory access. We describe recent progress in these algorithms, including arbitrary grid types and multiple GPUs per node. Work supported by DARPA Contract No. W31P4Q-16-C-0009.
GPU accelerated hough transform for high level trigger application
International Nuclear Information System (INIS)
Mohanty, S.; Hristov, P.
2012-01-01
Charged particle trajectory recognition is a very complex and time-consuming process, particularly when the multiplicity becomes very large during heavy ion collisions at relativistic energies. Since not all events are of interest to record, it is essential to have a fast algorithm to select the desired events through online triggering, which is known as the high level trigger (HLT). ALICE (A Large Ion Collider Experiment) is a general purpose detector used at the CERN LHC to study nucleus-nucleus and proton-proton collisions at different centre of mass energies. The task of the ALICE high level trigger (HLT) is to select the events of interest using various fast track reconstruction algorithms. A fast track-reconstruction algorithm for the time projection chamber of the ALICE detector, based on the linear Hough Transform (HT), has been discussed. In this work, a similar implementation of HT is done on a graphics processing unit (GPU), using both circular and linear HT with simulated data.
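For readers unfamiliar with the linear Hough Transform mentioned above, the following is a minimal serial sketch of the voting step. It is not the HLT code: the function names, the 1-degree angle step and the unit-width rho bins are assumptions chosen for illustration. Each hit votes for every line, parameterised as rho = x cos(theta) + y sin(theta), that could pass through it; collinear hits pile up in one accumulator cell.

```python
import math
from collections import Counter


def hough_votes(points, theta_step_deg=1, rho_step=1.0):
    """Accumulate votes in (theta in degrees, rho bin) space.

    Hypothetical helper: every point votes once per sampled angle.
    """
    votes = Counter()
    for x, y in points:
        for t in range(0, 180, theta_step_deg):
            theta = math.radians(t)
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    return votes


def strongest_line(points):
    """Return (theta_deg, rho_bin, vote_count) of the most voted cell."""
    (t, rho_bin), count = hough_votes(points).most_common(1)[0]
    return t, rho_bin, count


if __name__ == "__main__":
    # Ten hits on the horizontal line y = 5.
    hits = [(10.0 * i, 5.0) for i in range(10)]
    print(strongest_line(hits))
```

The votes of different points are independent of one another, which is what makes the accumulation step a natural candidate for a GPU; a parallel version would need atomic increments or per-thread accumulators, which this sketch does not show.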
GALARIO: a GPU Accelerated Library for Analysing Radio Interferometer Observations
Tazzari, Marco; Beaujean, Frederik; Testi, Leonardo
2018-02-01
We present GALARIO, a computational library that exploits the power of modern graphical processing units (GPUs) to accelerate the analysis of observations from radio interferometers like ALMA or the VLA. GALARIO speeds up the computation of synthetic visibilities from a generic 2D model image or a radial brightness profile (for axisymmetric sources). On a GPU, GALARIO is 150 times faster than standard Python and 10 times faster than serial C++ code on a CPU. Highly modular, easy to use and to adopt in existing code, GALARIO comes as two compiled libraries, one for Nvidia GPUs and one for multicore CPUs, where both have the same functions with identical interfaces. GALARIO comes with Python bindings but can also be directly used in C or C++. The versatility and the speed of GALARIO open new analysis pathways that otherwise would be prohibitively time consuming, e.g. fitting high resolution observations of a large number of objects, or entire spectral cubes of molecular gas emission. It is a general tool that can be applied to any field that uses radio interferometer observations. The source code is available online at http://github.com/mtazzari/galario under the open source GNU Lesser General Public License v3.
GPU-based Branchless Distance-Driven Projection and Backprojection
Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
2017-01-01
Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of this branch behavior makes it inefficient to implement on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behavior can be eliminated by factorizing the DD operation into three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved a 137-fold speedup for forward projection and a 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. The GPU-based branchless DD method was evaluated with iterative reconstruction algorithms on both simulated and real datasets. It produced images visually identical to those of the CPU reference algorithm. PMID: 29333480
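The three-step factorization named in the abstract (integration, linear interpolation, differentiation) can be sketched in one dimension. This is an illustrative reduction, not the authors' 3D cone-beam code: the function names are hypothetical, the signal is assumed piecewise constant on the source bins, and the output is normalised by destination bin width, which is one possible convention. In plain Python the interpolation helper still contains comparisons; on a GPU that step would be delegated to hardware texture interpolation.

```python
from bisect import bisect_right


def cumulative_integral(values, edges):
    """Step 1: integral of a piecewise-constant signal at every source edge."""
    total = [0.0]
    for v, lo, hi in zip(values, edges[:-1], edges[1:]):
        total.append(total[-1] + v * (hi - lo))
    return total


def interp_linear(x, xs, ys):
    """Step 2: linear interpolation of (xs, ys) at x, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def resample_dd(values, src_edges, dst_edges):
    """Resample source bins onto destination bins via the three steps."""
    integral = cumulative_integral(values, src_edges)
    at_dst = [interp_linear(e, src_edges, integral) for e in dst_edges]
    # Step 3: differentiate between neighbouring destination edges.
    return [
        (hi_val - lo_val) / (hi - lo)
        for lo_val, hi_val, lo, hi in zip(at_dst[:-1], at_dst[1:], dst_edges[:-1], dst_edges[1:])
    ]


if __name__ == "__main__":
    print(resample_dd([1.0, 3.0], [0.0, 1.0, 2.0], [0.0, 0.5, 1.5, 2.0]))
```

Each destination bin is computed from two interpolated samples and one subtraction, with no loop over overlapping source bins, which is why the work per output element is uniform.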
Fast magnetic field computation in fusion technology using GPU technology
Energy Technology Data Exchange (ETDEWEB)
Chiariello, Andrea Gaetano [Ass. EURATOM/ENEA/CREATE, Dipartimento di Ingegneria Industriale e dell’Informazione, Seconda Università di Napoli, Via Roma 29, Aversa (CE) (Italy); Formisano, Alessandro, E-mail: Alessandro.Formisano@unina2.it [Ass. EURATOM/ENEA/CREATE, Dipartimento di Ingegneria Industriale e dell’Informazione, Seconda Università di Napoli, Via Roma 29, Aversa (CE) (Italy); Martone, Raffaele [Ass. EURATOM/ENEA/CREATE, Dipartimento di Ingegneria Industriale e dell’Informazione, Seconda Università di Napoli, Via Roma 29, Aversa (CE) (Italy)
2013-10-15
Highlights: ► The paper deals with high accuracy numerical simulations of high field magnets. ► Porting existing codes to High Performance Computing architectures yielded a relevant speedup without reducing computational accuracy. ► Some examples of applications, referring to ITER-like magnets, are reported. -- Abstract: One of the main issues in the simulation of Tokamak operation is the reliable and accurate computation of actual field maps in the plasma chamber. In this paper a tool able to accurately compute magnetic field maps produced by active coils of any 3D shape, wound with a high number of conductors, is presented. Under a linearity assumption, the coil winding is modeled by means of “sticks”, following each conductor's shape, and the contribution of each stick is computed using high speed Graphic Computing Units (GPUs). Relevant speed enhancements with respect to a standard parallel computing environment are achieved in this way.
Accelerated finite element elastodynamic simulations using the GPU
International Nuclear Information System (INIS)
Huthwaite, Peter
2014-01-01
An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source
GPU-enabled projectile guidance for impact area constraints
Rogers, Jonathan
2013-05-01
Guided projectile engagement scenarios often involve impact area constraints, in which it may be less desirable to incur miss distance on one side of a target or within a specified boundary near the target area. Current projectile guidance schemes such as impact point predictors cannot handle these constraints within the guidance loop, and may produce dispersion patterns that are insensitive to these constraints. In this paper, a new projectile guidance law is proposed that leverages real-time Monte Carlo impact point prediction to continually evaluate the probability of violating impact area constraints. The desired aim point is then adjusted accordingly. Real-time Monte Carlo simulation is enabled within the feedback loop through use of graphics processing units (GPU's), which provide parallel pipelines through which a dispersion pattern can routinely be predicted. The result is a guidance law that can achieve minimum miss distance while avoiding impact area constraints. The new guidance law is described and formulated as a nonlinear optimization problem which is solved in real-time through massively-parallel Monte Carlo simulation. An example simulation is shown in which impact area constraints are enforced and the methodology of stochastic guidance is demonstrated. Finally, Monte Carlo simulations are shown which demonstrate the ability of the stochastic guidance scheme to avoid an arbitrary set of impact area constraints, generating an impact probability density function that optimally trades miss distance within the restricted impact area. The proposed guidance scheme has applications beyond smart weapons to include missiles, UAV's, and other autonomous systems.
Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture
Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek
2015-01-01
This paper evaluates the potential of the embedded Graphics Processing Unit in NVIDIA's Tegra K1 for onboard processing. The performance is compared to a general-purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) algorithm. Tegra K1 achieved 51% for the ACCA algorithm and 20% for the dimension reduction algorithm, as compared to the performance of a high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
cellGPU: Massively parallel simulations of dynamic vertex models
Sussman, Daniel M.
2017-10-01
Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
A survey and measurement study of GPU DVFS on energy conservation
Directory of Open Access Journals (Sweden)
Xinxin Mei
2017-05-01
Energy efficiency has become one of the top design criteria for current computing systems. Dynamic voltage and frequency scaling (DVFS) has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while GPU DVFS is still at an early stage. This paper aims at exploring the impact of GPU DVFS on application performance and power consumption, and furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations, and then summarize recent research works on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends not only on the GPU architecture, but also on the characteristics of the GPU applications.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU.
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.
Accelerating large-scale phase-field simulations with GPU
Directory of Open Access Journals (Sweden)
Xiaoming Shi
2017-10-01
A new package for accelerating large-scale phase-field simulations was developed using GPUs, based on the semi-implicit Fourier method. The package can solve a variety of equilibrium equations with different inhomogeneity, including long-range elastic, magnetostatic, and electrostatic interactions. Using a specific algorithm in Compute Unified Device Architecture (CUDA), the Fourier spectral iterative perturbation method was integrated into the GPU package. The Allen-Cahn equation, the Cahn-Hilliard equation, and a phase-field model with long-range interaction were each solved with the algorithm running on the GPU to test the performance of the package. A comparison of the calculation results between the solver executed on a single CPU and the one on the GPU showed that the GPU version is 50 times faster. The present study therefore contributes to the acceleration of large-scale phase-field simulations and provides guidance for experiments to design large-scale functional devices.
Methods for pretreating biomass
Balan, Venkatesh; Dale, Bruce E; Chundawat, Shishir; Sousa, Leonardo
2017-05-09
A method for pretreating biomass is provided, which includes, in a reactor, allowing gaseous ammonia to condense on the biomass and react with water present in the biomass to produce pretreated biomass, wherein reactivity of polysaccharides in the biomass is increased during subsequent biological conversion as compared to the reactivity of polysaccharides in biomass which has not been pretreated. A method for pretreating biomass with a liquid ammonia and recovering the liquid ammonia is also provided. Related systems which include a biochemical or biofuel production facility are also disclosed.
Linear Bregman algorithm implemented in parallel GPU
Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping
2015-08-01
At present, most compressed sensing (CS) algorithms converge slowly and are therefore difficult to run on a PC. To deal with this issue, we use a parallel GPU to implement a widely used compressed sensing algorithm, the linear Bregman algorithm. The linear iterative Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm involves only matrix-vector multiplication and a thresholding operation, and is simpler and more efficient to program. We use C as the development language and adopt CUDA (Compute Unified Device Architecture) as the parallel computing architecture. In this paper, we compare the parallel Bregman algorithm with the traditional CPU implementation of the Bregman algorithm. In addition, we compare the parallel Bregman algorithm with other CS reconstruction algorithms, such as the OMP and TwIST algorithms. The results show that the parallel Bregman algorithm needs less time than these two algorithms, and is thus more convenient for real-time object reconstruction, which is important given the fast-growing demand for information technology.
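The remark above that the algorithm needs only matrix-vector products and a thresholding step can be made concrete with a small serial sketch. It uses the commonly cited form of the linearized Bregman iteration, v <- v + A^T(b - Au), u <- delta * shrink(v, mu); whether the paper uses exactly this update is an assumption, and the function names and parameter values are hypothetical.

```python
import math


def shrink(v, mu):
    """Soft thresholding: move every entry towards zero by mu."""
    return [math.copysign(max(abs(x) - mu, 0.0), x) for x in v]


def linearized_bregman(A, b, mu, delta, iters):
    """Iterate v <- v + A^T (b - A u), u <- delta * shrink(v, mu).

    A is a list of rows. Illustrative sketch only; delta is assumed small
    enough relative to the norm of A for the iteration to converge.
    """
    m, n = len(A), len(A[0])
    u = [0.0] * n
    v = [0.0] * n
    for _ in range(iters):
        # Residual r = b - A u (one matrix-vector product).
        r = [b[i] - sum(A[i][j] * u[j] for j in range(n)) for i in range(m)]
        # v accumulates A^T r (a second matrix-vector product).
        v = [v[j] + sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Element-wise thresholding and scaling.
        u = [delta * s for s in shrink(v, mu)]
    return u


if __name__ == "__main__":
    A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    b = [3.0, -2.0]
    print(linearized_bregman(A, b, mu=1.0, delta=1.0, iters=5))
```

Every line inside the loop is either a matrix-vector product or an element-wise map, so each can be written as one data-parallel kernel, which matches the abstract's argument for a GPU port.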
A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions
Liang, Yihao; Xing, Xiangjun; Li, Yaohang
2017-06-01
In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures, and adopts the sequential updating scheme of Metropolis algorithm. It makes no approximation in the computation of energy, and reaches a remarkable 440-fold speedup, compared with the serial implementation on CPU. We further use this method to simulate primitive model electrolytes, and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
Energy Technology Data Exchange (ETDEWEB)
Sharp, G C [Department of Radiation Oncology, Massachusetts General Hospital, Boston, MA 02114 (United States); Kandasamy, N [Electrical and Computer Engineering Department, Drexel University, Philadelphia, PA 19104 (United States); Singh, H [Electrical and Computer Engineering Department, Drexel University, Philadelphia, PA 19104 (United States); Folkert, M [Department of Radiation Oncology, Massachusetts General Hospital, Boston, MA 02114 (United States)
2007-09-21
This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup: up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
Carbonic Acid Pretreatment of Biomass
Energy Technology Data Exchange (ETDEWEB)
G. Peter van Walsum; Kemantha Jayawardhana; Damon Yourchisin; Robert McWilliams; Vanessa Castleberry
2003-05-31
This project sought to address six objectives, outlined below. The objectives were met through the completion of ten tasks. 1) Solidify the theoretical understanding of the binary CO2/H2O system at reaction temperatures and pressures. The thermodynamics of pH prediction have been improved to include a more rigorous treatment of non-ideal gas phases. However it was found that experimental attempts to confirm theoretical pH predictions were still off by a factor of about 1.8 pH units. Arrhenius experiments were carried out and the activation energy for carbonic acid appears to be substantially similar to sulfuric acid. Titration experiments have not yet confirmed or quantified the buffering or acid suppression effects of carbonic acid on biomass. 2) Modify the carbonic acid pretreatment severity function to include the effect of endogenous acid formation and carbonate buffering, if necessary. It was found that the existing severity functions serve adequately to account for endogenous acid production and carbonate effects. 3) Quantify the production of soluble carbohydrates at different reaction conditions and severity. Results show that carbonic acid has little effect on increasing soluble carbohydrate concentrations for pretreated aspen wood, compared to pretreatment with water alone. This appears to be connected to the release of endogenous acids by the substrate. A less acidic substrate such as corn stover would derive benefit from the use of carbonic acid. 4) Quantify the production of microbial inhibitors at selected reaction conditions and severity. It was found that the release of inhibitors was correlated to reaction severity and that carbonic acid did not appear to increase or decrease inhibition compared to pretreatment with water alone. 5) Assess the reactivity to enzymatic hydrolysis of material pretreated at selected reaction conditions and severity. Enzymatic hydrolysis rates increased with severity, but no advantage was detected for the use of carbonic
GPU accelerated study of heat transfer and fluid flow by lattice Boltzmann method on CUDA
Ren, Qinlong
The lattice Boltzmann method (LBM) has been developed over the past two decades as a powerful numerical approach to simulate complex fluid flow and heat transfer phenomena. As a mesoscale method based on kinetic theory, LBM has several advantages compared with traditional numerical methods, such as physical representation of microscopic interactions, handling of complex geometries, and a highly parallel nature. The lattice Boltzmann method has been applied to various fluid behaviors and heat transfer processes such as conjugate heat transfer, magnetic and electric fields, diffusion and mixing processes, chemical reactions, multiphase flow, phase change processes, non-isothermal flow in porous media, microfluidics, fluid-structure interactions in biological systems and so on. In addition, as a non-body-conformal grid method, the immersed boundary method (IBM) can be applied to handle complex or moving geometries in the domain. The immersed boundary method can be coupled with the lattice Boltzmann method to study heat transfer and fluid flow problems. Heat transfer and fluid flow are solved on Euler nodes by LBM while the complex solid geometries are captured by Lagrangian nodes using the immersed boundary method. Parallel computing has been a popular topic for many decades as a way to accelerate computation in engineering and scientific fields. Today, almost all laptops and desktops have central processing units (CPUs) with multiple cores which can be used for parallel computing. However, the cost of CPUs with hundreds of cores is still high, which limits high performance computing on personal computers. Graphics processing units (GPUs), originally used in computer video cards, have emerged in recent years as a powerful platform for high-performance computing on workstations. Unlike CPUs, GPUs with thousands of cores are inexpensive. For example, the GPU (GeForce GTX TITAN) which is used in the current work has 2688 cores and the price is only 1
A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling
Directory of Open Access Journals (Sweden)
Roger V Hoang
2013-10-01
Computational Neuroscience is an emerging field that provides unique opportunities to study complex brain structures through realistic neural simulations. However, as biological details are added to models, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs) are now being utilized to accelerate simulations due to their ability to perform computations in parallel. As such, they have shown significant improvement in execution time compared to Central Processing Units (CPUs). Most neural simulators utilize either multiple CPUs or a single GPU for better performance, but still show limitations in execution time when biological details are not sacrificed. Therefore, we present a novel CPU/GPU simulation environment for large-scale biological networks, the NeoCortical Simulator version 6 (NCS6). NCS6 is a free, open-source, parallelizable, and scalable simulator, designed to run on clusters of multiple machines, potentially with high performance computing devices in each of them. It has built-in leaky-integrate-and-fire (LIF) and Izhikevich (IZH) neuron models, but users also have the capability to design their own plug-in interface for different neuron types as desired. NCS6 is currently able to simulate one million cells and 100 million synapses in quasi real time by distributing data across these heterogeneous clusters of CPUs and GPUs.
Quantum.Ligand.Dock: protein-ligand docking with quantum entanglement refinement on a GPU system.
Kantardjiev, Alexander A
2012-07-01
Quantum.Ligand.Dock (protein-ligand docking with graphic processing unit (GPU) quantum entanglement refinement on a GPU system) is an original modern method for in silico prediction of protein-ligand interactions via high-performance docking code. The main flavour of our approach is a combination of fast search with a special account for overlooked physical interactions. On the one hand, we take care of self-consistency and proton equilibria mutual effects of docking partners. On the other hand, Quantum.Ligand.Dock is the only docking server offering such a subtle supplement to protein docking algorithms as quantum entanglement contributions. The motivation for development and proposition of the method to the community hinges upon two arguments: the fundamental importance of quantum entanglement contribution in molecular interaction and the realistic possibility to implement it by the availability of supercomputing power. The implementation of sophisticated quantum methods is made possible by parallelization at several bottlenecks on a GPU supercomputer. The high-performance implementation will be of use for large-scale virtual screening projects, structural bioinformatics, systems biology and fundamental research in understanding protein-ligand recognition. The design of the interface is focused on feasibility and ease of use. Protein and ligand molecule structures are supposed to be submitted as atomic coordinate files in PDB format. A customization section is offered for addition of user-specified charges, extra ionogenic groups with intrinsic pKa values or fixed ions. Final predicted complexes are ranked according to obtained scores and provided in PDB format as well as interactive visualization in a molecular viewer. Quantum.Ligand.Dock server can be accessed at http://87.116.85.141/LigandDock.html. PMID: 22669908
Quantum.Ligand.Dock: protein–ligand docking with quantum entanglement refinement on a GPU system
Kantardjiev, Alexander A.
2012-01-01
Quantum.Ligand.Dock (protein–ligand docking with graphic processing unit (GPU) quantum entanglement refinement on a GPU system) is an original modern method for in silico prediction of protein–ligand interactions via high-performance docking code. The main flavour of our approach is a combination of fast search with a special account for overlooked physical interactions. On the one hand, we take care of self-consistency and proton equilibria mutual effects of docking partners. On the other hand, Quantum.Ligand.Dock is the only docking server offering such a subtle supplement to protein docking algorithms as quantum entanglement contributions. The motivation for development and proposition of the method to the community hinges upon two arguments: the fundamental importance of quantum entanglement contribution in molecular interaction and the realistic possibility to implement it by the availability of supercomputing power. The implementation of sophisticated quantum methods is made possible by parallelization at several bottlenecks on a GPU supercomputer. The high-performance implementation will be of use for large-scale virtual screening projects, structural bioinformatics, systems biology and fundamental research in understanding protein–ligand recognition. The design of the interface is focused on feasibility and ease of use. Protein and ligand molecule structures are supposed to be submitted as atomic coordinate files in PDB format. A customization section is offered for addition of user-specified charges, extra ionogenic groups with intrinsic pKa values or fixed ions. Final predicted complexes are ranked according to obtained scores and provided in PDB format as well as interactive visualization in a molecular viewer. Quantum.Ligand.Dock server can be accessed at http://87.116.85.141/LigandDock.html. PMID:22669908
Discrete shearlet transform on GPU with applications in anomaly detection and denoising
Gibert, Xavier; Patel, Vishal M.; Labate, Demetrio; Chellappa, Rama
2014-12-01
Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this was exploited in a wide range of applications from image and signal processing. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPU) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel under different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more, compared to multicore CPU implementations.
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations.
Kutzner, Carsten; Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L; Grubmüller, Helmut
2015-10-05
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.
Hammitzsch, M.; Spazier, J.; Reißland, S.
2014-12-01
Usually, tsunami early warning and mitigation systems (TWS or TEWS) are based on several software components deployed in a client-server based infrastructure. The vast majority of systems importantly include desktop-based clients with a graphical user interface (GUI) for the operators in early warning centers. However, in times of cloud computing and ubiquitous computing the use of concepts and paradigms, introduced by continuously evolving approaches in information and communications technology (ICT), has to be considered even for early warning systems (EWS). Based on the experiences and the knowledge gained in three research projects - 'German Indonesian Tsunami Early Warning System' (GITEWS), 'Distant Early Warning System' (DEWS), and 'Collaborative, Complex, and Critical Decision-Support in Evolving Crises' (TRIDEC) - new technologies are exploited to implement a cloud-based and web-based prototype to open up new prospects for EWS. This prototype, named 'TRIDEC Cloud', merges several complementary external and in-house cloud-based services into one platform for automated background computation with graphics processing units (GPU), for web-mapping of hazard specific geospatial data, and for serving relevant functionality to handle, share, and communicate threat specific information in a collaborative and distributed environment. The prototype in its current version addresses tsunami early warning and mitigation. The integration of GPU accelerated tsunami simulation computations has been an integral part of this prototype to foster early warning with on-demand tsunami predictions based on actual source parameters. However, the platform is meant for researchers around the world to make use of the cloud-based GPU computation to analyze other types of geohazards and natural hazards and react upon the computed situation picture with a web-based GUI in a web browser at remote sites. The current website is an early alpha version for demonstration purposes to give the
GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications
International Nuclear Information System (INIS)
Lemaréchal, Yannick; Bert, Julien; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris; Falconnet, Claire; Després, Philippe; Valeri, Antoine
2015-01-01
In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled I-125 seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400 × 250 × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms per 10^6 simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications.
Su, Xiaoquan; Wang, Xuetao; Jing, Gongchao; Ning, Kang
2014-04-01
The number of microbial community samples is increasing with exponential speed. Data-mining among microbial community samples could facilitate the discovery of valuable biological information that is still hidden in the massive data. However, current methods for the comparison among microbial communities are limited by their ability to process large amount of samples each with complex community structure. We have developed an optimized GPU-based software, GPU-Meta-Storms, to efficiently measure the quantitative phylogenetic similarity among massive amount of microbial community samples. Our results have shown that GPU-Meta-Storms would be able to compute the pair-wise similarity scores for 10 240 samples within 20 min, which gained a speed-up of >17 000 times compared with single-core CPU, and >2600 times compared with 16-core CPU. Therefore, the high-performance of GPU-Meta-Storms could facilitate in-depth data mining among massive microbial community samples, and make the real-time analysis and monitoring of temporal or conditional changes for microbial communities possible. GPU-Meta-Storms is implemented by CUDA (Compute Unified Device Architecture) and C++. Source code is available at http://www.computationalbioenergy.org/meta-storms.html.
Rapid earthquake detection through GPU-Based template matching
Mu, Dawei; Lee, En-Jui; Chen, Po
2017-12-01
The template-matching algorithm (TMA) has been widely adopted for improving the reliability of earthquake detection. The TMA is based on calculating the normalized cross-correlation coefficient (NCC) between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments. In realistic applications, the computational cost of the TMA is much higher than that of traditional techniques. In this study, we provide an analysis of the TMA and show how the GPU architecture provides an almost ideal environment for accelerating the TMA and NCC-based pattern recognition algorithms in general. So far, our best-performing GPU code has achieved a speedup factor of more than 800 with respect to a common sequential CPU code. We demonstrate the performance of our GPU code using seismic waveform recordings from the ML 6.6 Meinong earthquake sequence in Taiwan.
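The normalized cross-correlation coefficient (NCC) at the core of the TMA can be sketched in a few lines of plain Python. This is a sequential illustration, not the authors' GPU code; the function names and the detection threshold of 0.9 are assumptions made for the example.

```python
import math


def ncc(template, window):
    """Normalized cross-correlation coefficient of two equal-length series."""
    n = len(template)
    mean_t = sum(template) / n
    mean_w = sum(window) / n
    num = sum((t - mean_t) * (w - mean_w) for t, w in zip(template, window))
    den = math.sqrt(sum((t - mean_t) ** 2 for t in template)
                    * sum((w - mean_w) ** 2 for w in window))
    # A flat window has zero variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0


def match_template(template, trace, threshold=0.9):
    """Slide the template along the trace and flag offsets with high NCC.

    The threshold is a hypothetical value chosen for illustration.
    """
    n = len(template)
    scores = [ncc(template, trace[k:k + n])
              for k in range(len(trace) - n + 1)]
    detections = [k for k, s in enumerate(scores) if s >= threshold]
    return scores, detections


if __name__ == "__main__":
    template = [0.0, 1.0, 0.0, -1.0, 0.0]
    trace = [0.0] * 3 + [0.0, 2.0, 0.0, -2.0, 0.0] + [0.0] * 3
    print(match_template(template, trace)[1])
```

Every offset (and every template/station pair) is scored independently, so the outer loop maps naturally onto GPU threads; that independence is what the abstract refers to as an almost ideal environment for acceleration.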
Pretreatment of microbial sludges
Rivard, Christopher J.; Nagle, Nicholas J.
1995-01-01
Methods are described for pretreating microbial sludges to break cells and disrupt organic matter. One method involves the use of sonication, and another method involves the use of shear forces. The pretreatment of sludge enhances bioconversion of the organic fraction. This allows for efficient dewatering of the sludge and reduces the cost for final disposal of the waste.
GPU accelerated likelihoods for stereo-based articulated tracking
DEFF Research Database (Denmark)
Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny
2010-01-01
For many years articulated tracking has been an active research topic in the computer vision community. While working solutions have been suggested, computational time is still problematic. We present a GPU implementation of a ray-casting based likelihood model that is orders of magnitude faster...... than a traditional CPU implementation. We explain the non-intuitive steps required to attain an optimized GPU implementation, where the dominant part is to hide the memory latency effectively. Benchmarks show that computations which previously required several minutes, are now performed in few seconds....
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
The skyline operator returns records in a dataset that provide optimal trade-offs of multiple dimensions. State-of-the-art skyline computation involves complex tree traversals, data-ordering, and conditional branching to minimize the number of point-to-point comparisons. Meanwhile, GPGPU computing...... offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses...
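For readers unfamiliar with the skyline operator, the following is a naive quadratic baseline in Python. It is deliberately not the GPU algorithm the abstract introduces; it only pins down the definition of dominance, under the assumption that smaller values are better in every dimension.

```python
def dominates(p, q):
    """True if p is no worse than q everywhere and strictly better somewhere.

    Assumes smaller values are preferable in every dimension.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def skyline(points):
    """Return the points that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical hotels as (price, distance to beach in km).
    hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2)]
    print(skyline(hotels))  # prints [(50, 8), (60, 3), (80, 1)]
```

The point-to-point dominance tests in the inner loop are the comparisons that state-of-the-art sequential algorithms try to avoid and that a GPU port can instead execute in bulk.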
Determinant Computation on the GPU using the Condensation Method
International Nuclear Information System (INIS)
Haque, Sardar Anisul; Maza, Marc Moreno
2012-01-01
We report on a GPU implementation of the condensation method designed by Abdelmalek Salem and Kouachi Said for computing the determinant of a matrix. We consider two types of coefficients: modular integers and floating point numbers. We evaluate the performance of our code by measuring its effective bandwidth and argue that it is numerically stable in the floating point number case. In addition, we compare our code with serial implementations of determinant computation from well-known mathematical packages. Our results suggest that a GPU implementation of the condensation method has a large potential for improving those packages in terms of running time and numerical stability.
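The idea behind determinant condensation can be illustrated with the classical Chio pivotal condensation, sketched below in Python with exact rational arithmetic. This is an assumption-laden stand-in: the Salem and Said variant used in the paper may differ in how it selects the pivot and arranges the condensed matrix, and the function name is hypothetical.

```python
from fractions import Fraction


def det_condensation(matrix):
    """Determinant by Chio-style pivotal condensation (exact arithmetic).

    Each step replaces an n x n matrix A with pivot p = A[0][0] by the
    (n-1) x (n-1) matrix B[i][j] = p*A[i+1][j+1] - A[i+1][0]*A[0][j+1],
    using the identity det(A) = det(B) / p**(n-2).
    """
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    if n == 0:
        return Fraction(1)
    sign = 1
    scale = Fraction(1)
    while n > 1:
        # Find a row whose leading entry is non-zero and move it to the top.
        pivot_row = next((r for r in range(n) if a[r][0] != 0), None)
        if pivot_row is None:
            return Fraction(0)  # a zero first column means a zero determinant
        if pivot_row != 0:
            a[0], a[pivot_row] = a[pivot_row], a[0]
            sign = -sign  # a row swap flips the sign of the determinant
        p = a[0][0]
        a = [[p * a[i][j] - a[i][0] * a[0][j] for j in range(1, n)]
             for i in range(1, n)]
        scale *= p ** (n - 2)
        n -= 1
    return sign * a[0][0] / scale


if __name__ == "__main__":
    print(det_condensation([[2, -1, 0], [-1, 2, -1], [0, -1, 2]]))  # prints 4
```

Every entry of the condensed matrix is a 2 × 2 determinant that depends only on the previous matrix, so all entries of one condensation step can be computed in parallel, which is what makes the approach attractive on a GPU.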
Design and Testing of GPU based RTC for TMT NFIRAOS
Wang, Lianqi
2013-12-01
Graphical processing units (GPUs) are now gaining popularity in general computing applications due to their high computing power and high memory bandwidth (~10x of CPUs). For the same reason, GPUs are also suitable processors for the real time controllers (RTCs) of next generation adaptive optics (AO) systems. In this talk, we present a CPU+GPU based RTC design for the Thirty Meter Telescope (TMT) Narrow Field Infrared AO System (NFIRAOS), as part of an ongoing trade study of control algorithms and processor hardware options. We demonstrate that the system will meet the stringent latency requirement of first computing gradients for ~15500 laser guide star wavefront sensor sub-apertures, and then commands for ~7000 deformable mirror actuators at 800 Hz, using 12 Nvidia GTX 580 GPUs (2 GPUs per WFS). A classical matrix vector multiply reconstruction algorithm is used for its simplicity and parallelizability. Obtaining the conventional control matrix by inverting the forward influence matrix is impractical due to the large system size and sub-optimal performance due to lacking proper regularization. Instead, the control matrix implements a minimum variance wavefront reconstruction algorithm and is computed column-by-column using an iterative solver. We demonstrate that we can initialize the control matrix in about 1 minute and update it in 10 seconds as operating conditions vary to maintain optimal performance. Additionally, the weights used to compute the subaperture gradients are updated at a similar rate to track changes in the profile of the mesospheric sodium layer. These soft real time and background processes will largely be handled by CPUs. Finally, we will show a first version of the complete block diagram of data flow and mapping to hardware.
Parallel fuzzy connected image segmentation on GPU.
Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W
2011-07-01
Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, respectively, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, respectively, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method
Januszewski, M.; Kostur, M.
2014-09-01
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License, version 3 No. of lines in distributed program, including test data, etc.: 225864 No. of bytes in distributed program, including test data, etc.: 46861049 Distribution format: tar.gz Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures
International Nuclear Information System (INIS)
Neylon, J.; Sheng, K.; Yu, V.; Low, D. A.; Kupelian, P.; Santhanam, A.; Chen, Q.
2014-01-01
Purpose: Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. Methods: The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures.
Neylon, J; Sheng, K; Yu, V; Chen, Q; Low, D A; Kupelian, P; Santhanam, A
2014-10-01
Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria, respectively
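The gamma comparison with 2% dose-difference and 2 mm distance-to-agreement criteria can be illustrated in one dimension. The sketch below is a simplified, assumed form (global normalization to the maximum reference dose, brute-force search over all evaluated points, no interpolation) and is not the authors' implementation; function names are hypothetical.

```python
import math


def gamma_index_1d(positions_mm, reference, evaluated,
                   dose_tol=0.02, dta_mm=2.0):
    """Gamma value for each reference point of a 1-D dose profile.

    Assumptions for illustration: the dose criterion is dose_tol times the
    maximum reference dose (global normalization) and the evaluated profile
    is sampled at the same positions as the reference profile.
    """
    dose_norm = dose_tol * max(reference)
    gammas = []
    for x_ref, d_ref in zip(positions_mm, reference):
        # Smallest combined distance/dose discrepancy over all evaluated points.
        gammas.append(min(
            math.hypot((x_ev - x_ref) / dta_mm, (d_ev - d_ref) / dose_norm)
            for x_ev, d_ev in zip(positions_mm, evaluated)
        ))
    return gammas


def pass_rate(gammas):
    """Fraction of points with gamma <= 1, the usual acceptance test."""
    return sum(g <= 1.0 for g in gammas) / len(gammas)


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    ref = [0.0, 50.0, 100.0, 50.0, 0.0]
    ev = [d * 1.01 for d in ref]  # evaluated dose 1% higher everywhere
    print(pass_rate(gamma_index_1d(x, ref, ev)))
```

A point passes when some nearby evaluated point agrees within the combined tolerance, so a small spatial shift in a steep gradient is not penalized as heavily as a pure dose difference would suggest.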
Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang
2018-04-01
The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.
Dardikman, Gili; Shaked, Natan T.
2016-03-01
We present highly parallel and efficient algorithms for real-time reconstruction of the quantitative three-dimensional (3-D) refractive-index maps of biological cells without labeling, as obtained from the interferometric projections acquired by tomographic phase microscopy (TPM). The new algorithms are implemented on the graphic processing unit (GPU) of the computer using the CUDA programming environment. The reconstruction process includes two main parts. First, we used parallel complex wave-front reconstruction of the TPM-based interferometric projections acquired at various angles. The complex wave-front reconstructions are done on the GPU in parallel, while minimizing the calculation time of the Fourier transforms and phase unwrapping needed. Next, we implemented on the GPU in parallel the 3-D refractive index map retrieval using the TPM filtered-back projection algorithm. The incorporation of algorithms that are inherently parallel with a programming environment such as Nvidia's CUDA makes it possible to obtain real-time processing rates, and enables a high-throughput platform for label-free, 3-D cell visualization and diagnosis.
Multi-GPU based framework for real-time motion analysis and tracking in multi-user scenarios
Directory of Open Access Journals (Sweden)
Sidi Ahmed Mahmoudi
2015-02-01
Video processing algorithms present a necessary tool for various domains related to computer vision such as motion tracking, event detection and localization in multi-user scenarios (crowd videos, mobile camera, scenes with noise, etc.). However, the new video standards, especially those in high definitions require more computation since their treatment is applied on large video frames. As a result, the current implementations, even running on modern hardware, cannot provide a real-time processing (25 frames per second, fps). Several solutions have been proposed to overcome this constraint, by exploiting graphic processing units (GPUs). Although they exploit GPU platforms, they are not able to provide a real-time processing of high definition video sequences. In this work, we propose a new framework that enables an efficient exploitation of single and multiple GPUs, in order to achieve real-time processing of Full HD or even 4K video standards. Moreover, the framework includes several GPU based primitive functions related to motion analysis and tracking methods, such as silhouette extraction, contours extraction, corners detection and tracking using optical flow estimation. Based on this framework, we developed several real-time and GPU based video processing applications such as motion detection using moving camera, event detection and event localization.
Energy Technology Data Exchange (ETDEWEB)
Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M. [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Vega, J.; Castro, R. [Asociación EURATOM/CIEMAT para Fusión, Madrid (Spain)
2016-11-15
Highlights: • To test an image acquisition and processing system for Camera Link devices based on an FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to an NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPU processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies that are being used in real time image processing are Field Programmable Gate Array (FPGA) and Graphical Processor Unit (GPU) due to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated in data acquisition systems, where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, to develop image acquisition and processing systems based on FPGAs and GPUs compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology), and an NVIDIA Tesla GPU series card has been developed and tested. The architecture proposed has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving the data directly between the different hardware elements (FPGA DAQ-GPU-CPU) avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. Special effort has been made to provide a development methodology that, while maintaining the highest possible abstraction from the low level implementation details, yields solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.
Ha, Sanghyun; Park, Junshin; You, Donghyun
2018-01-01
The utility of the computational power of Graphics Processing Units (GPUs) is examined for solutions of the incompressible Navier-Stokes equations, which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the programming model of Compute Unified Device Architecture (CUDA) that are critical to the bandwidth-bound nature of the present method are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grids. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.
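To illustrate the tridiagonal systems that the ADI step relies on, the following is a minimal serial sketch of the Thomas algorithm in Python. It is not the authors' CUDA implementation (the abstract only states that CUDA libraries are used); the function name `solve_tridiagonal` and the argument layout are assumptions made for this example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is ignored),
    diag[i] multiplies x[i],
    upper[i] multiplies x[i+1] (upper[-1] should be 0).
    Assumes no pivoting is needed, e.g. a diagonally dominant matrix.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper-diagonal coefficients
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination sweep
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution sweep
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Each sweep is inherently sequential along one grid line, so GPU versions generally gain their parallelism by solving many independent lines at once, which is consistent with the abstract's emphasis on data layout.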
Sachetto Oliveira, Rafael; Martins Rocha, Bernardo; Burgarelli, Denise; Meira, Wagner; Constantinides, Christakis; Weber Dos Santos, Rodrigo
2018-02-01
The use of computer models as a tool for the study and understanding of the complex phenomena of cardiac electrophysiology has attained increased importance nowadays. At the same time, the increased complexity of the biophysical processes translates into complex computational and mathematical models. To speed up cardiac simulations and to allow more precise and realistic uses, 2 different techniques have been traditionally exploited: parallel computing and sophisticated numerical methods. In this work, we combine a modern parallel computing technique based on multicore and graphics processing units (GPUs) and a sophisticated numerical method based on a new space-time adaptive algorithm. We evaluate each technique alone and in different combinations: multicore and GPU, multicore and GPU and space adaptivity, multicore and GPU and space adaptivity and time adaptivity. All the techniques and combinations were evaluated under different scenarios: 3D simulations on slabs, 3D simulations on a ventricular mouse mesh, i.e., complex geometry, sinus-rhythm, and arrhythmic conditions. Our results suggest that multicore and GPU accelerate the simulations by an approximate factor of 33×, whereas the speedups attained by the space-time adaptive algorithms were approximately 48×. Nevertheless, by combining all the techniques, we obtained speedups that ranged between 165× and 498×. The tested methods were able to reduce the execution time of a simulation by more than 498× for a complex cellular model in a slab geometry and by 165× in a realistic heart geometry simulating spiral waves. The proposed methods will allow faster and more realistic simulations in a feasible time with no significant loss of accuracy. Copyright © 2017 John Wiley & Sons, Ltd.
Directory of Open Access Journals (Sweden)
David S. Hardin
2013-04-01
As Graphics Processing Units (GPUs) have gained in capability and GPU development environments have matured, developers are increasingly turning to the GPU to off-load the main host CPU of numerically-intensive, parallelizable computations. Modern GPUs feature hundreds of cores, and offer programming niceties such as double-precision floating point, and even limited recursion. This shift from CPU to GPU, however, raises the question: how do we know that these new GPU-based algorithms are correct? In order to explore this new verification frontier, we formalized a parallelizable all-pairs shortest path (APSP) algorithm for weighted graphs, originally coded in NVIDIA's CUDA language, in ACL2. The ACL2 specification is written using a single-threaded object (stobj) and tail recursion, as the stobj/tail recursion combination yields the most straightforward translation from imperative programming languages, as well as efficient, scalable executable specifications within ACL2 itself. The ACL2 version of the APSP algorithm can process millions of vertices and edges with little to no garbage generation, and executes at one-sixth the speed of a host-based version of APSP coded in C – a very respectable result for a theorem prover. In addition to formalizing the APSP algorithm (which uses Dijkstra's shortest path algorithm at its core), we have also provided capability that the original APSP code lacked, namely shortest path recovery. Path recovery is accomplished using a secondary ACL2 stobj implementing a LIFO stack, which is proven correct. To conclude the experiment, we ported the ACL2 version of the APSP kernels back to C, resulting in a less than 5% slowdown, and also performed a partial back-port to CUDA, which, surprisingly, yielded a slight performance increase.
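As an illustration of the approach described in this abstract (Dijkstra's algorithm run from every source, with path recovery through a LIFO stack), here is a minimal Python sketch. It is neither the ACL2 specification nor the CUDA kernel; the adjacency-list format and the names `dijkstra`, `recover_path` and `all_pairs` are assumptions for this example.

```python
import heapq


def dijkstra(adj, source):
    """Single-source shortest paths; adj maps vertex -> list of (neighbor, weight)."""
    dist = {v: float("inf") for v in adj}
    pred = {v: None for v in adj}  # predecessor on the best known path
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, pred


def recover_path(pred, source, target):
    """Walk predecessors back from target, pushing onto a LIFO stack."""
    stack = []
    node = target
    while node is not None:
        stack.append(node)
        if node == source:
            break
        node = pred[node]
    if not stack or stack[-1] != source:
        return []  # target is unreachable from source
    path = []
    while stack:
        path.append(stack.pop())  # popping reverses into source-to-target order
    return path


def all_pairs(adj):
    """APSP as one independent Dijkstra run per source vertex."""
    return {s: dijkstra(adj, s) for s in adj}
```

Because each source is processed independently, the outer loop in `all_pairs` is the part that lends itself to parallel execution.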
Implementation of GPU parallel equilibrium reconstruction for plasma control in EAST
International Nuclear Information System (INIS)
Huang, Yao; Xiao, B.J.; Luo, Z.P.; Yuan, Q.P.; Pei, X.F.; Yue, X.N.
2016-01-01
Highlights: • We describe how the parallel equilibrium reconstruction code P-EFIT, running on a GPU, was integrated with the EAST plasma control system. • Compared with RT-EFIT used in EAST, P-EFIT has better spatial resolution and runs the full EFIT algorithm in each iteration. • With the data interface through RFM, P-EFIT on a 65 × 65 spatial grid can satisfy the accuracy and time feasibility requirements for plasma control. • Successful control using ISOFLUX/P-EFIT was established in a dedicated experiment during the EAST 2014 campaign. • This work is a stepping-stone towards versatile ISOFLUX/P-EFIT control, such as real-time equilibrium reconstruction with more diagnostics. - Abstract: Implementation of the P-EFIT code for plasma control in EAST is described. P-EFIT is based on the EFIT framework, but built with the CUDA™ architecture to take advantage of massively parallel Graphical Processing Unit (GPU) cores to significantly accelerate the computation. With a 65 × 65 grid, P-EFIT can complete one reconstruction iteration in 300 μs; with a one-iteration strategy, it can satisfy the needs of real-time plasma shape control. The data interface between P-EFIT and the PCS is realized by transferring data through RFM. The first application of P-EFIT to discharge control in EAST is described.
GPU-based real-time triggering in the NA62 experiment
Ammendola, R.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P.S.; Pastorelli, E.; Piandani, R.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2016-01-01
Over the last few years the GPGPU (General-Purpose computing on Graphics Processing Units) paradigm has represented a remarkable development in the world of computing. Computing for High-Energy Physics is no exception: several works have demonstrated the effectiveness of integrating GPU-based systems in the high level trigger of different experiments. On the other hand, the use of GPUs in low level trigger systems, characterized by stringent real-time constraints such as a tight time budget and high throughput, poses several challenges. In this paper we focus on the low level trigger in the CERN NA62 experiment, investigating the use of real-time computing on GPUs in this synchronous system. Our approach aimed at harvesting the GPU computing power to build, in real time, refined physics-related trigger primitives for the RICH detector, as the knowledge of Cerenkov ring parameters allows stringent conditions for data selection to be built at trigger level. Latencies of all components of the trigger chain have...
GPU-based ultra-fast dose calculation using a finite size pencil beam model
Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B.
2009-10-01
Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposition coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedup ranging from 200 to 400 times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.
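To show why dose deposition coefficients matter for plan optimization, the sketch below computes voxel doses as a weighted sum of per-beamlet contributions. This is a generic illustration and not the FSPB engine from the paper; the matrix layout (one row per voxel, one column per beamlet) and the name `dose_from_beamlets` are assumptions.

```python
def dose_from_beamlets(dij, weights):
    """Return the dose in each voxel.

    dij[i][j] is the dose deposited in voxel i by beamlet j at unit intensity
    (hypothetical layout); weights[j] is the intensity of beamlet j.
    """
    return [sum(d * w for d, w in zip(row, weights)) for row in dij]
```

Each voxel's sum is independent of the others, which is the data-parallel structure the abstract refers to.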
GPU Enhancement of the Trigger to Extend Physics Reach at the LHC
Lujan, P.; Halyo, V.; Hunt, A.; Jindal, P.; LeGresley, P.
2014-06-01
At the Large Hadron Collider (LHC), the trigger systems for the detectors must be able to process a very large amount of data in a very limited amount of time, so that the nominal collision rate of 40 MHz can be reduced to a data rate that can be stored and processed in a reasonable amount of time. This need for high performance places very stringent requirements on the complexity of the algorithms that can be used for identifying events of interest in the trigger system, which potentially limits the ability to trigger on signatures of various new physics models. In this paper, we present an alternative tracking algorithm, based on the Hough transform, which avoids many of the problems associated with the standard combinatorial track finding currently used. The Hough transform is also well-adapted for Graphics Processing Unit (GPU)-based computing, and such GPU-based systems could be easily integrated into the existing High-Level Trigger (HLT). This algorithm offers the ability to trigger on topological signatures of new physics currently not practical to reconstruct, such as events with jets or black holes significantly displaced from the primary vertex. This paper presents, for the first time, an implementation and preliminary performance results using NVIDIA Tesla C2075 and K20c GPUs.
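The voting idea behind a Hough transform can be sketched in a few lines of Python. The example below finds a straight line through 2D hits using the (theta, rho) parametrization; this is an assumption chosen for simplicity, since the abstract does not state the track model or binning used in the actual trigger algorithm, and the function name and parameters are hypothetical.

```python
import math


def hough_lines(points, n_theta=180, rho_max=10.0, n_rho=200):
    """Vote in a (theta, rho) accumulator and return the strongest line.

    A line is described by rho = x*cos(theta) + y*sin(theta).
    Returns (votes, theta, rho) for the accumulator cell with the most votes.
    """
    acc = [[0] * n_rho for _ in range(n_theta)]
    for x, y in points:
        # Each hit votes independently for every line that could pass through it
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int((rho + rho_max) / (2 * rho_max) * n_rho)
            if 0 <= r < n_rho:
                acc[t][r] += 1
    votes, t, r = max(
        (acc[t][r], t, r) for t in range(n_theta) for r in range(n_rho)
    )
    theta = math.pi * t / n_theta
    rho = (r + 0.5) / n_rho * 2 * rho_max - rho_max  # centre of the rho bin
    return votes, theta, rho
```

Since every hit fills the accumulator independently of the others, the voting loop maps naturally onto one GPU thread per hit or per (hit, theta) pair.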
GPU Enhancement of the Trigger to Extend Physics Reach at the LHC
Lujan, P.; Hunt, A.; Jindal, P.; LeGresley, P.
2014-01-01
At the Large Hadron Collider (LHC), the trigger systems for the detectors must be able to process a very large amount of data in a very limited amount of time, so that the nominal collision rate of 40 MHz can be reduced to a data rate that can be stored and processed in a reasonable amount of time. This need for high performance places very stringent requirements on the complexity of the algorithms that can be used for identifying events of interest in the trigger system, which potentially limits the ability to trigger on signatures of various new physics models. In this paper, we present an alternative tracking algorithm, based on the Hough transform, which avoids many of the problems associated with the standard combinatorial track finding currently used. The Hough transform is also well-adapted for Graphics Processing Unit (GPU)-based computing, and such GPU-based systems could be easily integrated into the existing High-Level Trigger (HLT). This algorithm offers the ability to trigger on topological signa...
A Survey on GPU-Based Implementation of Swarm Intelligence Algorithms.
Tan, Ying; Ding, Ke
2016-09-01
Inspired by the collective behavior of natural swarms, swarm intelligence algorithms (SIAs) have been developed and widely used for solving optimization problems. When applied to complex problems, a large number of fitness function evaluations are needed to obtain an acceptable solution. To tackle this vital issue, graphics processing units (GPUs) have been used to accelerate the optimization procedure of SIAs. Thanks to their inherent parallelism, SIAs are very suitable for parallel implementation on the GPU platform, which has achieved great success in recent years. This paper presents a comprehensive review of GPU-based parallel SIAs in accordance with a newly proposed taxonomy. Critical concerns for the efficient parallel implementation of SIAs are also described in detail. Moreover, novel criteria are proposed to evaluate and compare parallel implementations and algorithm performance universally. The rationality and practicability of the proposed optimization methodology and criteria are verified by a careful case study. Finally, our opinions and perspectives on the trends and prospects of this relatively new research domain are presented for future development.
Performance analysis of parallel gravitational N-body codes on large GPU clusters
International Nuclear Information System (INIS)
Huang, Si-Yi; Spurzem, Rainer; Berczik, Peter
2016-01-01
We compare the performance of two very different parallel gravitational N-body codes for astrophysical simulations on large Graphics Processing Unit (GPU) clusters, both of which are pioneers in their own fields as well as on certain mutual scales - NBODY6++ and Bonsai. We carry out benchmarks of the two codes by analyzing their performance, accuracy and efficiency through the modeling of structure decomposition and timing measurements. We find that both codes are heavily optimized to leverage the computational potential of GPUs as their performance has approached half of the maximum single precision performance of the underlying GPU cards. With such performance we predict that a speed-up of 200 – 300 can be achieved when up to 1k processors and GPUs are employed simultaneously. We discuss the quantitative information about comparisons of the two codes, finding that in the same cases Bonsai adopts larger time steps as well as larger relative energy errors than NBODY6++, typically ranging from 10 – 50 times larger, depending on the chosen parameters of the codes. Although the two codes are built for different astrophysical applications, in specified conditions they may overlap in performance at certain physical scales, thus allowing the user to choose either one by fine-tuning parameters accordingly. (paper)
ROI-Based On-Board Compression for Hyperspectral Remote Sensing Images on GPU.
Giordano, Rossella; Guccione, Pietro
2017-05-19
In recent years, hyperspectral sensors for Earth remote sensing have become very popular. Such systems are able to provide the user with images having both spectral and spatial information. The current hyperspectral spaceborne sensors are able to capture large areas with increased spatial and spectral resolution. For this reason, the volume of acquired data needs to be reduced on board in order to avoid a low orbital duty cycle due to limited storage space. Recently, the literature has focused on efficient methods for on-board data compression. This is a challenging task because of the difficult environment (outer space) and the limited time, power and computing resources. Often, Graphics Processing Units (GPUs) have been adopted to reduce the processing time through parallel computing. The current work proposes a framework for on-board operation on a GPU, using NVIDIA's CUDA (Compute Unified Device Architecture). The algorithm aims at performing on-board compression using a target-related strategy. In detail, the main operations are: the automatic recognition of land cover types or detection of events in near real time in regions of interest (this is a user-related choice) with an unsupervised classifier; the compression of specific regions with space-variant bit rates, including Principal Component Analysis (PCA), wavelet and arithmetic coding; and data volume management to the Ground Station. Experiments are provided using a real dataset taken from an AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) airborne sensor in a harbor area.
GPU based real-time instrument tracking with three-dimensional ultrasound.
Novotny, Paul M; Stoll, Jeff A; Vasilyev, Nikolay V; del Nido, Pedro J; Dupont, Pierre E; Zickler, Todd E; Howe, Robert D
2007-10-01
Real-time three-dimensional ultrasound enables new intracardiac surgical procedures, but the distorted appearance of instruments in ultrasound poses a challenge to surgeons. This paper presents a detection technique that identifies the position of the instrument within the ultrasound volume. The algorithm uses a form of the generalized Radon transform to search for long straight objects in the ultrasound image, a feature characteristic of instruments and not found in cardiac tissue. When combined with passive markers placed on the instrument shaft, the full position and orientation of the instrument is found in 3D space. This detection technique is amenable to rapid execution on the current generation of personal computer graphics processing units (GPUs). Our GPU implementation detected a surgical instrument in 31 ms, sufficient for real-time tracking at the 25 volumes per second rate of the ultrasound machine. A water tank experiment found instrument orientation errors of 1.1 degrees and tip position errors of less than 1.8 mm. Finally, an in vivo study demonstrated successful instrument tracking inside a beating porcine heart.
Rapid block matching based nonlinear registration on GPU for image guided radiation therapy
Wang, An; Disher, Brandon; Carnes, Greg; Peters, Terry M.
2010-02-01
To compensate for non-uniform deformation due to patient motion within and between fractions in image guided radiation therapy, a block matching technique was adapted and implemented on a standard graphics processing unit (GPU) to determine the displacement vector field that maps the nonlinear transformation between successive CT images. Normalized cross correlation (NCC) was chosen as the similarity metric for the matching step, with regularization of the displacement vector field being performed by Gaussian smoothing. A multi-resolution framework was adopted to further improve the performance of the algorithm. The nonlinear registration algorithm was first applied to estimate the intrafractional motion from 4D lung CT images. It was also used to calculate the inter-fractional organ deformation between planning CT (PCT) and daily cone beam CT (CBCT) images of the thorax. For both experiments, manual landmark-based evaluation was performed to quantify the registration performance. In 4D CT registration, the mean target registration error (TRE) of 5 cases was 1.75 mm. In PCT-CBCT registration, the TRE of one case was 2.26 mm. Compared to the CPU-based AtamaiWarp program, our GPU-based implementation achieves comparable registration accuracy and is ~25 times faster. The results highlight the potential utility of our algorithm for online adaptive radiation treatment.
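The matching step described above can be sketched in 2D as an exhaustive search that maximizes normalized cross correlation between a reference block and candidate blocks. This is a minimal illustration in Python, not the authors' GPU code: it omits the Gaussian regularization and the multi-resolution scheme, and the function names and nested-list image format are assumptions.

```python
import math


def ncc(a, b):
    """Normalized cross correlation between two equally sized flat blocks."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # flat block: correlation is undefined, treat as no match
    return num / (norm_a * norm_b)


def block(img, r, c, size):
    """Flatten the size x size block whose top-left corner is (r, c)."""
    return [img[r + i][c + j] for i in range(size) for j in range(size)]


def match_block(fixed, moving, r, c, size, search):
    """Return the displacement (dr, dc) within +/- search that maximizes NCC."""
    ref = block(fixed, r, c, size)
    best = (-2.0, 0, 0)  # NCC is always >= -1, so any candidate beats this
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            rr, cc = r + dr, c + dc
            if rr < 0 or cc < 0 or rr + size > len(moving) or cc + size > len(moving[0]):
                continue  # candidate block falls outside the image
            score = ncc(ref, block(moving, rr, cc, size))
            if score > best[0]:
                best = (score, dr, dc)
    return best[1], best[2]
```

Every block, and every candidate displacement for a block, can be scored independently, which is what makes the method a good fit for a GPU.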
GPU-Based Block-Wise Nonlocal Means Denoising for 3D Ultrasound Images
Directory of Open Access Journals (Sweden)
Liu Li
2013-01-01
Speckle suppression plays an important role in improving ultrasound (US) image quality. While many algorithms have been proposed for 2D US image denoising with remarkable filtering quality, relatively little work has been done on 3D ultrasound speckle suppression, where the whole volume rather than just one frame needs to be considered. The most crucial problem with 3D US denoising is that the computational complexity increases tremendously. The nonlocal means (NLM) filter provides an effective method for speckle suppression in US images. In this paper, a fast NLM filter based on a programmable graphics processing unit (GPU) is proposed for 3D ultrasound speckle reduction. A Gamma distribution noise model, which is able to reliably capture image statistics for log-compressed ultrasound images, was used for the 3D block-wise NLM filter within a Bayesian framework. The most significant aspect of our method was the adoption of the powerful data-parallel computing capability of the GPU to improve overall efficiency. Experimental results demonstrate that the proposed method can enormously accelerate the algorithm.
Edge-preserving image denoising via group coordinate descent on the GPU.
McGaffin, Madison Gray; Fessler, Jeffrey A
2015-04-01
Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel 1D pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates in-place and only store the noisy data, denoised image, and problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation. Both algorithms use the majorize-minimize framework to solve the 1D pixel update subproblem. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration and run-time.
GPU acceleration of Eulerian-Lagrangian particle-laden turbulent flow simulations
Richter, David; Sweet, James; Thain, Douglas
2017-11-01
The Lagrangian point-particle approximation is a popular numerical technique for representing dispersed phases whose properties can substantially deviate from the local fluid. In many cases, particularly in the limit of one-way coupled systems, large numbers of particles are desired; this may be either because many physical particles are present (e.g. LES of an entire cloud), or because the use of many particles increases statistical convergence (e.g. high-order statistics). Solving the trajectories of very large numbers of particles can be problematic in traditional MPI implementations, however, and this study reports the benefits of using graphical processing units (GPUs) to integrate the particle equations of motion while preserving the original MPI version of the Eulerian flow solver. It is found that GPU acceleration becomes cost effective around one million particles, and performance enhancements of up to 15x can be achieved when O(10^8) particles are computed on the GPU rather than the CPU cluster. Optimizations and limitations will be discussed, as will prospects for expanding to two- and four-way coupled systems. ONR Grant No. N00014-16-1-2472.
GPU-Accelerated Foreground Segmentation and Labeling for Real-Time Video Surveillance
Directory of Open Access Journals (Sweden)
Wei Song
2016-09-01
Real-time and accurate background modeling is an important research topic in the fields of remote monitoring and video surveillance. Meanwhile, effective foreground detection is a preliminary requirement and decision-making basis for sustainable energy management, especially in smart meters. The environment monitoring results provide a decision-making basis for energy-saving strategies. For real-time moving object detection in video, this paper applies a parallel computing technology to develop a feedback foreground–background segmentation method and a parallel connected component labeling (PCCL) algorithm. In the background modeling method, pixel-wise color histograms in graphics processing unit (GPU) memory are generated from sequential images. If a pixel color in the current image does not lie near the peaks of its histogram, it is segmented as a foreground pixel. From the foreground segmentation results, a PCCL algorithm is proposed to cluster the foreground pixels into several groups in order to distinguish separate blobs. Because noisy spots and sparkles in the foreground segmentation results always contain a small number of pixels, the small blobs are removed as noise in order to refine the segmentation results. The proposed GPU-based image processing algorithms are implemented using the compute unified device architecture (CUDA) toolkit. The testing results show a significant enhancement in both speed and accuracy.
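The blob-grouping and noise-removal steps described above can be sketched serially in Python. Note that this is a plain breadth-first connected component labeling, swapped in for readability; it is not the parallel PCCL algorithm of the paper. The 4-connectivity, the `min_size` threshold and the function name are assumptions for this example.

```python
from collections import deque


def label_blobs(mask, min_size=2):
    """Label 4-connected foreground blobs; blobs smaller than min_size become 0.

    mask is a list of rows of 0/1 values. Returns a label image of the same
    shape in which kept blobs are numbered 1, 2, ... in scan order.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c] != 0:
                continue
            # Flood-fill one blob starting from this unlabelled foreground pixel
            labels[r][c] = next_label
            blob = [(r, c)]
            queue = deque(blob)
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols:
                        if mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = next_label
                            blob.append((ny, nx))
                            queue.append((ny, nx))
            if len(blob) < min_size:
                for y, x in blob:
                    labels[y][x] = -1  # mark as noise so it is not revisited
            else:
                next_label += 1
    # Noise pixels end up as background in the final label image
    return [[0 if v == -1 else v for v in row] for row in labels]
```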
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters
Energy Technology Data Exchange (ETDEWEB)
Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL
2010-01-01
Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
Papaya Tree Detection with UAV Images Using a GPU-Accelerated Scale-Space Filtering Method
Directory of Open Access Journals (Sweden)
Hao Jiang
2017-07-01
The use of unmanned aerial vehicles (UAVs) can allow individual tree detection for forest inventories in a cost-effective way. The scale-space filtering (SSF) algorithm is commonly used and has the capability of detecting trees of different crown sizes. In this study, we made two improvements with regard to the existing method and implementations. First, we incorporated SSF with a Lab color transformation to reduce over-detection problems associated with the original luminance image. Second, we ported four of the most time-consuming processes to the graphics processing unit (GPU) to improve computational efficiency. The proposed method was implemented using PyCUDA, which enabled access to NVIDIA’s compute unified device architecture (CUDA) through high-level scripting in the Python language. Our experiments were conducted using two images captured by the DJI Phantom 3 Professional and a recent NVIDIA GTX1080 GPU. The resulting accuracy was high, with an F-measure larger than 0.94. The speedup achieved by our parallel implementation was 44.77 and 28.54 for the first and second test image, respectively. For each 4000 × 3000 image, the total runtime was less than 1 s, which was sufficient for real-time performance and interactive application.
Reconstruction of the neutron spectrum using an artificial neural network in CPU and GPU
International Nuclear Information System (INIS)
Hernandez D, V. M.; Moreno M, A.; Ortiz L, M. A.; Vega C, H. R.; Alonso M, O. E.
2016-10-01
Computing power in personal computers has been increasing; computers now have several processors in the CPU and, in addition, multiple CUDA cores in the graphics processing unit (GPU). Both systems can be used individually or combined to perform scientific computation without resorting to processor or supercomputing arrangements. The Bonner sphere spectrometer is the most commonly used multi-element system for neutron detection and for determining the associated spectrum. Each sphere-detector combination gives a particular response that depends on the energy of the neutrons, and the total set of these responses is known as the response matrix Rφ(E). Thus, the counting rates obtained with each sphere and the neutron spectrum are related through the Fredholm equation in its discrete version. Reconstructing the spectrum requires solving a poorly conditioned system of equations with an infinite number of solutions; to find the appropriate solution, the use of artificial intelligence through neural networks has been proposed, on different platforms (CPU and GPU). (Author)
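For reference, the discrete Fredholm relation between count rates and spectrum is commonly written as below. The notation is assumed here rather than taken from the abstract: C_j is the count rate of sphere j, R_{j,i} is the response of sphere j to neutrons in energy bin i, Φ_i is the fluence in bin i, m is the number of spheres and N the number of energy bins.

```latex
C_j = \sum_{i=1}^{N} R_{j,i}\,\Phi_i , \qquad j = 1, \dots, m
```

When there are fewer spheres than energy bins (m < N), the system is underdetermined, which is consistent with the abstract's remark that it admits an infinite number of solutions.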
A Generic High-performance GPU-based Library for PDE solvers
DEFF Research Database (Denmark)
Glimberg, Stefan Lemvig; Engsig-Karup, Allan Peter
Recent development of massively parallel processors such as graphical processing units (GPUs), has already proven to be very effective for a vast amount of scientific applications. One major benefit of the GPU is that it is already a standard device in most affordable desktop computers. Thus......, the privilege of high-performance parallel computing is now in principle accessible for many scientific users, no matter their economic resources. Though being highly effective units, GPUs and parallel architectures in general, pose challenges for software developers to utilize their efficiency. Sequential...... legacy codes are not always easily parallelized and the time spent on conversion might not pay off in the end. We present a highly generic C++ library for fast assembling of partial differential equation (PDE) solvers, aiming at utilizing the computational resources of GPUs. The library requires a minimum...
High-Performance Matrix-Vector Multiplication on the GPU
DEFF Research Database (Denmark)
Sørensen, Hans Henrik Brandenborg
2012-01-01
In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing...
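As a rough illustration of why the matrix-vector product suits a GPU, the sketch below organises y = A x as one independent task per output row. This is only an assumed, commonly used decomposition written in plain Python; it is not the kernel developed in the paper, and the function name `matvec_row_parallel` is hypothetical.

```python
def matvec_row_parallel(A, x):
    """Dense y = A x, organised as one independent task per row."""
    n_cols = len(x)
    if any(len(row) != n_cols for row in A):
        raise ValueError("every row of A must have len(x) entries")

    def row_task(i):
        # Work that a single GPU thread (or thread group) could do for row i.
        acc = 0.0
        for j in range(n_cols):
            acc += A[i][j] * x[j]
        return acc

    # Rows do not depend on each other, so this loop is the parallel part.
    return [row_task(i) for i in range(len(A))]


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
y = matvec_row_parallel(A, x)
```

Because every row reads the same vector `x`, a real kernel would typically also try to reuse `x` from fast on-chip memory; that tuning is outside this sketch.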
GPU accelerated CT reconstruction for clinical use: quality driven performance
Vaz, Michael S.; Sneyders, Yuri; McLin, Matthew; Ricker, Alan; Kimpe, Tom
2007-03-01
We present performance and quality analysis of GPU accelerated FDK filtered backprojection for cone beam computed tomography (CBCT) reconstruction. Our implementation of the FDK CT reconstruction algorithm does not compromise fidelity at any stage and yields a result that is within 1 HU of a reference C++ implementation. Our streaming implementation is able to perform reconstruction as the images are acquired; it addresses low latency as well as fast throughput, which are key considerations for a "real-time" design. Further, it is scaleable to multiple GPUs for increased performance. The implementation does not place any constraints on image acquisition; it works effectively for arbitrary angular coverage with arbitrary angular spacing. As such, this GPU accelerated CT reconstruction solution may easily be used with scanners that are already deployed. We are able to reconstruct a 512 x 512 x 340 volume from 625 projections, each sized 1024 x 768, in less than 50 seconds. The quoted 50 second timing encompasses the entire reconstruction using bilinear interpolation and includes filtering on the CPU, uploading the filtered projections to the GPU, and also downloading the reconstructed volume from GPU memory to system RAM.
GPU-accelerated denoising of 3D magnetic resonance images
Energy Technology Data Exchange (ETDEWEB)
Howison, Mark; Wes Bethel, E.
2014-05-29
The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.
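To make the roles of the stencil size and the scaling parameters concrete, here is a minimal 1D bilateral filter together with the MSE metric. It is a sketch under stated assumptions, not the study's 3D GPU implementation: the parameter names `radius`, `sigma_spatial` and `sigma_range` and the toy step signal are all hypothetical.

```python
import math


def bilateral_filter_1d(signal, radius, sigma_spatial, sigma_range):
    """Edge-preserving smoothing: weights fall off with distance AND intensity difference."""
    n = len(signal)
    out = []
    for i in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        # The stencil covers at most 2 * radius + 1 samples around i.
        for k in range(max(0, i - radius), min(n, i + radius + 1)):
            w_spatial = math.exp(-((k - i) ** 2) / (2.0 * sigma_spatial ** 2))
            w_range = math.exp(-((signal[k] - signal[i]) ** 2) / (2.0 * sigma_range ** 2))
            weighted_sum += w_spatial * w_range * signal[k]
            weight_total += w_spatial * w_range
        out.append(weighted_sum / weight_total)
    return out


def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


clean = [0.0] * 8 + [1.0] * 8                      # a step edge
noisy = [v + (0.05 if i % 2 else -0.05) for i, v in enumerate(clean)]
denoised = bilateral_filter_1d(noisy, radius=2, sigma_spatial=1.5, sigma_range=0.2)
```

In this toy case the range weight is what keeps the step edge sharp; choosing `sigma_range` much larger than the edge height would turn the filter into an ordinary blur, which illustrates why poorly chosen scaling parameters can hurt denoising quality.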
Multi-GPU maximum entropy image synthesis for radio astronomy
Cárcamo, M.; Román, P. E.; Casassus, S.; Moral, V.; Rannou, F. R.
2018-01-01
The maximum entropy method (MEM) is a well-known deconvolution technique in radio interferometry. This method solves a non-linear optimization problem with an entropy regularization term. Other heuristics such as CLEAN are faster but highly user dependent. Nevertheless, MEM has the following advantages: it is unsupervised, it has a statistical basis, and it offers better resolution and better image quality under certain conditions. This work presents a high-performance GPU version of non-gridding MEM, which is tested using real and simulated data. We propose a single-GPU and a multi-GPU implementation for single and multi-spectral data, respectively. We also make use of the Peer-to-Peer and Unified Virtual Addressing features of newer GPUs, which allow multiple GPUs to be exploited transparently and efficiently. Several ALMA data sets are used to demonstrate the effectiveness in imaging and to evaluate GPU performance. The results show that a speedup of 1000 to 5000 times over a sequential version can be achieved, depending on data and image size. This allows the HD142527 CO(6-5) short baseline data set to be reconstructed in 2.1 min, instead of the 2.5 days that a sequential version takes on a CPU.
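As a quick consistency check on the quoted timings (simple arithmetic on the two figures in the abstract, nothing more):

```latex
2.5\ \text{days} = 2.5 \times 24 \times 60\ \text{min} = 3600\ \text{min},
\qquad
\frac{3600\ \text{min}}{2.1\ \text{min}} \approx 1714
```

A ratio of roughly 1700 falls inside the reported speedup range of 1000 to 5000.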
GPU Acceleration of Graph Matching, Clustering, and Partitioning
Fagginger Auer, B.O.
2013-01-01
We consider sequential algorithms for hypergraph partitioning and GPU (i.e., fine-grained shared-memory parallel) algorithms for graph partitioning and clustering. Our investigation into sequential hypergraph partitioning is concerned with the efficient construction of high-quality matchings for
STEM image simulation with hybrid CPU/GPU programming
International Nuclear Information System (INIS)
Yao, Y.; Ge, B.H.; Shen, X.; Wang, Y.G.; Yu, R.C.
2016-01-01
STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. The results reveal some unintuitive phenomena about the contrast variation with the atom numbers.
The GPU implementation of micro-Doppler period estimation
Yang, Liyuan; Wang, Junling; Bi, Ran
2018-03-01
To address the high computational complexity and poor real-time performance of wideband radar echo signal processing, this paper designs a program based on the CPU-GPU heterogeneous parallel structure to improve the real-time extraction of micro-motion features. First, we discuss the principle of the micro-Doppler effect generated by the rolling of the scattering points on an orbiting satellite, and analyse how to use a Kalman filter to compensate for the translational motion of the tumbling satellite and how to use joint time-frequency analysis and the inverse Radon transform to extract the micro-motion features from the compensated echo. Second, the advantages of the GPU in terms of real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a GPU-based program flow for extracting the micro-motion features from the radar echo signal of a rolling satellite is designed. Finally, extraction results are given to verify the correctness of the program and algorithm.
MPI and GPU parallelization of novel SD algorithms
Indian Academy of Sciences (India)
the kernel code are brought in the memory of the GPU and also the relevant data, in the memory of each multiple processors. This way, the threads have direct access to the data relevant to them. Once all the threads are finished and the kernel code has been executed, the results are sent to the processor and RAM, and ...
Kalantzis, Georgios; Tachibana, Hidenobu
2014-01-01
For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Chen, Hao; Wei, Anqi; Zhang, Ye
2017-10-01
To accelerate the massive remote sensing images (RSIs) coding in a ground service-oriented remote sensing system, this study proposes three-level (i.e., tree-level, bit-plane level, and byte-level) parallel-set partitioning in hierarchical trees (TP-SPIHT) coding on a collaborative central and graphic processing unit (CPU and GPU) to parallelize SPIHT by optimizing its dynamic processing with the linked list. Basic parallel SPIHT coding is presented with preprocessing, tree-level parallel coding, and bit-stream organization using three kinds of static marker matrices instead of the dynamic linked lists originally used to remove the data dependency of the original SPIHT. The bit-stream organization is implemented on CPU and other processes are implemented on GPU using GPU streams. The bit-stream organization is further divided into a bit-plane level parallel bit-plane stream extraction and a final bit-stream organization on a multicore CPU. Because no dependencies exist between the different byte operations in the final bit-stream organization, this organization is accelerated by byte-level parallelization on the GPU. Experimental results with different sized RSIs show that TP-SPIHT takes 292.03 ms to code a 2048×2048 image and achieves a 6.27 times speedup compared with an optimized CPU implementation. The speedup ratio improves as the image increases from 256×256 to 2048×2048.
Performance evaluation and energy consumption for Wavefront pool configurations of an AMD GPU
Directory of Open Access Journals (Sweden)
Ariel Gustavo Zuquello
2016-07-01
The use of CPU-GPU heterogeneous systems to meet the growing demand for applications with large data parallelism results in the need to study and evaluate these architectures in order to improve them continuously. In this paper, we simulated the execution of a benchmark suite on an AMD ATI Radeon™ HD 7970 GPU in order to assess the impact on performance and energy consumption when changing the number of Wavefront Pools present in each GPU compute unit, which is 4 by default. The most significant result shows a speedup of about 5.7% for the configuration with two Wavefront Pools, together with an increase of about 5.1% in energy consumption. However, the other evaluated configurations also represent options for different kinds of needs, according to the category of computational demand. Keywords: Heterogeneous systems. Simulations. Performance.
Fast Simulation of Dynamic Ultrasound Images Using the GPU.
Storve, Sigurd; Torp, Hans
2017-10-01
Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
Belleman, R.G.; Bédorf, J.; Portegies Zwart, S.F.
2008-01-01
We present the results of gravitational direct N-body simulations using the graphics processing unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in "Compute Unified Device Architecture" (CUDA) using the GPU to
Blaze-DEMGPU: Modular high performance DEM framework for the GPU architecture
Govender, Nicolin; Wilke, Daniel N.; Kok, Schalk
Blaze-DEMGPU is a modular GPU based discrete element method (DEM) framework that supports polyhedral shaped particles. The high level performance is attributed to the light weight and Single Instruction Multiple Data (SIMD) that the GPU architecture offers. Blaze-DEMGPU offers suitable algorithms to conduct DEM simulations on the GPU and these algorithms can be extended and modified. Since a large number of scientific simulations are particle based, many of the algorithms and strategies for GPU implementation present in Blaze-DEMGPU can be applied to other fields. Blaze-DEMGPU will make it easier for new researchers to use high performance GPU computing as well as stimulate wider GPU research efforts by the DEM community.
Blaze-DEMGPU: Modular high performance DEM framework for the GPU architecture
Directory of Open Access Journals (Sweden)
Nicolin Govender
2016-01-01
Blaze-DEMGPU is a modular GPU based discrete element method (DEM) framework that supports polyhedral shaped particles. The high level performance is attributed to the light weight and Single Instruction Multiple Data (SIMD) that the GPU architecture offers. Blaze-DEMGPU offers suitable algorithms to conduct DEM simulations on the GPU and these algorithms can be extended and modified. Since a large number of scientific simulations are particle based, many of the algorithms and strategies for GPU implementation present in Blaze-DEMGPU can be applied to other fields. Blaze-DEMGPU will make it easier for new researchers to use high performance GPU computing as well as stimulate wider GPU research efforts by the DEM community.
Nalichowski, Adrian; Burmeister, Jay
2013-07-01
To compare optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans using the new TomoTherapy graphic processing unit (GPU) based dose engine and CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning treatment volume (PTV) included all the bones from head to mid femur except for upper extremities. Evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two modalities. The optimization and dose calculation times were compared between the two systems. The treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for any of the evaluated planning parameters indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose with an average difference of 0.8%. A gamma analysis Γ(3%, 3 mm) plan resulted in over 90% of calculated voxels satisfying Γ plan. The average number of voxels meeting the Γ plans was 97%. In terms of dose optimization/calculation efficiency, there was a 20-fold reduction in planning time with the new GPU system. The average optimization/dose calculation time utilizing the traditional CPU
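The gamma analysis Γ(3%, 3 mm) used in this comparison can be sketched in one dimension as follows. This is a simplified illustration under stated assumptions: the dose criterion is taken as a fraction of the prescription dose (global normalisation), both profiles share one grid, and the Gaussian test profile and all function names are hypothetical. It is not the clinical software's implementation.

```python
import math


def gamma_index_1d(positions_mm, dose_ref, dose_eval, dose_tol_frac, dta_mm, prescription):
    """Per-point gamma of dose_eval against dose_ref on a shared 1D grid."""
    dose_crit = dose_tol_frac * prescription   # e.g. 3% of the prescription dose
    gammas = []
    for r_ref, d_ref in zip(positions_mm, dose_ref):
        best = math.inf
        # Search every evaluated point for the smallest combined distance/dose deviation.
        for r_ev, d_ev in zip(positions_mm, dose_eval):
            dist_term = ((r_ev - r_ref) / dta_mm) ** 2
            dose_term = ((d_ev - d_ref) / dose_crit) ** 2
            best = min(best, math.sqrt(dist_term + dose_term))
        gammas.append(best)
    return gammas


def pass_rate(gammas):
    """Fraction of points with gamma <= 1, i.e. within tolerance."""
    return sum(g <= 1.0 for g in gammas) / len(gammas)


x = [float(i) for i in range(11)]                                  # 1 mm grid
ref = [100.0 * math.exp(-((xi - 5.0) / 3.0) ** 2) for xi in x]     # reference profile
ev = [d * 1.02 for d in ref]                                       # 2% scaling error
g = gamma_index_1d(x, ref, ev, dose_tol_frac=0.03, dta_mm=3.0, prescription=100.0)
rate = pass_rate(g)
```

A point passes when its gamma is at most 1, so a uniform 2% dose error passes everywhere under a 3% criterion; the pass percentages quoted in the abstract are this kind of fraction computed over voxels.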
Wan Chan Tseung, H; Ma, J; Beltran, C
2015-06-01
Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on graphics processing units (GPUs). However, these MCs usually use simplified models for nonelastic proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and nonelastic proton-nucleus collisions. Using the CUDA framework, the authors implemented GPU kernels for the following tasks: (1) simulation of beam spots from our possible scanning nozzle configurations, (2) proton propagation through CT geometry, taking into account nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) modeling of the intranuclear cascade stage of nonelastic interactions when they occur, (4) simulation of nuclear evaporation, and (5) statistical error estimates on the dose. To validate our MC, the authors performed (1) secondary particle yield calculations in proton collisions with therapeutically relevant nuclei, (2) dose calculations in homogeneous phantoms, (3) recalculations of complex head and neck treatment plans from a commercially available treatment planning system, and compared with GEANT4.9.6p2/TOPAS. Yields, energy, and angular distributions of secondaries from nonelastic collisions on various nuclei are in good agreement with the GEANT4.9.6p2 Bertini and Binary cascade models. The 3D-gamma pass rate at 2%-2 mm for treatment plan simulations is typically 98%. The net computational time on an NVIDIA GTX680 card, including all CPU-GPU data transfers, is ∼ 20 s for 1 × 10^7 proton histories. Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle nonelastic interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with GEANT4.9.6p2/TOPAS. Our MC is being integrated into a framework to perform fast routine clinical QA of pencil-beam based treatment plans, and is being used as the dose calculation engine in a clinically
Assessment of Efficiency and Performance in Tsunami Numerical Modeling with GPU
Yalciner, Bora; Zaytsev, Andrey
2017-04-01
Non-linear shallow water equations (NSWE) are used to solve the propagation and coastal amplification of long waves and tsunamis. The leap-frog finite difference scheme is one of the satisfactory numerical methods widely used for these problems. Tsunami numerical models are necessary for not only academic but also operational purposes, which need fast and accurate solutions. Recent developments in information technology provide considerably faster numerical solutions in this respect and are becoming one of the crucial requirements. The tsunami numerical code NAMI DANCE uses the finite difference method to solve linear and non-linear forms of the shallow water equations for long wave problems, specifically tsunamis. In this study, the new code is structured for the Graphical Processing Unit (GPU) using the CUDA API. The new code is tested on different (analytical, experimental and field) tsunami benchmark problems. One of those applications is the 2011 Great East Japan tsunami, which was instrumentally recorded on various types of gauges, including tide and wave gauges, offshore GPS buoys, cabled Ocean Bottom Pressure (OBP) gauges, and DART buoys. The results are compared with the measurements, and fairly good agreement is obtained. The efficiency and performance of the code are also compared with the version using a multi-core Central Processing Unit (CPU). The dependence of GPU simulation speed on linear or non-linear solutions is also investigated. One of the results is that the simulation speed is increased by up to 75 times compared to the processing time on a computer using a single 4/8-thread multi-core CPU. The results are presented with comparisons and discussions. Furthermore, how multi-dimensional finite difference problems fit the GPU architecture is also discussed. The research leading to this study has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement No
Perceptual Performance Impact of GPU-Based WARP and Anti-Aliasing for Image Generators
2016-06-29
Journal article. Dates covered: Feb 2016 – May 2016. Subject terms: ...generator, perception, geometric distortion, geometric warp, NVIDIA warp API, GPU warp, vision assessment. Authors: Logan Williams, Charles Bullock, Doug Traill, Kevin Amaratunga, James Gaska, Marc...
GPU-accelerated Block Matching Algorithm for Deformable Registration of Lung CT Images.
Li, Min; Xiang, Zhikang; Xiao, Liang; Castillo, Edward; Castillo, Richard; Guerrero, Thomas
2015-12-01
Deformable registration (DR) is a key technology in the medical field. However, many of the existing DR methods are time-consuming and the registration accuracy needs to be improved, which prevents their clinical applications. In this study, we propose a parallel block matching algorithm for lung CT image registration, in which the sum of squared difference metric is modified as the cost function and the moving least squares approach is used to generate the full displacement field. The algorithm is implemented on Graphic Processing Unit (GPU) with the Compute Unified Device Architecture (CUDA). Results show that the proposed parallel block matching method achieves a fast runtime while maintaining an average registration error (standard deviation) of 1.08 (0.69) mm.
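The core of block matching can be sketched as an exhaustive search for the shift that minimises the sum of squared differences (SSD) between a block in the fixed image and candidate blocks in the moving image. The sketch below uses the plain SSD on a small 2D image; the paper's modified SSD cost, its moving least squares interpolation and its CUDA parallelisation are not reproduced, and all names are hypothetical.

```python
def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized 2D blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def extract(image, top, left, size):
    """Copy the size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def match_block(fixed, moving, top, left, size, search):
    """Return the (dy, dx) shift within +/- search that minimises the SSD."""
    ref = extract(fixed, top, left, size)
    height, width = len(moving), len(moving[0])
    best_cost, best_shift = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue                      # candidate block leaves the image
            cost = ssd(ref, extract(moving, t, l, size))
            if best_cost is None or cost < best_cost:
                best_cost, best_shift = cost, (dy, dx)
    return best_shift


n = 8
fixed = [[float(3 * y + 5 * x * x + y * x) for x in range(n)] for y in range(n)]
# Moving image: the fixed image translated by 1 row and 2 columns, zero padded.
moving = [[fixed[y - 1][x - 2] if y >= 1 and x >= 2 else 0.0 for x in range(n)]
          for y in range(n)]
shift = match_block(fixed, moving, top=2, left=2, size=3, search=2)
```

Each block's search is independent of every other block's, which is what makes the method a natural fit for one-block-per-thread-group execution on a GPU.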
Peker, Musa; Şen, Baha; Gürüler, Hüseyin
2015-02-01
The effect of anesthesia on the patient is referred to as depth of anesthesia. Rapid classification of the appropriate depth level of anesthesia is a matter of great importance in surgical operations. Similarly, accelerating classification algorithms is important for the rapid solution of problems in the field of biomedical signal processing. However, numerous time-consuming mathematical operations are required during the training and testing stages of classification algorithms, especially neural networks. In this study, to accelerate the process, the Nvidia CUDA parallel programming and computing platform, which facilitates dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU), was utilized. The system was employed to detect the anesthetic depth level on a related electroencephalogram (EEG) data set, which is rather complex and large. Moreover, distinguishing more anesthetic levels with rapid response is critical in anesthesia. The proposed parallelization method yielded highly accurate classification results in less time.
Aerodynamic optimization of supersonic compressor cascade using differential evolution on GPU
Aissa, Mohamed Hasanine; Verstraete, Tom; Vuik, Cornelis
2016-06-01
Differential Evolution (DE) is a powerful stochastic optimization method. Compared to gradient-based algorithms, DE is able to avoid local minima but at the same time requires more function evaluations. In turbomachinery applications, function evaluations are performed with time-consuming CFD simulations, which results in a long, unaffordable design cycle. Modern High Performance Computing systems, especially Graphic Processing Units (GPUs), are able to alleviate this inconvenience by accelerating the design evaluation itself. In this work we present a validated CFD solver running on GPUs, able to accelerate the design evaluation and thus the entire design process. An achieved speedup of 20x to 30x enabled the DE algorithm to run on a high-end computer instead of a costly large cluster. The GPU-enhanced DE was used to optimize the aerodynamics of a supersonic compressor cascade, achieving an aerodynamic loss minimization of 20%.
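For readers unfamiliar with DE, the sketch below shows the classic DE/rand/1/bin variant on a cheap test function. It is an assumption that this variant resembles what the authors used; the abstract does not say. The CFD evaluation is replaced by a hypothetical `sphere` objective, and the parameter values are illustrative only.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f_weight=0.5,
                           cr=0.9, generations=200, seed=0):
    """Minimise `objective` with DE/rand/1/bin; returns (best_vector, best_cost)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [objective(ind) for ind in pop]          # one evaluation per member
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct members other than i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(dim)             # forces at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f_weight * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)         # clip to the search box
                else:
                    v = pop[i][j]                   # binomial crossover keeps parent gene
                trial.append(v)
            trial_cost = objective(trial)
            if trial_cost <= cost[i]:               # greedy selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


def sphere(x):
    """Stand-in objective; in the paper's setting this would be a CFD evaluation."""
    return sum(v * v for v in x)


best_x, best_cost = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
```

With these settings the loop calls `objective` 20 + 20 × 200 = 4020 times, which illustrates why the cost of a single evaluation dominates the design cycle when each call is a CFD run.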
Gpufit: An open-source toolkit for GPU-accelerated curve fitting.
Przybylski, Adrian; Thiel, Björn; Keller-Findeisen, Jan; Stock, Bernd; Bates, Mark
2017-11-16
We present a general purpose, open-source software library for estimation of non-linear parameters by the Levenberg-Marquardt algorithm. The software, Gpufit, runs on a Graphics Processing Unit (GPU) and executes computations in parallel, resulting in a significant gain in performance. We measured a speed increase of up to 42 times when comparing Gpufit with an identical CPU-based algorithm, with no loss of precision or accuracy. Gpufit is designed such that it is easily incorporated into existing applications or adapted for new ones. Multiple software interfaces, including to C, Python, and Matlab, ensure that Gpufit is accessible from most programming environments. The full source code is published as an open source software repository, making its function transparent to the user and facilitating future improvements and extensions. As a demonstration, we used Gpufit to accelerate an existing scientific image analysis package, yielding significantly improved processing times for super-resolution fluorescence microscopy datasets.
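The Levenberg-Marquardt iteration itself can be sketched for a single two-parameter fit. The code below is not Gpufit's API or implementation; the function name `fit_exponential_lm`, the exponential model and the damping schedule are assumptions chosen to keep the example short. A GPU library would run many such independent fits in parallel.

```python
import math


def fit_exponential_lm(xs, ys, a0, b0, iterations=50, lam=1e-3):
    """Fit y = a * exp(-b * x) with a basic Levenberg-Marquardt loop."""
    def chi2(a, b):
        return sum((y - a * math.exp(-b * x)) ** 2 for x, y in zip(xs, ys))

    a, b = a0, b0
    current = chi2(a, b)
    for _ in range(iterations):
        # Accumulate J^T J (2x2) and J^T r, where r = y - model.
        jtj = [[0.0, 0.0], [0.0, 0.0]]
        jtr = [0.0, 0.0]
        for x, y in zip(xs, ys):
            e = math.exp(-b * x)
            r = y - a * e
            grad = (e, -a * x * e)            # d(model)/da, d(model)/db
            for p in range(2):
                jtr[p] += grad[p] * r
                for q in range(2):
                    jtj[p][q] += grad[p] * grad[q]
        # Damped normal equations: (J^T J + lam * diag(J^T J)) delta = J^T r
        m00 = jtj[0][0] * (1.0 + lam)
        m11 = jtj[1][1] * (1.0 + lam)
        m01 = jtj[0][1]
        det = m00 * m11 - m01 * m01
        if det == 0.0:
            break
        da = (m11 * jtr[0] - m01 * jtr[1]) / det
        db = (m00 * jtr[1] - m01 * jtr[0]) / det
        trial = chi2(a + da, b + db)
        if trial < current:                   # accept the step, trust the model more
            a, b, current = a + da, b + db, trial
            lam *= 0.1
        else:                                 # reject the step, increase damping
            lam *= 10.0
    return a, b


xs = [0.5 * i for i in range(7)]
ys = [2.0 * math.exp(-1.5 * x) for x in xs]   # noise-free synthetic data
a_fit, b_fit = fit_exponential_lm(xs, ys, a0=1.0, b0=1.0)
```

The damping factor `lam` blends between a Gauss-Newton step (small `lam`) and a short gradient-descent-like step (large `lam`), which is the defining feature of the method.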
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Energy Technology Data Exchange (ETDEWEB)
Ronald Babich, Michael Clark, Balint Joo
2010-11-01
Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
International Nuclear Information System (INIS)
Babich, Ronald; Clark, Michael; Joo, Balint
2010-01-01
Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the '9g' cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
A massively parallel GPU-accelerated model for analysis of fully nonlinear free surface waves
DEFF Research Database (Denmark)
Engsig-Karup, Allan Peter; Madsen, Morten G.; Glimberg, Stefan Lemvig
2011-01-01
We implement and evaluate a massively parallel and scalable algorithm based on a multigrid preconditioned Defect Correction method for the simulation of fully nonlinear free surface flows. The simulations are based on a potential model that describes wave propagation over uneven bottoms in three space dimensions and is useful for fast analysis and prediction purposes in coastal and offshore engineering. A dedicated numerical model based on the proposed algorithm is executed in parallel by utilizing affordable modern special purpose graphics processing unit (GPU). The model is based on a low-storage flexible-order accurate finite difference method that is known to be efficient and scalable on a CPU core (single thread). To achieve parallel performance of the relatively complex numerical model, we investigate a new trend in high-performance computing where many-core GPUs are utilized as high...
GPU-accelerated few-view CT reconstruction using the OSC and TV techniques
Energy Technology Data Exchange (ETDEWEB)
Matenine, Dmitri [Montreal Univ., QC (Canada). Dept. de Physique; Hissoiny, Sami [Ecole Polytechnique de Montreal, QC (Canada). Dept. de Genie Informatique et Genie Logiciel; Despres, Philippe [Centre Hospitalier Univ. de Quebec, QC (Canada). Dept. de Radio-Oncologie
2011-07-01
The present work proposes a promising iterative reconstruction technique designed specifically for X-ray transmission computed tomography (CT). The main objective is to reduce diagnostic radiation dose through the reduction of the number of CT projections, while preserving image quality. The second objective is to provide a fast implementation compatible with clinical activities. The proposed tomographic reconstruction technique is a combination of the Ordered Subsets Convex (OSC) algorithm and the Total Variation minimization (TV) regularization technique. The results in terms of image quality and computational speed are discussed. Using this technique, it was possible to obtain reconstructed slices of relatively good quality with as few as 100 projections, leading to potential dose reduction factors of up to an order of magnitude depending on the application. The algorithm was implemented on a Graphical Processing Unit (GPU) and yielded reconstruction times of approximately 185 ms per slice. (orig.)
GPU v. B and W lawsuit review and its effect on TMI-1 (Docket 50-289)
International Nuclear Information System (INIS)
1983-09-01
This report documents a review by the Nuclear Regulatory Commission (NRC) staff of the General Public Utilities Corporation, et al. v. the Babcock and Wilcox Company, et al. (GPU v. B and W) lawsuit record to assess whether any of the staff's previous conclusions or their principal bases presented at the Three Mile Island Unit 1 (TMI-1) restart hearing, supporting restart of TMI-1, should be amended in light of the information contained in the lawsuit record. Details of the lawsuit record are provided in the appendices contained in Volume II of this report.
Pelletier, Mathew G
2008-02-08
One of the main hurdles standing in the way of optimal cleaning of cotton lint is the lack of sensing systems that can react fast enough to provide the control system with real-time information as to the level of trash contamination of the cotton lint. This research examines the use of programmable graphic processing units (GPU) as an alternative to the PC's traditional use of the central processing unit (CPU). The use of the GPU, as an alternative computation platform, allowed for the machine vision system to gain a significant improvement in processing time. By improving the processing time, this research seeks to address the lack of availability of rapid trash sensing systems and thus alleviate a situation in which the current systems view the cotton lint either well before, or after, the cotton is cleaned. This extended lag/lead time that is currently imposed on the cotton trash cleaning control systems, is what is responsible for system operators utilizing a very large dead-band safety buffer in order to ensure that the cotton lint is not undercleaned. Unfortunately, the utilization of a large dead-band buffer results in the majority of the cotton lint being over-cleaned which in turn causes lint fiber-damage as well as significant losses of the valuable lint due to the excessive use of cleaning machinery. This research estimates that upwards of a 30% reduction in lint loss could be gained through the use of a tightly coupled trash sensor to the cleaning machinery control systems. This research seeks to improve processing times through the development of a new algorithm for cotton trash sensing that allows for implementation on a highly parallel architecture. Additionally, by moving the new parallel algorithm onto an alternative computing platform, the graphic processing unit "GPU", for processing of the cotton trash images, a speed up of over 6.5 times, over optimized code running on the PC's central processing unit "CPU", was gained. The new parallel algorithm operating on the
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
Modern graphics processing units (GPUs) can be effectively used to solve physical systems. To use the GPU optimally, the discretization of the physical system is often restricted to a regular grid. When grid values represent spatial positions, a direct visualization can result in a jagged appearance. In this paper we propose to decouple computation and visualization of such systems. We define mappings that enable the deformation of a high-resolution surface based on a physical simulation on a lower resolution uniform grid. More specifically we investigate new approaches for the visualization of a GPU based...
Resolution of the Vlasov-Maxwell system by PIC discontinuous Galerkin method on GPU with OpenCL
Directory of Open Access Journals (Sweden)
Crestetto Anaïs
2013-01-01
We present an implementation of a Vlasov-Maxwell solver for multicore processors. The Vlasov equation describes the evolution of charged particles in an electromagnetic field, which is itself the solution of the Maxwell equations. The Vlasov equation is solved by a Particle-In-Cell (PIC) method, while the Maxwell system is computed by a Discontinuous Galerkin method. We use the OpenCL framework, which allows our code to run on multicore processors or recent Graphics Processing Units (GPUs). We present several numerical applications to two-dimensional test cases.
Energy Technology Data Exchange (ETDEWEB)
Adom, Felix K. [Argonne National Lab. (ANL), Argonne, IL (United States). Energy Systems Division; Dunn, Jennifer B. [Argonne National Lab. (ANL), Argonne, IL (United States). Energy Systems Division; Han, Jeongwoo [Argonne National Lab. (ANL), Argonne, IL (United States). Energy Systems Division
2014-09-01
A wide range of biofuels and biochemicals can be produced from cellulosic biomass via different pretreatment technologies that yield sugars. Process simulations of dilute acid and ammonia fiber expansion pretreatment processes and subsequent hydrolysis were developed in Aspen Plus for four lignocellulosic feedstocks (corn stover, miscanthus, switchgrass, and poplar). This processing yields sugars that can be subsequently converted to biofuels or biochemicals. Material and energy consumption data from Aspen Plus were then compiled in a new Greenhouse Gases, Regulated Emissions, and Energy Use in Transportation (GREET™) pretreatment module. The module estimates the cradle-to-gate fossil energy consumption (FEC) and greenhouse gas (GHG) emissions associated with producing fermentable sugars. This report documents the data and methodology used to develop this module and the cradle-to-gate FEC and GHG emissions that result from producing fermentable sugars.
Nagaoka, Tomoaki; Watanabe, Soichi
2012-01-01
Electromagnetic simulation with anatomically realistic computational human models using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes. Seven GPU boards (NVIDIA Tesla C2070) are mounted on each node. We examined the performance of the FDTD calculation in the multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU cluster is faster than that on a multi-GPU single workstation, and we also found that the GPU cluster system calculates faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform large-scale FDTD calculations because we were able to use over 100 GB of GPU memory.
Multi-GPU hybrid programming accelerated three-dimensional phase-field model in binary alloy
Zhu, Changsheng; Liu, Jieqiong; Zhu, Mingfang; Feng, Li
2018-03-01
In dendritic growth simulation, computational efficiency and problem scale strongly influence the simulation efficiency of the three-dimensional phase-field model. Seeking a high-performance calculation method that improves computational efficiency and expands the problem scale is therefore of great significance to research on material microstructure. A high-performance calculation method based on the MPI+CUDA hybrid programming model is introduced. Multiple GPUs are used to implement quantitative numerical simulations of the three-dimensional phase-field model in a binary alloy under the coupling of multiple physical processes. The acceleration effect of different GPU nodes at different calculation scales is explored. Building on the multi-GPU calculation model, two optimization schemes are proposed: non-blocking communication optimization, and optimization by overlapping MPI communication with GPU computing. The results of the two optimization schemes and the basic multi-GPU model are compared. The calculation results show that the multi-GPU calculation model clearly improves the computational efficiency of the three-dimensional phase-field model, reaching 13 times that of a single GPU, and the problem scales have been expanded to 8193. The feasibility of the two optimization schemes is shown, and the overlap of MPI and GPU computing performs better, reaching 1.7 times the performance of the basic multi-GPU model when 21 GPUs are used.
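The domain decomposition behind a multi-GPU stencil code can be sketched in a few lines. The example below is only an analogy under stated assumptions: it uses a 1-D explicit diffusion update in plain Python rather than the phase-field equations, two list halves stand in for two GPUs, and the names `step`, `run_single`, and `run_split` are hypothetical. It shows that exchanging one boundary ("halo") value per step makes the split computation match the single-domain one.

```python
def step(u, left, right, alpha=0.25):
    """One explicit diffusion update of u, given ghost values at both ends."""
    padded = [left] + u + [right]
    return [padded[i] + alpha * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
            for i in range(1, len(padded) - 1)]


def run_single(u, steps):
    """Reference run on one domain with fixed zero boundaries."""
    for _ in range(steps):
        u = step(u, 0.0, 0.0)
    return u


def run_split(u, steps):
    """Same computation on two subdomains with a halo exchange every step."""
    half = len(u) // 2
    a, b = u[:half], u[half:]
    for _ in range(steps):
        # Halo exchange: each subdomain receives its neighbor's edge value.
        halo_for_a, halo_for_b = b[0], a[-1]
        a = step(a, 0.0, halo_for_a)
        b = step(b, halo_for_b, 0.0)
    return a + b


u0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0]
print(run_single(u0, 5))
print(run_split(u0, 5))
```

In an MPI+CUDA setting the exchange would typically be posted as non-blocking sends and receives while interior points are updated, which is the kind of overlap the abstract reports as the better-performing scheme.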
A New Parallel Approach for Accelerating the GPU-Based Execution of Edge Detection Algorithms.
Emrani, Zahra; Bateni, Soroosh; Rabbani, Hossein
2017-01-01
Real-time image processing is used in a wide variety of applications like those in medical care and industrial processes. This technique in medical care has the ability to display important patient information graphically, which can supplement and help the treatment process. Medical decisions made based on real-time images are more accurate and reliable. According to recent research, graphic processing unit (GPU) programming is a useful method for improving the speed and quality of medical image processing and is one of the ways of real-time image processing. Edge detection is an early stage in most of the image processing methods for the extraction of features and object segments from a raw image. The Canny method, Sobel and Prewitt filters, and the Roberts' Cross technique are some examples of edge detection algorithms that are widely used in image processing and machine vision. In this work, these algorithms are implemented using the Compute Unified Device Architecture (CUDA), Open Source Computer Vision (OpenCV), and Matrix Laboratory (MATLAB) platforms. An existing parallel method for the Canny approach has been modified further to run in a fully parallel manner. This has been achieved by replacing the breadth-first search procedure with a parallel method. These algorithms have been compared by testing them on a database of optical coherence tomography images. The comparison of results shows that the proposed implementation of the Canny method on GPU using the CUDA platform improves the speed of execution by 2-100× compared to the central processing unit-based implementation using the OpenCV and MATLAB platforms.
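To make the edge-detection step concrete, here is a minimal Sobel filter in plain Python. It is a CPU sketch, not the CUDA implementation described in the abstract; the function name `sobel_magnitude` and the 4 × 4 test image are assumptions for illustration. Each output pixel depends only on its own 3 × 3 neighborhood, which is what makes the filter map naturally onto one GPU thread per pixel.

```python
import math


def sobel_magnitude(img):
    """Gradient magnitude for interior pixels; border pixels are left at 0."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal derivative
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical derivative
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = sum(kx[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(ky[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            out[r][c] = math.hypot(gx, gy)
    return out


# Hypothetical test image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel_magnitude(image)
print(edges[1])
```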
Temperature Modelling of the Biomass Pretreatment Process
DEFF Research Database (Denmark)
Prunescu, Remus Mihail; Blanke, Mogens; Jensen, Jakob M.
2012-01-01
In a second generation biorefinery, the biomass pretreatment stage has an important contribution to the efficiency of the downstream processing units involved in biofuel production. Most of the pretreatment process occurs in a large pressurized thermal reactor that presents an irregular temperature distribution. Therefore, an accurate temperature model is critical for observing the biomass pretreatment. More than that, the biomass is also pushed with a constant horizontal speed along the reactor in order to ensure a continuous throughput. The goal of this paper is to derive a temperature model that captures the environmental temperature differences inside the reactor using distributed parameters. A Kalman filter is then added to account for any missing dynamics and the overall model is embedded into a temperature soft sensor. The operator of the plant will be able to observe the temperature in any...
Engineering a static verification tool for GPU kernels
Bardsley, E; Betts, A; Chong, N; Collingbourne, P; Deligiannis, P; Donaldson, AF; Ketema, J; Liew, D; Qadeer, S
2014-01-01
We report on practical experiences over the last 2.5 years related to the engineering of GPUVerify, a static verification tool for OpenCL and CUDA GPU kernels, plotting the progress of GPUVerify from a prototype to a fully functional and relatively efficient analysis tool. Our hope is that this experience report will serve the verification community by helping to inform future tooling efforts. © 2014 Springer International Publishing.
Particle-based Powder-snow Avalanche Simulation Using GPU
Yndestad, Leif Kåre Hornnes
2011-01-01
The main focus of this thesis was the simulation of a powder-snow avalanche flow. The simulation was implemented using the particle-based simulation method SPH, from a mathematical model describing powder-snow flow dynamics. The simulation was accelerated by applying the computational power of the GPU, in order to provide a faster simulation time than would have been achieved on the CPU.
Bédorf, Jeroen; Gaburov, Evghenii; Portegies Zwart, Simon
2012-12-01
Bonsai is a gravitational N-body tree-code that runs completely on the GPU. This reduces the amount of time spent on communication with the CPU. The code runs on NVIDIA GPUs and on a GTX480 it is able to integrate 2.8M particles per second. The tree construction and traverse algorithms are portable to many-core devices which have support for CUDA or OpenCL programming languages.
Accelerating SELECT WHERE and SELECT JOIN Queries on a GPU
Directory of Open Access Journals (Sweden)
Marcin Pietron
2013-01-01
This paper presents implementations of a few selected SQL operations using the CUDA programming framework on the GPU platform. Nowadays, the GPU’s parallel architectures give a high speed-up on certain problems. Therefore, the number of non-graphical problems that can be run and sped-up on the GPU still increases. Especially, there has been a lot of research in data mining on GPUs. In many cases it proves the advantage of offloading processing from the CPU to the GPU. At the beginning of our project we chose the set of SELECT WHERE and SELECT JOIN instructions as the most common operations used in databases. We parallelized these SQL operations using three main mechanisms in CUDA: thread group hierarchy, shared memories, and barrier synchronization. Our results show that the implemented highly parallel SELECT WHERE and SELECT JOIN operations on the GPU platform can be significantly faster than the sequential one in a database system run on the CPU.
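One common way to express a SELECT WHERE filter in data-parallel form is stream compaction: evaluate the predicate per row, take a prefix sum of the resulting flags, and scatter matching rows to their output slots. The plain-Python sketch below follows that pattern; it is an assumption about how such a filter might be structured, not the kernel described in the abstract, and `select_where` and the sample table are hypothetical.

```python
def select_where(rows, predicate):
    """Filter rows using the map / prefix-sum / scatter pattern."""
    # Step 1 (map): one flag per row, each computable independently.
    flags = [1 if predicate(row) else 0 for row in rows]
    # Step 2 (exclusive prefix sum): output slot for each matching row.
    offsets, total = [], 0
    for flag in flags:
        offsets.append(total)
        total += flag
    # Step 3 (scatter): each matching row writes to its own slot.
    result = [None] * total
    for i, row in enumerate(rows):
        if flags[i]:
            result[offsets[i]] = row
    return result


# Hypothetical table of (id, name, value) tuples.
table = [(1, "a", 30), (2, "b", 12), (3, "c", 45), (4, "d", 7)]
selected = select_where(table, lambda row: row[2] > 20)
print(selected)
```

On a GPU, steps 1 and 3 would run with one thread per row, and the prefix sum would use shared memory and barrier synchronization within each thread block.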
A Kepler Workflow Tool for Reproducible AMBER GPU Molecular Dynamics.
Purawat, Shweta; Ieong, Pek U; Malmstrom, Robert D; Chan, Garrett J; Yeung, Alan K; Walker, Ross C; Altintas, Ilkay; Amaro, Rommie E
2017-06-20
With the drive toward high throughput molecular dynamics (MD) simulations involving ever-greater numbers of simulation replicates run for longer, biologically relevant timescales (microseconds), the need for improved computational methods that facilitate fully automated MD workflows gains more importance. Here we report the development of an automated workflow tool to perform AMBER GPU MD simulations. Our workflow tool capitalizes on the capabilities of the Kepler platform to deliver a flexible, intuitive, and user-friendly environment and the AMBER GPU code for a robust and high-performance simulation engine. Additionally, the workflow tool reduces user input time by automating repetitive processes and facilitates access to GPU clusters, whose high-performance processing power makes simulations of large numerical scale possible. The presented workflow tool facilitates the management and deployment of large sets of MD simulations on heterogeneous computing resources. The workflow tool also performs systematic analysis on the simulation outputs and enhances simulation reproducibility, execution scalability, and MD method development including benchmarking and validation. Copyright © 2017 Biophysical Society. Published by Elsevier Inc. All rights reserved.
Study on GPU Computing for SCOPE2 with CUDA
International Nuclear Information System (INIS)
Kodama, Yasuhiro; Tatsumi, Masahiro; Ohoka, Yasunori
2011-01-01
For improving safety and cost effectiveness of nuclear power plants, a core calculation code SCOPE2 has been developed, which adopts detailed calculation models such as the multi-group nodal SP3 transport calculation method in three-dimensional pin-by-pin geometry to achieve high predictability. However, it is difficult to apply the code to loading pattern optimizations since it requires much longer computation time than that of codes based on the nodal diffusion method, which is widely used in core design calculations. In this study, we studied the possibility of accelerating SCOPE2 with GPU computing, which has been recognized as one of the most promising directions in high performance computing. In the previous study, which used an experimental programming framework, converting the algorithms into forms suited to GPU computation proved tremendously difficult because of the complexity of the algorithms and restrictions in implementation. In this study, to overcome this complexity, we utilized the CUDA programming environment provided by NVIDIA, a versatile and flexible extension to the C/C++ languages. Test implementations of GPU kernels for neutron diffusion/simplified P3 equation solvers confirmed that high performance can be achieved without degrading maintainability. (author)
GPU Lossless Hyperspectral Data Compression System for Space Applications
Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
2012-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA™. The GPU implementation on an NVIDIA™ GeForce™ GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel™ Xeon™ processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
Bin recycling strategy for improving the histogram precision on GPU
Cárdenas-Montes, Miguel; Rodríguez-Vázquez, Juan José; Vega-Rodríguez, Miguel A.
2016-07-01
A histogram is an easily comprehensible way to present data and analyses. In the current scientific context, with access to large volumes of data, the processing time for building histograms has dramatically increased. For this reason, parallel construction is necessary to alleviate the impact of the processing time on analysis activities. In this scenario, GPU computing is becoming widely used for reducing the processing time of histogram construction to affordable levels. Along with the increase in processing time, implementations come under stress with respect to bin-count accuracy. Accuracy aspects due to the particularities of the implementations are not usually taken into consideration when building histograms with very large data sets. In this work, a bin recycling strategy to create an accuracy-aware implementation for building histograms on GPU is presented. In order to evaluate the approach, this strategy was applied to the computation of the three-point angular correlation function, which is a relevant function in Cosmology for the study of the Large Scale Structure of the Universe. As a consequence of the study, a high-accuracy implementation for histogram construction on GPU is proposed.
THEWASP library. Thermodynamic water and steam properties library in GPU
International Nuclear Information System (INIS)
Waintraub, M.; Lapa, C.M.F.; Mol, A.C.A.; Heimlich, A.
2011-01-01
In this paper we present a new library for thermodynamic evaluation of water properties, THEWASP. This library consists of C++ and CUDA based programs used to accelerate function evaluation using GPUs and GPU clusters. Global optimization problems need thousands of evaluations of the objective function to find the global optimum, implying several days of expensive processing. This problem motivates us to seek a way to speed up our code, as well as to use MPI on Beowulf clusters, which however increases the cost in terms of electricity, air conditioning and others. GPU based programming can accelerate the implementation up to 100 times and help increase the number of evaluations in global optimization problems using, for example, the PSO or DE algorithms. THEWASP is based on water-steam formulations published by the International Association for the Properties of Water and Steam, Lucerne, Switzerland, and provides several temperature and pressure function evaluations, such as specific heat, specific enthalpy, specific entropy and also some inverse maps. In this study we evaluated the gain in speed and performance and compared it with a CPU based processing library. (author)
Directory of Open Access Journals (Sweden)
Wang Kai
2011-05-01
Background: Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings: Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: (1) the interaction of SNPs within it in parallel, and (2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions: GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
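The fragment-based partitioning described in this abstract can be sketched as follows. This is an illustrative enumeration in plain Python, not GENIE's code: `fragment_pairs` is a hypothetical name, SNPs are represented by integer indices, and pairing each fragment only with later fragments is an assumption made here so that every SNP pair is visited exactly once.

```python
from itertools import combinations


def fragment_pairs(num_snps, fragment_size):
    """Enumerate all SNP pairs by walking non-overlapping fragments."""
    fragments = [list(range(start, min(start + fragment_size, num_snps)))
                 for start in range(0, num_snps, fragment_size)]
    pairs = []
    for i, frag in enumerate(fragments):
        # (1) Interactions among SNPs inside the current fragment.
        pairs.extend(combinations(frag, 2))
        # (2) Interactions between this fragment and each later fragment.
        for other in fragments[i + 1:]:
            pairs.extend((a, b) for a in frag for b in other)
    return pairs


pairs = fragment_pairs(num_snps=7, fragment_size=3)
print(len(pairs))
```

Because every pair inside a work item is independent of the others, each block of pairs could be handed to a separate GPU thread block or CPU core.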
System matrix computation vs storage on GPU: A comparative study in cone beam CT.
Matenine, Dmitri; Côté, Geoffroi; Mascolo-Fortin, Julia; Goussard, Yves; Després, Philippe
2018-02-01
Iterative reconstruction algorithms in computed tomography (CT) require a fast method for computing the intersection distances between the trajectories of photons and the object, also called ray tracing or system matrix computation. This work, focused on the thin-ray model, is aimed at comparing different system matrix handling strategies using graphical processing units (GPUs). In this work, the system matrix is modeled by thin rays intersecting a regular grid of box-shaped voxels, known to be an accurate representation of the forward projection operator in CT. However, an uncompressed system matrix exceeds the random access memory (RAM) capacities of typical computers by one order of magnitude or more. Considering the RAM limitations of GPU hardware, several system matrix handling methods were compared: full storage of a compressed system matrix, on-the-fly computation of its coefficients, and partial storage of the system matrix with partial on-the-fly computation. These methods were tested on geometries mimicking a cone beam CT (CBCT) acquisition of a human head. Execution times of three routines of interest were compared: forward projection, backprojection, and ordered-subsets convex (OSC) iteration. A fully stored system matrix yielded the shortest backprojection and OSC iteration times, with a 1.52× acceleration for OSC when compared to the on-the-fly approach. Nevertheless, the maximum problem size was bound by the available GPU RAM and geometrical symmetries. On-the-fly coefficient computation did not require symmetries and was shown to be the fastest for forward projection. It also offered reasonable execution times of about 176.4 ms per view per OSC iteration for a detector of 512 × 448 pixels and a volume of 384^3 voxels, using commodity GPU hardware. Partial system matrix storage has shown a performance similar to the on-the-fly approach, while still relying on symmetries. Partial system matrix storage was shown to yield the lowest relative
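The "intersection distances" that make up a thin-ray system matrix can be illustrated with a small 2-D example. The sketch below uses a parametric, Siddon-style approach in plain Python; that choice, the unit pixel size, and the name `ray_pixel_lengths` are assumptions for illustration, since the abstract does not state which ray-tracing algorithm the authors implemented.

```python
import math


def ray_pixel_lengths(p0, p1, nx, ny):
    """Intersection lengths of segment p0->p1 with unit pixels of an nx-by-ny grid."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    # Parameters in [0, 1] where the ray crosses a vertical or horizontal grid line.
    alphas = {0.0, 1.0}
    if dx != 0:
        alphas.update((i - x0) / dx for i in range(nx + 1))
    if dy != 0:
        alphas.update((j - y0) / dy for j in range(ny + 1))
    alphas = sorted(a for a in alphas if 0.0 <= a <= 1.0)
    weights = {}
    for a, b in zip(alphas, alphas[1:]):
        # The midpoint of each sub-segment identifies the pixel it lies in.
        mid = 0.5 * (a + b)
        i, j = math.floor(x0 + mid * dx), math.floor(y0 + mid * dy)
        if 0 <= i < nx and 0 <= j < ny:
            weights[(i, j)] = weights.get((i, j), 0.0) + (b - a) * length
    return weights


# One row of a hypothetical system matrix: a horizontal ray through a 3 x 2 grid.
row = ray_pixel_lengths((0.0, 0.5), (3.0, 0.5), nx=3, ny=2)
print(row)
```

Computing such a row on the fly trades arithmetic for memory, whereas storing it trades memory for speed, which is the comparison the abstract describes.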
A GPU-based symmetric non-rigid image registration method in human lung.
Haghighi, Babak; D Ellingwood, Nathan; Yin, Youbing; Hoffman, Eric A; Lin, Ching-Long
2018-03-01
Quantitative computed tomography (QCT) of the lungs plays an increasing role in identifying sub-phenotypes of pathologies previously lumped into broad categories such as chronic obstructive pulmonary disease and asthma. Methods for image matching and linking multiple lung volumes have proven useful in linking structure to function and in the identification of regional longitudinal changes. Here, we seek to improve the accuracy of image matching via the use of a symmetric multi-level non-rigid registration employing an inverse consistent (IC) transformation whereby images are registered both in the forward and reverse directions. To develop the symmetric method, two similarity measures, the sum of squared intensity difference (SSD) and the sum of squared tissue volume difference (SSTVD), were used. The method is based on a novel generic mathematical framework to include forward and backward transformations, simultaneously, eliminating the need to compute the inverse transformation. Two implementations were used to assess the proposed method: a two-dimensional (2-D) implementation using synthetic examples with SSD, and a multi-core CPU and graphics processing unit (GPU) implementation with SSTVD for three-dimensional (3-D) human lung datasets (six normal adults studied at total lung capacity (TLC) and functional residual capacity (FRC)). Success was evaluated in terms of the IC transformation consistency serving to link TLC to FRC. 2-D registration on synthetic images, using both symmetric and non-symmetric SSD methods, and comparison of displacement fields showed that the symmetric method gave a symmetrical grid shape and reduced IC errors, with the mean values of IC errors decreased by 37%. Results for both symmetric and non-symmetric transformations of human datasets showed that the symmetric method gave better results for IC errors in all cases, with mean values of IC errors for the symmetric method lower than the non-symmetric methods using both SSD and SSTVD
Advances in aluminum pretreatment
Energy Technology Data Exchange (ETDEWEB)
Sudour, Michel; Maintier, Philippe [PPG Industries France, 3 Z.A.E. Les Dix Muids, B.P. 89, F-59583 Marly (France); Simpson, Mark [PPG Industries Inc., 1200 Piedmont Troy, Michigan 48083 (United States); Quaglia, Paolo [PPG Industries Italia, Via Garavelli 21, I-15028 Quattordio (Italy)
2004-07-01
As automotive manufacturers continue to look for ways to reduce vehicle weight, aluminum is finding more utility as a body panel component. The substitution of cold-rolled steel and zinc-coated substrates with aluminum has led to new challenges in vehicle pretreatment. As a result, changes to traditional pretreatment chemistries and operating practices are necessary in order to produce an acceptable coating on aluminum body panels. These changes result in increased sludging and other undesirable characteristics. In addition to the chemistry changes, there are also process-related problems to consider. Many existing automotive pretreatment lines simply were not designed to handle aluminum and its increased demands on filtration and circulation equipment. To retrofit such a system is capital intensive and, in addition to requiring a significant amount of downtime, may not be totally effective. Thus, the complexities of pre-treating aluminum body panels have actually had a negative effect on efforts to introduce more aluminum into new vehicle design programs. Recent research into ways of reducing the negative effects has led to a new understanding of the nature of zinc phosphate bath-aluminum interactions. Many of the issues associated with the pretreatment of aluminum have been identified and can be mitigated with only minor changes to the zinc phosphate bath chemistry. The use of low levels of soluble Fe ions, together with free fluoride, has been shown to dramatically improve the efficiency of a zinc phosphate system processing aluminum. Appearance of zinc phosphate coatings, coating weights and sludge are all benefited by this chemistry change. (authors)
GPU-Monte Carlo based fast IMRT plan optimization
Directory of Open Access Journals (Sweden)
Yongbao Li
2014-03-01
Purpose: Intensity-modulated radiation treatment (IMRT) plan optimization needs pre-calculated beamlet dose distributions. Pencil-beam or superposition/convolution type algorithms are typically used because of their high computation speed. However, inaccurate beamlet dose distributions, particularly in cases with high levels of inhomogeneity, may mislead the optimization and degrade the resulting plan quality. It is desirable to use Monte Carlo (MC) methods for beamlet dose calculations. Yet, the long computational time from repeated dose calculations for a number of beamlets prevents this application. It is our objective to integrate a GPU-based MC dose engine in lung IMRT optimization using a novel two-step workflow. Methods: A GPU-based MC code, gDPM, is used. Each particle is tagged with the index of the beamlet from which the source particle originates. Deposited doses are stored separately for each beamlet based on the index. Due to limited GPU memory size, a pyramid space is allocated for each beamlet, and dose outside the space is neglected. A two-step optimization workflow is proposed for fast MC-based optimization. In the first step, a rough dose calculation is conducted with only a small number of particles per beamlet. Plan optimization follows to get an approximate fluence map. In the second step, more accurate beamlet doses are calculated, where the sampled number of particles for a beamlet is proportional to the intensity determined previously. A second-round optimization is conducted, yielding the final result. Results: For a lung case with 5317 beamlets, 10^5 particles per beamlet in the first round and 10^8 particles per beam in the second round are enough to get a good plan quality. The total simulation time is 96.4 sec. Conclusion: A fast GPU-based MC dose calculation method along with a novel two-step optimization workflow are developed. The high efficiency allows the use of MC for IMRT optimizations. Cite this article as: Li Y, Tian Z
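The second step of the workflow, where the number of particles simulated for each beamlet is proportional to its first-pass intensity, amounts to a proportional budget split. The sketch below shows that arithmetic in plain Python; the function name `allocate_particles`, the four-beamlet fluence map, and the total budget are all hypothetical values chosen for illustration, not figures from the paper.

```python
def allocate_particles(fluence, total_particles):
    """Split a particle budget across beamlets in proportion to their intensity."""
    total_fluence = sum(fluence)
    if total_fluence <= 0:
        raise ValueError("fluence map must contain a positive intensity")
    return [round(total_particles * f / total_fluence) for f in fluence]


# Hypothetical approximate fluence map from the first optimization round.
first_pass = [0.0, 2.0, 1.0, 5.0]
counts = allocate_particles(first_pass, 8_000_000)
print(counts)
```

Beamlets that the first round drives to zero intensity receive no particles in the second round, so the simulation effort is concentrated where it affects the final dose.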
GPU-based Integration with Application in Sensitivity Analysis
Atanassov, Emanouil; Ivanovska, Sofiya; Karaivanova, Aneta; Slavov, Dimitar
2010-05-01
The presented work is an important part of the grid application MCSAES (Monte Carlo Sensitivity Analysis for Environmental Studies), whose aim is to develop an efficient Grid implementation of a Monte Carlo based approach for sensitivity studies in the domains of Environmental modelling and Environmental security. The goal is to study the damaging effects that can be caused by high pollution levels (especially effects on human health), when the main modeling tool is the Danish Eulerian Model (DEM). Generally speaking, sensitivity analysis (SA) is the study of how the variation in the output of a mathematical model can be apportioned to, qualitatively or quantitatively, different sources of variation in the input of a model. One of the important classes of methods for Sensitivity Analysis is Monte Carlo based, first proposed by Sobol, and then developed by Saltelli and his group. In MCSAES the general Saltelli procedure has been adapted for SA of the Danish Eulerian model. In our case we consider as factors the constants determining the speeds of the chemical reactions in the DEM and as output a certain aggregated measure of the pollution. Sensitivity simulations lead to huge computational tasks (systems with up to 4 × 10^9 equations at every time-step, and the number of time-steps can be more than a million), which motivates its grid implementation. The MCSAES grid implementation scheme includes two main tasks: (i) Grid implementation of the DEM, (ii) Grid implementation of the Monte Carlo integration. In this work we present our new developments in the integration part of the application. We have developed an algorithm for GPU-based generation of scrambled quasirandom sequences which can be combined with the CPU-based computations related to the SA. Owen first proposed scrambling of Sobol sequence through permutation in a manner that improves the convergence rates. Scrambling is necessary not only for error analysis but for parallel implementations. Good scrambling is
MPC Toolbox with GPU Accelerated Optimization Algorithms
DEFF Research Database (Denmark)
Gade-Nielsen, Nicolai Fog; Jørgensen, John Bagterp; Dammann, Bernd
2012-01-01
The introduction of Graphical Processing Units (GPUs) in scientific computing has shown great promise in many different fields. While GPUs are capable of very high floating point performance and memory bandwidth, their massively parallel architecture requires algorithms to be reimplemented to suit...
Advanced noise reduction in placental ultrasound imaging using CPU and GPU: a comparative study
Zombori, G.; Ryan, J.; McAuliffe, F.; Rainford, L.; Moran, M.; Brennan, P.
2010-03-01
This paper presents a comparison of different implementations of 3D anisotropic diffusion speckle noise reduction technique on ultrasound images. In this project we are developing a novel volumetric calcification assessment metric for the placenta, and providing a software tool for this purpose. The tool can also automatically segment and visualize (in 3D) ultrasound data. One of the first steps when developing such a tool is to find a fast and efficient way to eliminate speckle noise. Previous works on this topic by Duan, Q. [1] and Sun, Q. [2] have proven that the 3D noise reducing anisotropic diffusion (3D SRAD) method shows exceptional performance in enhancing ultrasound images for object segmentation. Therefore we have implemented this method in our software application and performed a comparative study on the different variants in terms of performance and computation time. To increase processing speed it was necessary to utilize the full potential of current state of the art Graphics Processing Units (GPUs). Our 3D datasets are represented in a spherical volume format. With the aim of 2D slice visualization and segmentation, a "scan conversion" or "slice-reconstruction" step is needed, which includes coordinate transformation from spherical to Cartesian, re-sampling of the volume and interpolation. Combining the noise filtering and slice reconstruction in one process on the GPU, we can achieve close to real-time operation on high quality data sets without the need for down-sampling or reducing image quality. For the GPU programming OpenCL language was used. Therefore the presented solution is fully portable.
A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction.
Zheng, Shawn Q; Branlund, Eric; Kesthelyi, Bettina; Braunfeld, Michael B; Cheng, Yifan; Sedat, John W; Agard, David A
2011-07-01
Full resolution electron microscopic tomographic (EMT) reconstruction of large-scale tilt series requires significant computing power. The desire to perform multiple cycles of iterative reconstruction and realignment dramatically increases the pressing need to improve reconstruction performance. This has motivated us to develop a distributed multi-GPU (graphics processing unit) system to provide the required computing power for rapid constrained, iterative reconstructions of very large three-dimensional (3D) volumes. The participating GPUs reconstruct segments of the volume in parallel, and subsequently, the segments are assembled to form the complete 3D volume. Owing to its power and versatility, the CUDA (NVIDIA, USA) platform was selected for GPU implementation of the EMT reconstruction. For a system containing 10 GPUs provided by 5 GTX295 cards, 10 cycles of SIRT reconstruction for a tomogram of 4096^2 × 512 voxels from an input tilt series containing 122 projection images of 4096^2 pixels (single precision float) takes a total of 1845 s of which 1032 s are for computation with the remainder being the system overhead. The same system takes only 39 s total to reconstruct 1024^2 × 256 voxels from 122 1024^2 pixel projections. While the system overhead is non-trivial, performance analysis indicates that adding extra GPUs to the system would lead to steadily enhanced overall performance. Therefore, this system can be easily expanded to generate superior computing power for very large tomographic reconstructions and especially to empower iterative cycles of reconstruction and realignment. Copyright © 2011 Elsevier B.V. All rights reserved.
Hagan, Aaron; Sawant, Amit; Folkerts, Michael; Modiri, Arezoo
2018-01-01
We report on the design, implementation and characterization of a multi-graphic processing unit (GPU) computational platform for higher-order optimization in radiotherapy treatment planning. In collaboration with a commercial vendor (Varian Medical Systems, Palo Alto, CA), a research prototype GPU-enabled Eclipse (V13.6) workstation was configured. The hardware consisted of dual 8-core Xeon processors, 256 GB RAM and four NVIDIA Tesla K80 general purpose GPUs. We demonstrate the utility of this platform for large radiotherapy optimization problems through the development and characterization of a parallelized particle swarm optimization (PSO) four dimensional (4D) intensity modulated radiation therapy (IMRT) technique. The PSO engine was coupled to the Eclipse treatment planning system via a vendor-provided scripting interface. Specific challenges addressed in this implementation were (i) data management and (ii) non-uniform memory access (NUMA). For the former, we alternated between parameters over which the computation process was parallelized. For the latter, we reduced the amount of data required to be transferred over the NUMA bridge. The datasets examined in this study were approximately 300 GB in size, including 4D computed tomography images, anatomical structure contours and dose deposition matrices. For evaluation, we created a 4D-IMRT treatment plan for one lung cancer patient and analyzed computation speed while varying several parameters (number of respiratory phases, GPUs, PSO particles, and data matrix sizes). The optimized 4D-IMRT plan enhanced sparing of organs at risk by an average reduction of 26% in maximum dose, compared to the clinical optimized IMRT plan, where the internal target volume was used. We validated our computation time analyses in two additional cases. The computation speed in our implementation did not monotonically increase with the number of GPUs. The optimal number of GPUs (five, in our study) is directly related to the
Nonhazardous Urine Pretreatment Method
Akse, James R.; Holtsnider, John T.
2012-01-01
A method combines solid phase acidification with two non-toxic biocides to prevent ammonia volatilization and microbial proliferation. The safe, non-oxidizing biocide combination consists of a quaternary amine and a food preservative. This combination has exhibited excellent stabilization of both acidified and unacidified urine. During pretreatment tests, composite urine collected from donors was challenged with a microorganism known to proliferate in urine, and then was processed using the nonhazardous urine pre-treatment method. The challenge microorganisms included Escherichia coli, a common gram-negative bacteria; Enterococcus faecalis, a ureolytic gram-positive bacteria; Candida albicans, a yeast commonly found in urine; and Aspergillus niger, a problematic mold that resists urine pre-treatment. Urine processed in this manner remained microbially stable for over 57 days. Such effective urine stabilization was achieved using non-toxic, non-oxidizing biocides at higher pH (3.6 to 5.8) than previous methods in use or projected for use aboard the International Space Station (ISS). ISS urine pretreatment methods employ strong oxidants including ozone and hexavalent chromium (Cr(VI)), a carcinogenic material, under very acidic conditions (pH = 1.8 to 2.4). The method described here offers a much more benign chemical environment than previous pretreatment methods, and will lower equivalent system mass (ESM) by reducing containment volume and mass, system complexity, and crew time needed to handle pre-treatment chemicals. The biocides, being non-oxidizing, minimize the potential for chemical reactions with urine constituents to produce volatile, airborne contaminants such as cyanogen chloride. Additionally, the biocides are active under significantly less acidic conditions than those used in the current system, thereby reducing the degree of required acidification. A simple flow-through solid phase acidification (SPA) bed is employed to overcome the natural buffering
CUDA GPU based full-Stokes finite difference modelling of glaciers
DEFF Research Database (Denmark)
Brædstrup, Christian; Egholm, D.L.
advances in graphics card (GPU) technology for high performance computing have proven extremely efficient in accelerating many large scale scientific computations. The general purpose GPU (GPGPU) technology is cheap, has a low power consumption and fits into a normal desktop computer. It could therefore...
A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116
A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
Directory of Open Access Journals (Sweden)
Dennis Akos
2011-09-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities.
Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
2018-02-01
De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we propose a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we design an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopt an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) are adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improves the efficiency. A numerical example employing real large-scale, 3D seismic data shows that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significantly reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
Holtzapple, Mark T.; Madison, Maxine Jones; Ramirez, Rocio Sierra; Deimund, Mark A.; Falls, Matthew; Dunkelman, John J.
2014-07-01
Methods and apparatus for treating biomass that may include introducing a biomass to a chamber; exposing the biomass in the chamber to a shock event to produce a shocked biomass; and transferring the shocked biomass from the chamber. In some aspects, the method may include pretreating the biomass with a chemical before introducing the biomass to the chamber and/or after transferring shocked biomass from the chamber.
An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
DEFF Research Database (Denmark)
Liu, Weifeng; Vinter, Brian
2014-01-01
matrices. Recent work on GPU SpGEMM has demonstrated rather good time and space complexity, but works best for fairly regular matrices. In this work we present a GPU SpGEMM algorithm that particularly focuses on the above three problems. Memory pre-allocation for the result matrix is organized...... by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach....... Load balancing builds on the number of the necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art GPU SpGEMM methods in the CUSPARSE library and the CUSP library and the latest CPU SpGEMM method in the Intel Math Kernel Library, our...
Semi-automatic tool to ease the creation and optimization of GPU programs
DEFF Research Database (Denmark)
Jepsen, Jacob
2014-01-01
We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form...... and apply optimizations thereto. Based on experiments, the code generated by the tool can be 3-256X faster than code generated by an OpenACC compiler, 4-37X faster than optimized CPU code, and attain up to 25% of peak performance of the GPU. We found that by using pattern-matching rules, many...... of the transformations can be performed automatically, which makes the tool usable for both novices and experts in GPU programming....
Nguyen, Van-Giang; Lee, Soo-Jin
2016-07-01
Iterative reconstruction from Compton scattered data is known to be computationally more challenging than that from conventional line-projection based emission data in that the gamma rays that undergo Compton scattering are modeled as conic projections rather than line projections. In conventional tomographic reconstruction, to parallelize the projection and backprojection operations using the graphics processing unit (GPU), approximated methods that use an unmatched pair of ray-tracing forward projector and voxel-driven backprojector have been widely used. In this work, we propose a new GPU-accelerated method for Compton camera reconstruction which is more accurate because it uses an exactly matched pair of projector and backprojector. To calculate the conic forward projection, we first sample the cone surface into conic rays and accumulate the intersecting chord lengths of the conic rays passing through voxels using a fast ray-tracing method (RTM). For the conic backprojection, to obtain the true adjoint of the conic forward projection while retaining the computational efficiency of the GPU, we use a voxel-driven RTM which is essentially the same as the standard RTM used for the conic forward projector. Our simulation results show that, while the new method is about 3 times slower than the approximated method, it is still about 16 times faster than the CPU-based method without any loss of accuracy. The net conclusion is that our proposed method is guaranteed to retain the reconstruction accuracy regardless of the number of iterations by providing a perfectly matched projector-backprojector pair, which makes iterative reconstruction methods for Compton imaging faster and more accurate. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
A GPU offloading mechanism for LHCb
Badalov, Alexey; Campora Perez, Daniel Hugo; Zvyagin, Alexander; Neufeld, Niko; Vilasis Cardona, Xavier
2014-06-01
The current computational infrastructure at LHCb is designed for sequential execution. It is possible to make use of modern multi-core machines by using multi-threaded algorithms and running multiple instances in parallel, but there is no way to make efficient use of specialized massively parallel hardware, such as graphical processing units and Intel Xeon/Phi. We extend the current infrastructure with an out-of-process computational server able to gather data from multiple instances and process them in large batches.
Park, Hyeong-Gyu; Shin, Yeong-Gil; Lee, Ho
2015-12-01
A ray-driven backprojector is based on ray-tracing, which computes the length of the intersection between the ray paths and each voxel to be reconstructed. To reduce the computational burden caused by these exhaustive intersection tests, we propose a fully graphics processing unit (GPU)-based ray-driven backprojector in conjunction with a ray-culling scheme that enables straightforward parallelization without compromising the high computing performance of a GPU. The purpose of the ray-culling scheme is to reduce the number of ray-voxel intersection tests by excluding rays irrelevant to a specific voxel computation. This rejection step is based on an axis-aligned bounding box (AABB) enclosing a region of voxel projection, where eight vertices of each voxel are projected onto the detector plane. The range of the rectangular-shaped AABB is determined by min/max operations on the coordinates in the region. Using the indices of pixels inside the AABB, the rays passing through the voxel can be identified and the voxel is weighted as the length of intersection between the voxel and the ray. This procedure makes it possible to reflect voxel-level parallelization, allowing an independent calculation at each voxel, which is feasible for a GPU implementation. To eliminate redundant calculations during ray-culling, a shared-memory optimization is applied to exploit the GPU memory hierarchy. In experimental results using real measurement data with phantoms, the proposed GPU-based ray-culling scheme reconstructed a volume of resolution 280 × 280 × 176 in 77 seconds from 680 projections of resolution 1024 × 768, which is 26 times and 7.5 times faster than standard CPU-based and GPU-based ray-driven backprojectors, respectively. Qualitative and quantitative analyses showed that the ray-driven backprojector provides high-quality reconstruction images when compared with those generated by the Feldkamp-Davis-Kress algorithm using a pixel-driven backprojector, with an average of 2.5 times
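The AABB rejection step described in this abstract (project the eight voxel vertices, then take min/max) can be illustrated with a small CPU-side Python sketch. The geometry is an assumption made for the example: a point source at the origin and a flat detector in the plane z = `source_to_detector`, centred on the z axis. The function name and parameters are hypothetical and do not come from the paper.

```python
import math
from itertools import product


def voxel_aabb_on_detector(voxel_min, voxel_size, source_to_detector,
                           pixel_pitch, detector_shape):
    """Return the inclusive pixel range (u0, u1, v0, v1) hit by a voxel, or None.

    Assumed geometry (illustrative only): point source at the origin, flat
    detector in the plane z = source_to_detector, voxel entirely at z > 0.
    """
    n_u, n_v = detector_shape
    us, vs = [], []
    # Project the eight voxel vertices onto the detector plane.
    for dx, dy, dz in product((0.0, voxel_size), repeat=3):
        x, y, z = voxel_min[0] + dx, voxel_min[1] + dy, voxel_min[2] + dz
        scale = source_to_detector / z  # perspective magnification
        us.append(x * scale / pixel_pitch + n_u / 2.0)
        vs.append(y * scale / pixel_pitch + n_v / 2.0)
    # Min/max over the projected vertices gives the AABB; clip it to the detector.
    u0, u1 = max(0, math.floor(min(us))), min(n_u - 1, math.floor(max(us)))
    v0, v1 = max(0, math.floor(min(vs))), min(n_v - 1, math.floor(max(vs)))
    if u0 > u1 or v0 > v1:
        return None  # the voxel projects entirely outside the detector
    return u0, u1, v0, v1


box = voxel_aabb_on_detector((-0.5, -0.5, 100.0), 1.0, 200.0, 1.0, (16, 16))
print(box)
```

Only rays ending at pixels inside the returned range need an exact ray-voxel intersection test; in this example that is 9 of the 256 detector pixels.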
3D fast adaptive correlation imaging for large-scale gravity data based on GPU computation
Chen, Z.; Meng, X.; Guo, L.; Liu, G.
2011-12-01
continue to perform 3D correlation imaging for the residual gravity data. After several iterations, we can obtain satisfactory results. Newly developed general purpose computing technology from Nvidia GPU (Graphics Processing Unit) has been put into practice and received widespread attention in many areas. Based on the GPU programming mode and two parallel levels, five CPU loops for the main computation of 3D correlation imaging are converted into three loops in GPU kernel functions, thus achieving GPU/CPU collaborative computing. The two inner loops are defined as the dimensions of blocks and the three outer loops are defined as the dimensions of threads, thus realizing the double loop block calculation. Theoretical and real gravity data tests show that results are reliable and the computing time is greatly reduced. Acknowledgments We acknowledge the financial support of Sinoprobe project (201011039 and 201011049-03), the Fundamental Research Funds for the Central Universities (2010ZY26 and 2011PY0183), the National Natural Science Foundation of China (41074095) and the Open Project of State Key Laboratory of Geological Processes and Mineral Resources (GPMR0945).
High performance GPU processing for inversion using uniform grid searches
Venetis, Ioannis E.; Saltogianni, Vasso; Stiros, Stathis; Gallopoulos, Efstratios
2017-04-01
Many geophysical problems are described by systems of redundant, highly non-linear systems of ordinary equations with constant terms deriving from measurements and hence representing stochastic variables. Solution (inversion) of such problems is based on numerical, optimization methods, based on Monte Carlo sampling or on exhaustive searches in cases of two or even three "free" unknown variables. Recently the TOPological INVersion (TOPINV) algorithm, a grid search-based technique in the R^n space, has been proposed. TOPINV is not based on the minimization of a certain cost function and involves only forward computations, hence avoiding computational errors. The basic concept is to transform observation equations into inequalities on the basis of an optimization parameter k and of their standard errors, and through repeated "scans" of n-dimensional search grids for decreasing values of k to identify the optimal clusters of gridpoints which satisfy observation inequalities and by definition contain the "true" solution. Stochastic optimal solutions and their variance-covariance matrices are then computed as first and second statistical moments. Such exhaustive uniform searches produce an excessive computational load and are extremely time consuming for common computers based on a CPU. An alternative is to use a computing platform based on a GPU, which nowadays is affordable to the research community, which provides a much higher computing performance. Using the CUDA programming language to implement TOPINV allows the investigation of the attained speedup in execution time on such a high performance platform. Based on synthetic data we compared the execution time required for two typical geophysical problems, modeling magma sources and seismic faults, described with up to 18 unknown variables, on both CPU/FORTRAN and GPU/CUDA platforms. The same problems for several different sizes of search grids (up to 10^12 gridpoints) and numbers of unknown variables were solved on
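The grid-scan idea summarized in this abstract (observation equations turned into inequalities scaled by k, repeated scans for decreasing k, then first and second moments of the surviving gridpoints) can be sketched serially in Python. This is a toy illustration under stated assumptions, not the authors' TOPINV code: the function names, the stopping rule (`min_points`), and the three synthetic observation equations are all hypothetical.

```python
from itertools import product


def topinv_scan(models, observations, sigmas, axes, k_values, min_points=1):
    """Keep gridpoints with |f_i(x) - l_i| <= k * sigma_i for every equation i.

    Returns (k, cluster) for the smallest k that still leaves min_points points.
    """
    best = None
    for k in sorted(k_values, reverse=True):  # decreasing k tightens the inequalities
        cluster = [
            x for x in product(*axes)
            if all(abs(f(x) - l) <= k * s
                   for f, l, s in zip(models, observations, sigmas))
        ]
        if len(cluster) < min_points:
            break  # tightening further would leave too few gridpoints
        best = (k, cluster)
    return best


def mean_and_covariance(points):
    """First and second statistical moments of a cluster of gridpoints."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(dim)] for a in range(dim)]
    return mean, cov


# Synthetic problem with true solution (2, 3) and three redundant equations.
models = [lambda x: x[0] + x[1], lambda x: x[0] - x[1], lambda x: x[0] * x[1]]
observations = [5.0, -1.0, 6.0]
sigmas = [0.5, 0.5, 0.5]
axis = [i * 0.5 for i in range(11)]  # 0.0, 0.5, ..., 5.0
k, cluster = topinv_scan(models, observations, sigmas, [axis, axis], [4, 2, 1, 0.5])
mean, cov = mean_and_covariance(cluster)
print(k, mean, cov)
```

Every gridpoint is tested independently of the others, which is why this kind of exhaustive scan maps naturally onto one GPU thread per gridpoint.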
Parallelized Local Volatility Estimation Using GP-GPU Hardware Acceleration
Douglas, Craig C.
2010-01-01
We introduce an inverse problem for the local volatility model in option pricing. We solve the problem using the Levenberg-Marquardt algorithm and use the notion of the Fréchet derivative when calculating the Jacobian matrix. We analyze the existence of the Fréchet derivative and its numerical computation. To reduce the computational time of the inverse problem, a GP-GPU environment is considered for parallel computation. Numerical results confirm the validity and efficiency of the proposed method. ©2010 IEEE.
Understanding the impact of ionic liquid pretreatment on eucalyptus
Energy Technology Data Exchange (ETDEWEB)
Centikol, Ozgul [Joint Bioenergy Institute; Dibble, Dean [Joint Bioenergy Institute; Cheng, Gang [Joint Bioenergy Institute; Kent, Michael S [ORNL; Knierim, Manfred [Joint Bioenergy Institute; Melnichenko, Yuri B [ORNL
2010-01-01
The development of cost-competitive biofuels necessitates the realization of advanced biomass pretreatment technologies. Ionic liquids provide a basis for one of the most promising pretreatment technologies and are known to allow effective processing of cellulose and some biomass species. Here, we demonstrate that the ionic liquid 1-ethyl-3-methyl imidazolium acetate, [C2mim][OAc], induces structural changes at the molecular level in the cell wall of Eucalyptus globulus. Deacetylation of xylan, acetylation of the lignin units, selective removal of guaiacyl units (increasing the syringyl:guaiacyl ratio) and decreased β-ether content were the most prominent changes observed. Scanning electron microscopy images of the plant cell wall sections reveal extensive swelling during [C2mim][OAc] pretreatment. X-ray diffraction measurements indicate a change in cellulose crystal structure from cellulose I to cellulose II after [C2mim][OAc] pretreatment. Enzymatic saccharification of the pretreated material produced increased sugar yields and improved hydrolysis kinetics after [C2mim][OAc] pretreatment. These results provide new insight into the mechanism of ionic liquid pretreatment and reaffirm that this approach may be promising for the production of cellulosic biofuels from woody biomass.
Pseudo-random number generators for Monte Carlo simulations on ATI Graphics Processing Units
Demchik, Vadim
2011-03-01
Basic uniform pseudo-random number generators are implemented on ATI Graphics Processing Units (GPU). The performance results of the realized generators (multiplicative linear congruential (GGL), XOR-shift (XOR128), RANECU, RANMAR, RANLUX and Mersenne Twister (MT19937)) on CPU and GPU are discussed. The obtained speed up factor is hundreds of times in comparison with CPU. RANLUX generator is found to be the most appropriate for using on GPU in Monte Carlo simulations. The brief review of the pseudo-random number generators used in modern software packages for Monte Carlo simulations in high-energy physics is presented.
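Of the generators listed in this abstract, XOR128 is the simplest to show in full. Below is a CPU-side Python sketch of Marsaglia's xor128 recurrence with his published default seeds; it illustrates the algorithm only and is not the ATI GPU implementation from the paper. The class name and the `random` helper are hypothetical.

```python
MASK32 = 0xFFFFFFFF


class Xor128:
    """Marsaglia's 128-bit XOR-shift generator (four 32-bit state words)."""

    def __init__(self, x=123456789, y=362436069, z=521288629, w=88675123):
        self.state = [x & MASK32, y & MASK32, z & MASK32, w & MASK32]
        if not any(self.state):
            raise ValueError("the all-zero state is a fixed point; seed must be non-zero")

    def next_u32(self):
        """Advance the state and return the next 32-bit unsigned integer."""
        x, y, z, w = self.state
        t = (x ^ (x << 11)) & MASK32  # emulate 32-bit wrap-around of the left shift
        w_new = (w ^ (w >> 19)) ^ (t ^ (t >> 8))
        self.state = [y, z, w, w_new]
        return w_new

    def random(self):
        """Uniform float in [0, 1)."""
        return self.next_u32() / 2**32


rng = Xor128()
first = rng.next_u32()
print(first)  # → 3701687786
```

On a GPU, each thread would typically hold its own four-word state with a distinct seed, so that streams can advance without synchronization.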
Energy Technology Data Exchange (ETDEWEB)
Hernandez D, V. M.; Moreno M, A.; Ortiz L, M. A. [Universidad de Cordoba, 14002 Cordoba (Spain); Vega C, H. R.; Alonso M, O. E., E-mail: vic.mc68010@gmail.com [Universidad Autonoma de Zacatecas, 98000 Zacatecas, Zac. (Mexico)
2016-10-15
The computing power of personal computers has been increasing; computers now have several processors in the CPU and, in addition, multiple CUDA cores in the graphics processing unit (GPU); both systems can be used individually or combined to perform scientific computation without resorting to processor or supercomputing arrangements. The Bonner sphere spectrometer is the most commonly used multi-element system for neutron detection and for determining the associated spectrum. Each sphere-detector combination gives a particular response that depends on the energy of the neutrons, and the total set of these responses is known as the response matrix Rφ(E). Thus, the counting rates obtained with each sphere and the neutron spectrum are related through the Fredholm equation in its discrete version. Reconstructing the spectrum involves a poorly conditioned system of equations with an infinite number of solutions; to find the appropriate solution, the use of artificial intelligence through neural networks on different platforms, CPU and GPU, has been proposed. (Author)
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-04-07
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
Porting AMG2013 to Heterogeneous CPU+GPU Nodes
Energy Technology Data Exchange (ETDEWEB)
Samfass, Philipp [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2017-01-26
LLNL's future advanced technology system SIERRA will feature heterogeneous compute nodes that consist of IBM PowerV9 CPUs and NVIDIA Volta GPUs. Conceptually, the motivation for such an architecture is quite straightforward: While GPUs are optimized for throughput on massively parallel workloads, CPUs strive to minimize latency for rather sequential operations. Yet, making optimal use of heterogeneous architectures raises new challenges for the development of scalable parallel software, e.g., with respect to work distribution. Porting LLNL's parallel numerical libraries to upcoming heterogeneous CPU+GPU architectures is therefore a critical factor for ensuring LLNL's future success in fulfilling its national mission. One of these libraries, called HYPRE, provides parallel solvers and preconditioners for large, sparse linear systems of equations. In the context of this internship project, I consider AMG2013 which is a proxy application for major parts of HYPRE that implements a benchmark for setting up and solving different systems of linear equations. In the following, I describe in detail how I ported multiple parts of AMG2013 to the GPU (Section 2) and present results for different experiments that demonstrate a successful parallel implementation on the heterogeneous machines surface and ray (Section 3). In Section 4, I give guidelines on how my code should be used. Finally, I conclude and give an outlook for future work (Section 5).
Heterogeneous CPU-GPU moving targets detection for UAV video
Li, Maowen; Tang, Linbo; Han, Yuqi; Yu, Chunlei; Zhang, Chao; Fu, Huiquan
2017-07-01
Moving target detection is gaining popularity in civilian and military applications. On some motion-detection monitoring platforms, low-resolution stationary cameras are being replaced by moving HD cameras mounted on UAVs. The pixels of moving targets in HD video taken by a UAV are always in a minority, and the background of the frame is usually moving because of the motion of the UAV. The high computational cost of the algorithm prevents running it at higher frame resolutions. Hence, to solve the problem of moving target detection in UAV video, we propose a heterogeneous CPU-GPU moving target detection algorithm for UAV video. More specifically, we use background registration to eliminate the impact of the moving background and frame difference to detect small moving targets. In order to achieve real-time processing, we design a heterogeneous CPU-GPU framework for our method. The experimental results show that our method can detect the main moving targets from the HD video taken by UAV, and the average processing time is 52.16 ms per frame, which is fast enough to solve the problem.
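The two steps named in this abstract, background registration followed by frame differencing, can be illustrated with a minimal CPU-only sketch. It is not the authors' algorithm: `shift_frame` is a hypothetical stand-in that assumes the camera motion is a known whole-pixel translation, and `frame_difference` uses an arbitrary threshold chosen for illustration.

```python
def shift_frame(frame, dx, dy, fill=0):
    """Translate a grayscale frame by (dx, dy) pixels.

    Toy stand-in for background registration: it assumes the background
    motion between frames is a known whole-pixel translation.
    """
    height, width = len(frame), len(frame[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = frame[src_y][src_x]
    return out


def frame_difference(prev, curr, threshold=25):
    """Return a binary mask of pixels whose intensity changed by more than threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


# Usage: a static background with one bright pixel appearing in the second frame.
previous = [[10, 10, 10, 10] for _ in range(3)]
current = [row[:] for row in previous]
current[1][2] = 200
mask = frame_difference(previous, current)
```

Each output pixel depends only on the matching input pixels, which is why this kind of per-pixel step is a natural candidate for GPU offload.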
Qualitative and quantitative improvements of PET reconstruction on GPU architecture
International Nuclear Information System (INIS)
Autret, Awen
2016-01-01
In positron emission tomography, reconstructed images suffer from a high noise level and a low resolution. Iterative reconstruction processes require an estimation of the system response (scanner and patient) and the quality of the images depends on the accuracy of this estimate. Accurate and fast-to-compute models already exist for the attenuation, scattering, random coincidences and dead times. Thus, this thesis focuses on modeling the system components associated with the detector response and the positron range. A new multi-GPU parallelization of the reconstruction based on a cutting of the volume is also proposed to speed up the reconstruction by exploiting the computing power of such architectures. The proposed detector response model is based on a multi-ray approach that includes all the detector effects, such as the geometry and the scattering in the crystals. An evaluation study based on data obtained through Monte Carlo simulation (MCS) showed that this model provides reconstructed images with a better contrast-to-noise ratio and resolution compared with those of state-of-the-art methods. The proposed positron range model is based on a simplified MCS, integrated into the forward projector during the reconstruction. A GPU implementation of this method allows running MCS three orders of magnitude faster than the same simulation on GATE, while providing similar results. An evaluation study shows that this model, integrated in the reconstruction, gives images with better contrast recovery and resolution while avoiding artifacts. (author)
G.A.M.E.: GPU-accelerated mixture elucidator.
Schurz, Alioune; Su, Bo-Han; Tu, Yi-Shu; Lu, Tony Tsung-Yu; Lin, Olivia A; Tseng, Yufeng J
2017-09-15
GPU acceleration is useful in solving complex chemical information problems. Identifying unknown structures from the mass spectra of natural product mixtures has been a desirable yet unresolved issue in metabolomics. However, this elucidation process has been hampered by complex experimental data and the inability of instruments to completely separate different compounds. Fortunately, with current high-resolution mass spectrometry, one feasible strategy is to define this problem as extending a scaffold database with sidechains of different probabilities to match the high-resolution mass obtained from a high-resolution mass spectrum. By introducing a dynamic programming (DP) algorithm, it is possible to solve this NP-complete problem in pseudo-polynomial time. However, the running time of the DP algorithm grows by orders of magnitude as the number of mass decimal digits increases, thus limiting the boost in structural prediction capabilities. By harnessing the heavily parallel architecture of modern GPUs, we designed a "compute unified device architecture" (CUDA)-based GPU-accelerated mixture elucidator (G.A.M.E.) that considerably improves the performance of the DP, allowing up to five decimal digits for input mass data. As exemplified by four testing datasets with verified constitutions from natural products, G.A.M.E. allows for efficient and automatic structural elucidation of unknown mixtures for practical procedures.
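The pseudo-polynomial behaviour described in this abstract can be seen in a generic subset-sum style DP. The sketch below is an illustration only, not the G.A.M.E. algorithm: it ignores sidechain probabilities, allows each sidechain to be reused, and the function name and parameters are hypothetical. Masses are scaled to integers by `10 ** digits`, so the table grows tenfold with every extra decimal digit.

```python
def mass_is_reachable(scaffold_mass, sidechain_masses, target_mass, digits):
    """Decide whether scaffold + some multiset of sidechains hits the target mass.

    All masses are rounded to `digits` decimal places and scaled to integers,
    so both table size and running time are proportional to
    target_mass * 10**digits (pseudo-polynomial).
    """
    scale = 10 ** digits
    target = round(target_mass * scale)
    start = round(scaffold_mass * scale)
    steps = [round(m * scale) for m in sidechain_masses]
    if start > target:
        return False

    reachable = [False] * (target + 1)
    reachable[start] = True
    for mass in range(start, target + 1):
        if not reachable[mass]:
            continue
        for step in steps:
            # Skip zero-mass steps; they cannot change the reachable set.
            if step > 0 and mass + step <= target:
                reachable[mass + step] = True
    return reachable[target]


# Usage: scaffold of mass 100.0 extended by sidechains of mass 15.0 and 14.0.
found = mass_is_reachable(100.0, [15.0, 14.0], 129.0, digits=1)
missing = mass_is_reachable(100.0, [15.0, 14.0], 120.0, digits=1)
```

Every table entry for a given mass can be checked independently once the smaller masses are known, which hints at why such a table suits parallel hardware.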
A GPU-accelerated implicit meshless method for compressible flows
Zhang, Jia-Le; Ma, Zhi-Hua; Chen, Hong-Quan; Cao, Cheng
2018-05-01
This paper develops a recently proposed GPU based two-dimensional explicit meshless method (Ma et al., 2014) by devising and implementing an efficient parallel LU-SGS implicit algorithm to further improve the computational efficiency. The capability of the original 2D meshless code is extended to deal with 3D complex compressible flow problems. To resolve the inherent data dependency of the standard LU-SGS method, which causes thread-racing conditions destabilizing numerical computation, a generic rainbow coloring method is presented and applied to organize the computational points into different groups by painting neighboring points with different colors. The original LU-SGS method is modified and parallelized accordingly to perform calculations in a color-by-color manner. The CUDA Fortran programming model is employed to develop the key kernel functions to apply boundary conditions, calculate time steps, evaluate residuals as well as advance and update the solution in the temporal space. A series of two- and three-dimensional test cases including compressible flows over single- and multi-element airfoils and a M6 wing are carried out to verify the developed code. The obtained solutions agree well with experimental data and other computational results reported in the literature. Detailed analysis on the performance of the developed code reveals that the developed CPU based implicit meshless method is at least four to eight times faster than its explicit counterpart. The computational efficiency of the implicit method could be further improved by ten to fifteen times on the GPU.
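The coloring idea in this abstract, grouping points so that no two neighbors share a color and then processing one color at a time, can be sketched with a plain greedy coloring. This is a generic illustration under the assumption that the neighbor lists are given as a dictionary; it is not the paper's rainbow coloring implementation, and the function names are hypothetical.

```python
def greedy_coloring(neighbors):
    """Assign each point the smallest color not used by its already-colored neighbors.

    neighbors: dict mapping a point id to an iterable of neighboring point ids.
    Returns a dict mapping point id -> color index.
    """
    colors = {}
    for point in sorted(neighbors):
        used = {colors[q] for q in neighbors[point] if q in colors}
        color = 0
        while color in used:
            color += 1
        colors[point] = color
    return colors


def color_groups(colors):
    """Group point ids by color; points within one group have no mutual dependency."""
    groups = {}
    for point, color in colors.items():
        groups.setdefault(color, []).append(point)
    return [groups[c] for c in sorted(groups)]


# Usage: four points on a line, each coupled to its immediate neighbors.
cloud = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
coloring = greedy_coloring(cloud)
groups = color_groups(coloring)
```

Within one group no point reads a value that another point in the same group writes, so the group can be updated by concurrent threads without races.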
Implementation of meso-scale radioactive dispersion model for GPU
Energy Technology Data Exchange (ETDEWEB)
Sunarko [National Nuclear Energy Agency of Indonesia (BATAN), Jakarta (Indonesia). Nuclear Energy Assessment Center; Suud, Zaki [Bandung Institute of Technology (ITB), Bandung (Indonesia). Physics Dept.
2017-05-15
The Lagrangian Particle Dispersion Method (LPDM) is applied to model atmospheric dispersion of radioactive material on a meso-scale of a few tens of kilometers for site study purposes. Empirical relationships are used to determine the dispersion coefficient for various atmospheric stabilities. A diagnostic 3-D wind field is solved based on data from one meteorological station using the mass-conservation principle. Particles representing the radioactive pollutant are dispersed in the wind field as a point source. Time-integrated air concentration is calculated using a kernel density estimator (KDE) in the lowest layer of the atmosphere. Parallel code is developed for a GTX-660Ti GPU with a total of 1 344 scalar processors using CUDA. A test of a 1-hour release shows that linear speedup is achieved starting at 28 800 particles-per-hour (pph) up to about 20 x at 144 000 pph. Another test simulating a 6-hour release with 36 000 pph resulted in a speedup of about 60 x. Statistical analysis reveals that the resulting grid doses are nearly identical in both CPU and GPU versions of the code.
Graphics Processing Units for HEP trigger systems
Energy Technology Data Exchange (ETDEWEB)
Ammendola, R. [INFN Sezione di Roma “Tor Vergata”, Via della Ricerca Scientifica 1, 00133 Roma (Italy); Bauce, M. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Biagioni, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Chiozzi, S.; Cotta Ramusino, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Fantechi, R. [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); CERN, Geneve (Switzerland); Fiorini, M. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Giagu, S. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Gianoli, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Lamanna, G., E-mail: gianluca.lamanna@cern.ch [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); INFN Laboratori Nazionali di Frascati, Via Enrico Fermi 40, 00044 Frascati (Roma) (Italy); Lonardo, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Messina, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); and others
2016-07-11
General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerator in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPU for synchronous low level trigger, focusing on CERN NA62 experiment trigger system. The use of GPU in higher level trigger system is also briefly considered.
MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure.
Kim, Jihoon; Levy, Eric; Ferbrache, Alex; Stepanowsky, Petra; Farcas, Claudiu; Wang, Shuang; Brunner, Stefan; Bath, Tyler; Wu, Yuan; Ohno-Machado, Lucila
2014-10-01
MAGI is a web service for fast MicroRNA-Seq data analysis in a graphics processing unit (GPU) infrastructure. Using just a browser, users have access to results as web reports in just a few hours, a >600% end-to-end performance improvement over the state of the art. MAGI's salient features are (i) transfer of large input files in native FASTA with Qualities (FASTQ) format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU devices, (iii) all-in-one analytics with novel feature extraction, statistical test for differential expression and diagnostic plot generation for quality control and (iv) interactive visualization and exploration of results in web reports that are readily available for publication. MAGI relies on the Node.js JavaScript framework, along with NVIDIA CUDA C, PHP: Hypertext Preprocessor (PHP), Perl and R. It is freely available at http://magi.ucsd.edu. © The Author 2014. Published by Oxford University Press.
An SDR-Based Real-Time Testbed for GNSS Adaptive Array Anti-Jamming Algorithms Accelerated by GPU.
Xu, Hailong; Cui, Xiaowei; Lu, Mingquan
2016-03-11
Nowadays, software-defined radio (SDR) has become a common approach to evaluate new algorithms. However, in the field of Global Navigation Satellite System (GNSS) adaptive array anti-jamming, previous work has been limited by the high computational power demanded by adaptive algorithms, and often lacks flexibility and configurability. In this paper, the design and implementation of an SDR-based real-time testbed for GNSS adaptive array anti-jamming accelerated by a Graphics Processing Unit (GPU) are documented. This testbed stands out as a feature-rich and extendible platform with great flexibility and configurability, as well as high computational performance. Both Space-Time Adaptive Processing (STAP) and Space-Frequency Adaptive Processing (SFAP) are implemented with a wide range of parameters. Raw data from as many as eight antenna elements can be processed in real-time in either an adaptive nulling or beamforming mode. To take full advantage of the parallelism provided by the GPU, a batched programming method is proposed. Tests and experiments are conducted to evaluate both the computational and anti-jamming performance. This platform can be used for research and prototyping, as well as a real product in certain applications.
Budavari, Tamas; Langmead, Ben; Wheelan, Sarah J.; Salzberg, Steven L.; Szalay, Alexander S.
2015-01-01
When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We followed this approach in implementing a read aligner called Arioc that uses GPU-based parallel sort and reduction techniques to identify high-priority locations where potential alignments may be found. We then carried out a read-by-read comparison of Arioc’s reported alignments with the alignments found by several leading read aligners. With simulated reads, Arioc has comparable or better accuracy than the other read aligners we tested. With human sequencing reads, Arioc demonstrates significantly greater throughput than the other aligners we evaluated across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license. PMID:25780763
Directory of Open Access Journals (Sweden)
Christian F. Janßen
2015-07-01
This contribution is dedicated to demonstrating the high potential and manifold applications of state-of-the-art computational fluid dynamics (CFD) tools for free-surface flows in civil and environmental engineering. All simulations were performed with the academic research code ELBE (efficient lattice Boltzmann environment, http://www.tuhh.de/elbe). The ELBE code follows the supercomputing-on-the-desktop paradigm and is especially designed for local supercomputing, without tedious accesses to supercomputers. ELBE uses graphics processing units (GPU) to accelerate the computations and can be used in a single GPU-equipped workstation of, e.g., a design engineer. The code has been successfully validated in very different fields, mostly related to naval architecture and mechanical engineering. In this contribution, we give an overview of past and present applications with practical relevance for civil engineers. The presented applications are grouped into three major categories: (i) tsunami simulations, considering wave propagation, wave runup, inundation and debris flows; (ii) dam break simulations; and (iii) numerical wave tanks for the calculation of hydrodynamic loads on fixed and moving bodies. This broad range of applications in combination with accurate numerical results and very competitive times to solution demonstrates that modern CFD tools in general, and the ELBE code in particular, can be a helpful design tool for civil and environmental engineers.
Ha, Sanghyun; Park, Junshin; You, Donghyun
2017-11-01
The utility of the computational power of modern Graphics Processing Units (GPUs) is examined for solutions of the incompressible Navier-Stokes equations, which are integrated using a semi-implicit fractional-step method. Due to its serial and bandwidth-bound nature, the present choice of numerical methods is considered to be a good candidate for evaluating the potential of GPUs for solving Navier-Stokes equations using non-explicit time integration. An efficient algorithm is presented for GPU acceleration of the Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution method used in the semi-implicit fractional-step method. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Extension to multiple NVIDIA GPUs is implemented using NVLink supported by the Pascal architecture. The performance of the present method is evaluated on multiple Tesla P100 GPUs and compared with a single-core Xeon E5-2650 v4 CPU in simulations of boundary-layer flow over a flat plate. Supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (Ministry of Science, ICT and Future Planning NRF-2016R1E1A2A01939553, NRF-2014R1A2A1A11049599, and Ministry of Trade, Industry and Energy 201611101000230).
Directory of Open Access Journals (Sweden)
Richard Wilton
2015-03-01
When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We followed this approach in implementing a read aligner called Arioc that uses GPU-based parallel sort and reduction techniques to identify high-priority locations where potential alignments may be found. We then carried out a read-by-read comparison of Arioc’s reported alignments with the alignments found by several leading read aligners. With simulated reads, Arioc has comparable or better accuracy than the other read aligners we tested. With human sequencing reads, Arioc demonstrates significantly greater throughput than the other aligners we evaluated across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.
International Nuclear Information System (INIS)
Perrotte, Lancelot; Bodin, Bruno; Chodorge, Laurent
2011-01-01
Before an intervention on a nuclear site, it is essential to study different scenarios to identify the least dangerous one for the operator. Therefore, an efficient dosimetry simulation code with accurate results is required. One classical method in radiation protection is the straight-line attenuation method with build-up factors. In the case of 3D industrial scenes composed of meshes, the computation cost lies in the fast computation of all of the intersections between the rays and the triangles of the scene. Efficient GPU algorithms have already been proposed that enable dosimetry calculation for a huge scene (800000 rays, 800000 triangles) in a fraction of a second. But these algorithms are not robust: because of the rounding caused by floating-point arithmetic, the numerical results of the ray-triangle intersection tests can differ from the expected mathematical results. In the worst case, this can lead to a computed dose rate dramatically lower than the real dose rate to which the operator is exposed. In this paper, we present a hybrid GPU-CPU algorithm to manage adaptive precision floating-point arithmetic. This algorithm allows robust ray-triangle intersection tests, with very small loss of performance (less than 5 % overhead), and without any need for scene-dependent tuning. (author)
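To make the ray-triangle intersection test in this abstract concrete, here is a minimal sketch of the widely used Möller-Trumbore test in ordinary double-precision arithmetic. It is not the adaptive-precision hybrid algorithm of the paper; it is the kind of plain floating-point test whose sign decisions (`u`, `v`, `det`) can be flipped by rounding when a ray passes very close to a triangle edge. The tolerance `eps` is an assumed value.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-12):
    """Return the ray parameter t of the hit point, or None if the ray misses.

    Moller-Trumbore test: solves origin + t*direction = v0 + u*e1 + v*e2
    and accepts the hit when u >= 0, v >= 0, u + v <= 1 and t >= 0.
    """
    e1 = _sub(v1, v0)
    e2 = _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:          # ray (nearly) parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    t_vec = _sub(origin, v0)
    u = _dot(t_vec, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(t_vec, e1)
    v = _dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(e2, q) * inv_det
    return t if t >= 0.0 else None


# Usage: a ray fired straight down onto a unit right triangle in the z = 0 plane.
triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = ray_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *triangle)
miss = ray_triangle((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *triangle)
```

A missed intersection means a shielding layer is skipped along the ray, which is how a rounding error can translate into an underestimated dose rate.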
High-Performance Pseudo-Random Number Generation on Graphics Processing Units
Nandapalan, Nimalan; Brent, Richard P.; Murray, Lawrence M.; Rendell, Alistair
2011-01-01
This work considers the deployment of pseudo-random number generators (PRNGs) on graphics processing units (GPUs), developing an approach based on the xorgens generator to rapidly produce pseudo-random numbers of high statistical quality. The chosen algorithm has configurable state size and period, making it ideal for tuning to the GPU architecture. We present a comparison of both speed and statistical quality with other common parallel, GPU-based PRNGs, demonstrating favourable performance o...
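As background for this record, the sketch below shows the basic xorshift step (Marsaglia's 32-bit variant with shifts 13, 17, 5) on which generators of this family build. It is not xorgens itself, which uses a larger configurable state combined with a Weyl sequence, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """Advance a nonzero 32-bit state by one xorshift step (shifts 13, 17, 5)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state & MASK32


def uniform_stream(seed, count):
    """Return `count` floats in [0, 1) produced from a nonzero 32-bit seed."""
    state = seed & MASK32
    if state == 0:
        raise ValueError("xorshift state must be nonzero")
    values = []
    for _ in range(count):
        state = xorshift32(state)
        values.append(state / 2**32)
    return values


# Usage: two streams with the same seed are identical.
sample = uniform_stream(2011, 5)
```

On a GPU, one common arrangement is to give each thread its own state; how the states are seeded and spaced is part of what distinguishes the generators compared in the paper.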
An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU
International Nuclear Information System (INIS)
Yoon, Jong Seon; Choi, Hyoung Gwon; Jeon, Byoung Jin
2017-01-01
The performance of the colored Gauss–Seidel solver on CPU and GPU was investigated for two- and three-dimensional heat conduction problems by using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU yielded good performance for small problems, but its performance deteriorated for large problems when the total memory required for computing was larger than the cache memory. In contrast, the GPU performed better as the mesh size increased because of the latency hiding technique. Further, GPU computation by the colored Gauss–Seidel solver was approximately 7 times faster than that by the single CPU. Furthermore, the colored Gauss–Seidel solver was found to be approximately twice as fast as the Jacobi solver when parallel computing was conducted on the GPU.
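A two-color (red-black) ordering is the simplest instance of a colored Gauss–Seidel solver for a finite-difference heat conduction problem. The sketch below applies it to the steady 2D case with fixed boundary values on a uniform grid; it is a serial illustration of the ordering under those assumptions, not the solver benchmarked in the paper.

```python
def red_black_gauss_seidel(grid, sweeps):
    """Relax the interior of a 2D grid toward the steady heat-conduction solution.

    Boundary values are held fixed. Each sweep updates all "red" points
    ((i + j) even) first and then all "black" points ((i + j) odd); points of
    one color depend only on points of the other color.
    """
    rows, cols = len(grid), len(grid[0])
    for _ in range(sweeps):
        for color in (0, 1):
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 == color:
                        grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                             + grid[i][j - 1] + grid[i][j + 1])
    return grid


# Usage: a 5 x 5 plate with all edges held at 1.0 and a cold interior.
plate = [[1.0] * 5 for _ in range(5)]
for i in range(1, 4):
    for j in range(1, 4):
        plate[i][j] = 0.0
red_black_gauss_seidel(plate, sweeps=200)
```

Because all points of one color can be updated simultaneously, each half-sweep maps onto one parallel kernel launch, while still using freshly updated neighbor values, unlike a Jacobi sweep.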
Fast plane wave density functional theory molecular dynamics calculations on multi-GPU machines
International Nuclear Information System (INIS)
Jia, Weile; Fu, Jiyun; Cao, Zongyan; Wang, Long; Chi, Xuebin; Gao, Weiguo; Wang, Lin-Wang
2013-01-01
Plane wave pseudopotential (PWP) density functional theory (DFT) calculation is the most widely used method for material simulations, but its absolute speed has stagnated due to the inability to use large-scale CPU-based computers. By a drastic redesign of the algorithm, and by moving all the major computation parts onto the GPU, we have reached a speed of 12 s per molecular dynamics (MD) step for a 512-atom system using 256 GPU cards. This is about 20 times faster than the CPU version of the code regardless of the number of CPU cores used. Our tests and analysis on different GPU platforms and configurations shed light on the optimal GPU deployments for PWP-DFT calculations. An 1800-step MD simulation is used to study the liquid phase properties of GaInP.
GPU-Based Image Processing Use Cases: A High-Level Approach
Wieser, V.; Grelck, C.; Schöner, H.; Haslinger, P.; Bosa, K.; Moser, B.
2012-01-01
This paper addresses the gap between envisioned hardware-virtualized techniques for GPU programming and a conventional approach from the point of view of an application engineer taking software engineering aspects like maintainability, understandability and productivity, and resulting achieved gain
Simulation of disc-bulge-halo galaxies using parallel GPU based codes
Veles, O.; Berczik, P.; Just, A.
2016-02-01
We compare the performance of the very popular Tree-GPU code BONSAI with the older Particle-(Multi)Mesh code SUPERBOX. We run both codes on the same hardware, using GPU acceleration for the force calculation. SUPERBOX is a particle-mesh code with high resolution sub-grid and a higher order NGP (nearest grid point) force-calculation scheme. In our research, we aim to demonstrate that the new parallel version of SUPERBOX is capable of performing high-resolution simulations of the interaction of a system of disc-bulge-halo galaxies. We describe the improvement of performance and scalability of SUPERBOX, particularly for the Kepler cluster (NVIDIA K20 GPU). A comparison was made with the very popular and publicly available Tree-GPU code BONSAI.
Rohl, Sebastian; Bodenstedt, Sebastian; Suwelack, Stefan; Dillmann, Rudiger; Speidel, Stefanie; Kenngott, Hannes; Muller-Stich, Beat P
2012-03-01
In laparoscopic surgery, soft tissue deformations substantially change the surgical site, thus impeding the use of preoperative planning during intraoperative navigation. Extracting depth information from endoscopic images and building a surface model of the surgical field-of-view is one way to represent this constantly deforming environment. The information can then be used for intraoperative registration. Stereo reconstruction is a typical problem within computer vision. However, most of the available methods do not fulfill the specific requirements in a minimally invasive setting such as the need for real-time performance, the problem of view-dependent specular reflections and large curved areas with partly homogeneous or periodic textures and occlusions. In this paper, the authors present an approach toward intraoperative surface reconstruction based on stereo endoscopic images. The authors describe their answer to this problem through correspondence analysis, disparity correction and refinement, 3D reconstruction, point cloud smoothing and meshing. Real-time performance is achieved by implementing the algorithms on the GPU. The authors also present a new hybrid CPU-GPU algorithm that unifies the advantages of the CPU and the GPU version. In a comprehensive evaluation using in vivo data, in silico data from the literature and virtual data from a newly developed simulation environment, the CPU, the GPU, and the hybrid CPU-GPU versions of the surface reconstruction are compared to a CPU and a GPU algorithm from the literature. The recommended approach toward intraoperative surface reconstruction can be conducted in real-time depending on the image resolution (20 fps for the GPU and 14 fps for the hybrid CPU-GPU version at a resolution of 640 × 480). It is robust to homogeneous regions without texture, large image changes, noise or errors from camera calibration, and it reconstructs the surface down to sub-millimeter accuracy. In all the experiments within the
MASSIVELY PARALLEL LATENT SEMANTIC ANALYSES USING A GRAPHICS PROCESSING UNIT
Energy Technology Data Exchange (ETDEWEB)
Cavanagh, J.; Cui, S.
2009-01-01
Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. A graphics processing unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor or central processing unit (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a PC cluster. Due to the GPU’s application-specific architecture, harnessing the GPU’s computational prowess for LSA is a great challenge. We presented a parallel LSA implementation on the GPU, using NVIDIA® Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms software. The performance of this implementation is compared to traditional LSA implementation on a CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1 000 × 1 000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits of the GPU for matrices divisible by 16. It should be noted that the overall speeds for the CPU version did not vary from relative normal when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.
Comparison of GPU-Based Numerous Particles Simulation and Experiment
International Nuclear Information System (INIS)
Park, Sang Wook; Jun, Chul Woong; Sohn, Jeong Hyun; Lee, Jae Wook
2014-01-01
The dynamic behavior of numerous grains interacting with each other can be easily observed. In this study, this dynamic behavior was analyzed based on the contact between numerous grains. The discrete element method was used for analyzing the dynamic behavior of each particle and the neighboring-cell algorithm was employed for detecting their contact. The Hertzian and tangential sliding friction contact models were used for calculating the contact force acting between the particles. A GPU-based parallel program was developed for conducting the computer simulation and calculating the numerous contacts. The dam break experiment was performed to verify the simulation results. The reliability of the program was verified by comparing the results of the simulation with those of the experiment
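As a rough illustration of the contact models named in this abstract, the sketch below evaluates the textbook Hertzian normal force between two elastic spheres, F = (4/3) E* sqrt(R*) δ^(3/2), and a Coulomb-type sliding friction magnitude. The function names, parameter names and the Coulomb form of the friction law are assumptions made for illustration; the abstract does not state the exact expressions the authors used.

```python
import math


def hertz_normal_force(overlap, r1, r2, e1, e2, nu1, nu2):
    """Textbook Hertzian normal force between two elastic spheres.

    overlap : penetration depth (m); no force when the spheres do not touch
    r1, r2  : sphere radii (m)
    e1, e2  : Young's moduli (Pa)
    nu1, nu2: Poisson ratios
    """
    if overlap <= 0.0:
        return 0.0
    r_eff = r1 * r2 / (r1 + r2)  # effective radius R*
    e_eff = 1.0 / ((1.0 - nu1 ** 2) / e1 + (1.0 - nu2 ** 2) / e2)  # effective modulus E*
    return (4.0 / 3.0) * e_eff * math.sqrt(r_eff) * overlap ** 1.5


def sliding_friction_force(normal_force, mu):
    """Magnitude of a Coulomb sliding friction force (assumed model)."""
    return mu * normal_force
```

In a discrete element code, a function of this kind would be evaluated for every contact pair reported by the neighboring-cell search, which is the part that maps naturally onto one GPU thread per pair.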
GPU Accelerated DG-FDF Large Eddy Simulator
Inkarbekov, Medet; Aitzhan, Aidyn; Sammak, Shervin; Givi, Peyman; Kaltayev, Aidarkhan
2017-11-01
A GPU-accelerated simulator is developed and implemented for large eddy simulation (LES) of turbulent flows. The filtered density function (FDF) is utilized for modeling of the subgrid scale quantities. The filtered transport equations are solved via a discontinuous Galerkin (DG) method and the FDF is simulated via a particle-based Lagrangian Monte Carlo (MC) method. It is demonstrated that the GPU simulations are of the order of 100 times faster than the CPU-based calculations. This brings LES of turbulent flows to a new level, facilitating efficient simulation of more complex problems. The work at Al-Farabi Kazakh National University is sponsored by MoES of RK under Grant 3298/GF-4.
A Versatile and Efficient GPU Data Structure for Spatial Indexing.
Schneider, Jens; Rautek, Peter
2017-01-01
In this paper we present a novel GPU-based data structure for spatial indexing. Based on Fenwick trees, a special type of binary indexed tree, our data structure allows construction in linear time. Updates and prefixes can be computed in logarithmic time, whereas point queries require only constant time on average. Unlike competing data structures such as summed-area tables and spatial hashing, our data structure requires a constant amount of bits for each data element, and it offers unconstrained point queries. This property makes our data structure ideally suited for applications requiring unconstrained indexing of large data, such as block-storage of large and block-sparse volumes. Finally, we provide asymptotic bounds on both run-time and memory requirements, and we show applications for which our new data structure is useful.
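For readers unfamiliar with Fenwick trees, the following is a minimal one-dimensional CPU sketch of the classic structure the abstract builds on: linear-time construction, logarithmic-time updates and prefix sums. It is not the authors' GPU data structure, and the class and method names are chosen here for illustration only.

```python
class FenwickTree:
    """Classic 1-D Fenwick (binary indexed) tree over a list of numbers."""

    def __init__(self, values):
        # Linear-time construction: push each node's partial sum to its parent.
        self.n = len(values)
        self.tree = [0] + list(values)  # 1-based internal indexing
        for i in range(1, self.n + 1):
            parent = i + (i & -i)
            if parent <= self.n:
                self.tree[parent] += self.tree[i]

    def update(self, index, delta):
        """Add delta to the element at 0-based position index, in O(log n)."""
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, count):
        """Return the sum of the first count elements, in O(log n)."""
        total = 0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def point_query(self, index):
        """Recover a single element as the difference of two prefix sums."""
        return self.prefix_sum(index + 1) - self.prefix_sum(index)
```

The point query shown here costs two logarithmic prefix sums; the constant average time quoted in the abstract relies on the authors' own query scheme, which this sketch does not reproduce.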
A Versatile and Efficient GPU Data Structure for Spatial Indexing
Schneider, Jens
2016-08-10
In this paper we present a novel GPU-based data structure for spatial indexing. Based on Fenwick trees—a special type of binary indexed trees—our data structure allows construction in linear time. Updates and prefixes can be computed in logarithmic time, whereas point queries require only constant time on average. Unlike competing data structures such as summed-area tables and spatial hashing, our data structure requires a constant amount of bits for each data element, and it offers unconstrained point queries. This property makes our data structure ideally suited for applications requiring unconstrained indexing of large data, such as block-storage of large and block-sparse volumes. Finally, we provide asymptotic bounds on both run-time and memory requirements, and we show applications for which our new data structure is useful.
A GPU-based mipmapping method for water surface visualization
Li, Hua; Quan, Wei; Xu, Chao; Wu, Yan
2018-03-01
Visualization of water surfaces is a hot topic in computer graphics. In this paper, we present a fast method to generate a wide range of water surface with good image quality both near and far from the viewpoint. This method uses a uniform mesh and fractal Perlin noise to model the water surface. Mipmapping is applied to the surface textures, which adjusts the resolution with respect to the distance from the viewpoint and reduces the computing cost. Lighting effects are computed based on shadow mapping, Snell's law and the Fresnel term. The render pipeline uses a CPU-GPU shared memory structure, which improves the rendering efficiency. Experimental results show that our approach visualizes the water surface with good image quality at real-time frame rates.
Implementation of collisions on GPU architecture in the Vorpal code
Leddy, Jarrod; Averkin, Sergey; Cowan, Ben; Sides, Scott; Werner, Greg; Cary, John
2017-10-01
The Vorpal code contains a variety of collision operators allowing for the simulation of plasmas containing multiple charge species interacting with neutrals, background gas, and EM fields. These existing algorithms have been improved and reimplemented to take advantage of the massive parallelization allowed by GPU architecture. The use of GPUs is most effective when algorithms are single-instruction multiple-data, so particle collisions are an ideal candidate for this parallelization technique due to their nature as a series of independent processes with the same underlying operation. This refactoring required data memory reorganization and careful consideration of device/host data allocation to minimize memory access and data communication per operation. Successful implementation has resulted in an order of magnitude increase in simulation speed for a test-case involving multiple binary collisions using the null collision method. Work supported by DARPA under contract W31P4Q-16-C-0009.
Singular value decomposition for collaborative filtering on a GPU
Kato, Kimikazu; Hosino, Tikara
2010-06-01
Collaborative filtering predicts customers' unknown preferences from known preferences. In collaborative filtering computations, a singular value decomposition (SVD) is needed to reduce the size of a large-scale matrix so that the burden on the next phase of computation is decreased. In this application, SVD means a roughly approximated factorization of a given matrix into smaller matrices. Webb (a.k.a. Simon Funk) showed an effective algorithm to compute SVD toward a solution of an open competition called "Netflix Prize". The algorithm utilizes an iterative method so that the error of approximation improves in each step of the iteration. We give a GPU version of Webb's algorithm. Our algorithm is implemented in CUDA and is shown to be efficient by an experiment.
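The iterative factorization described here can be sketched on the CPU as regularized stochastic gradient descent over the known ratings. The version below is a simplified variant that updates all latent features together, whereas Funk's original procedure trained one feature at a time; the function name, the hyperparameters and the small positive initialization are assumptions for illustration, not values from the paper.

```python
import random


def funk_svd(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
             epochs=500, seed=0):
    """Approximate a sparse rating matrix as P @ Q^T by SGD.

    ratings: list of (user, item, rating) triples for the known entries.
    Returns the factor lists P, Q and the per-epoch training RMSE.
    """
    rng = random.Random(seed)
    p = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    q = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for u, i, r in ratings:
            pred = sum(pu * qi for pu, qi in zip(p[u], q[i]))
            err = r - pred
            sq_err += err * err
            for k in range(rank):
                pu, qi = p[u][k], q[i][k]
                # Gradient step on the squared error with L2 regularization.
                p[u][k] += lr * (err * qi - reg * pu)
                q[i][k] += lr * (err * pu - reg * qi)
        history.append((sq_err / len(ratings)) ** 0.5)
    return p, q, history
```

An unknown preference is then predicted as the dot product of the corresponding user and item rows. The inner loop over ratings is sequential in this sketch; how the updates are distributed across GPU threads is the contribution of the paper and is not shown here.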
Real-time GPU surface curvature estimation on deforming meshes and volumetric data sets.
Griffin, Wesley; Wang, Yu; Berrios, David; Olano, Marc
2012-10-01
Surface curvature is used in a number of areas in computer graphics, including texture synthesis and shape representation, mesh simplification, surface modeling, and nonphotorealistic line drawing. Most real-time applications must estimate curvature on a triangular mesh. This estimation has been limited to CPU algorithms, forcing object geometry to reside in main memory. However, as more computational work is done directly on the GPU, it is increasingly common for object geometry to exist only in GPU memory. Examples include vertex skinned animations and isosurfaces from GPU-based surface reconstruction algorithms. For static models, curvature can be precomputed and CPU algorithms are a reasonable choice. For deforming models where the geometry only resides on the GPU, transferring the deformed mesh back to the CPU limits performance. We introduce a GPU algorithm for estimating curvature in real time on arbitrary triangular meshes. We demonstrate our algorithm with curvature-based NPR feature lines and a curvature-based approximation for an ambient occlusion. We show curvature computation on volumetric data sets with a GPU isosurface extraction algorithm and vertex-skinned animations. We present a graphics pipeline and CUDA implementation. Our curvature estimation is up to ~18x faster than a multithreaded CPU benchmark.
Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George
2014-01-01
Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified
Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X George
2014-07-01
Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm
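Woodcock (delta) tracking, mentioned in this abstract, avoids computing boundary crossings in a voxelized phantom by sampling flight distances with a single majorant attenuation coefficient and rejecting "virtual" collisions. The following is a minimal one-dimensional sketch of the standard algorithm, not of either ARCHERRT variant; the function name and its parameters are hypothetical.

```python
import math
import random


def woodcock_collision_site(mu_at, mu_max, x_max, rng):
    """Sample the next real collision position along a ray starting at x = 0.

    mu_at : function returning the local attenuation coefficient at x
    mu_max: majorant, must satisfy mu_at(x) <= mu_max everywhere
    x_max : end of the geometry; None is returned if the particle escapes
    rng   : random.Random instance
    """
    x = 0.0
    while True:
        # Flight distance sampled as if the whole medium had coefficient mu_max.
        x += -math.log(1.0 - rng.random()) / mu_max
        if x >= x_max:
            return None
        # Accept as a real collision with probability mu(x) / mu_max,
        # otherwise treat it as a virtual collision and keep flying.
        if rng.random() < mu_at(x) / mu_max:
            return x
```

Because every particle executes the same loop regardless of which voxel it is in, this scheme tends to suit the lockstep execution model of GPUs.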
MrBayes tgMC³: a tight GPU implementation of MrBayes.
Directory of Open Access Journals (Sweden)
Cheng Ling
MrBayes is a model-based phylogenetic inference tool using Bayesian statistics. However, model-based assessment of phylogenetic trees adds to the computational burden of tree-searching, and so poses significant computational challenges. Graphics Processing Units (GPUs) have been proposed as high performance, low cost acceleration platforms and several parallelized versions of the Metropolis Coupled Markov Chain Monte Carlo (MC³) algorithm in MrBayes have been presented that can run on GPUs. However, some bottlenecks decrease the efficiency of these implementations. To address these bottlenecks, we propose a tight GPU MC³ (tgMC³) algorithm. tgMC³ implements a different architecture from the one-to-one acceleration architecture employed in previously proposed methods. It merges multiple discrete GPU kernels according to the data dependency and hence decreases the number of kernels launched and the complexity of data transfer. We implemented tgMC³ and made performance comparisons with an earlier proposed algorithm, nMC³, and also with MrBayes MC³ under serial and multiple concurrent CPU processes. All of the methods were benchmarked on the same computing node from DEGIMA. Experiments indicate that the tgMC³ method outstrips nMC³ (v1.0) with speedup factors from 2.1 to 2.7×. In addition, tgMC³ outperforms the serial MrBayes MC³ by a factor of 6 to 30× when using a single GTX 480 card, whereas a speedup factor of around 51× can be achieved by using two GTX 480 cards on relatively long sequences. Moreover, tgMC³ was compared with MrBayes accelerated by BEAGLE, and achieved speedup factors from 3.7 to 5.7×. The reported performance improvement of tgMC³ is significant and appears to scale well with increasing dataset sizes. In addition, the strategy proposed in tgMC³ could benefit the acceleration of other Bayesian-based phylogenetic analysis methods using GPUs.
Implementation and evaluation of various demons deformable image registration algorithms on a GPU
International Nuclear Information System (INIS)
Gu Xuejun; Pan Hubert; Liang Yun; Jiang, Steve B; Castillo, Richard; Yang Deshan; Choi, Dongju; Majumdar, Amitava; Castillo, Edward; Guerrero, Thomas
2010-01-01
Online adaptive radiation therapy (ART) promises the ability to deliver an optimal treatment in response to daily patient anatomic variation. A major technical barrier for the clinical implementation of online ART is the requirement of rapid image segmentation. Deformable image registration (DIR) has been used as an automated segmentation method to transfer tumor/organ contours from the planning image to daily images. However, the current computational time of DIR is insufficient for online ART. In this work, this issue is addressed by using computer graphics processing units (GPUs). A gray-scale-based DIR algorithm called demons and five of its variants were implemented on GPUs using the compute unified device architecture (CUDA) programming environment. The spatial accuracy of these algorithms was evaluated over five sets of pulmonary 4D CT images with an average size of 256 x 256 x 100 and more than 1100 expert-determined landmark point pairs each. For all the testing scenarios presented in this paper, the GPU-based DIR computation required around 7 to 11 s to yield an average 3D error ranging from 1.5 to 1.8 mm. It is interesting to find out that the original passive force demons algorithms outperform subsequently proposed variants based on the combination of accuracy, efficiency and ease of implementation.
High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP
Directory of Open Access Journals (Sweden)
Khaled Benkrid
2012-01-01
This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.
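The Smith-Waterman algorithm benchmarked in this study is a dynamic program over a score matrix, which is what each platform accelerates. A minimal scoring-only sketch with a linear gap penalty is given below; the scoring values are illustrative defaults, not those used in the paper, and the function name is chosen here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its upper, left and upper-left neighbors, so all cells on one anti-diagonal can be computed in parallel. That dependency pattern is what hardware implementations typically exploit.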
Implementation and evaluation of various demons deformable image registration algorithms on a GPU
Gu, Xuejun; Pan, Hubert; Liang, Yun; Castillo, Richard; Yang, Deshan; Choi, Dongju; Castillo, Edward; Majumdar, Amitava; Guerrero, Thomas; Jiang, Steve B.
2010-01-01
Online adaptive radiation therapy (ART) promises the ability to deliver an optimal treatment in response to daily patient anatomic variation. A major technical barrier for the clinical implementation of online ART is the requirement of rapid image segmentation. Deformable image registration (DIR) has been used as an automated segmentation method to transfer tumor/organ contours from the planning image to daily images. However, the current computational time of DIR is insufficient for online ART. In this work, this issue is addressed by using computer graphics processing units (GPUs). A gray-scale-based DIR algorithm called demons and five of its variants were implemented on GPUs using the compute unified device architecture (CUDA) programming environment. The spatial accuracy of these algorithms was evaluated over five sets of pulmonary 4D CT images with an average size of 256 × 256 × 100 and more than 1100 expert-determined landmark point pairs each. For all the testing scenarios presented in this paper, the GPU-based DIR computation required around 7 to 11 s to yield an average 3D error ranging from 1.5 to 1.8 mm. It is interesting to find out that the original passive force demons algorithms outperform subsequently proposed variants based on the combination of accuracy, efficiency and ease of implementation.
A fast - Monte Carlo toolkit on GPU for treatment plan dose recalculation in proton therapy
Senzacqua, M.; Schiavi, A.; Patera, V.; Pioli, S.; Battistoni, G.; Ciocca, M.; Mairani, A.; Magro, G.; Molinelli, S.
2017-10-01
In the context of particle therapy, a crucial role is played by Treatment Planning Systems (TPSs), tools aimed at computing and optimizing the treatment plan. Nowadays one of the major issues related to the TPS in particle therapy is the large CPU time needed. We developed a software toolkit (FRED) for reducing dose recalculation time by exploiting Graphics Processing Unit (GPU) hardware. Thanks to their high parallelization capability, GPUs significantly reduce the computation time, by up to a factor of 100 with respect to software running on a standard CPU. The transport of proton beams in the patient is accurately described through Monte Carlo methods. The physical processes reproduced are Multiple Coulomb Scattering, energy straggling and nuclear interactions of protons with the main nuclei composing the biological tissues. The FRED toolkit does not rely on the water-equivalent translation of tissues, but exploits the Computed Tomography anatomical information by reconstructing and simulating the atomic composition of each crossed tissue. FRED can be used as an efficient tool for dose recalculation on the day of the treatment. In fact, it can provide in about one minute on standard hardware the dose map obtained by combining the treatment plan, computed earlier by the TPS, and the current patient anatomic arrangement.
Near real-time digital holographic microscope based on GPU parallel computing
Zhu, Gang; Zhao, Zhixiong; Wang, Huarui; Yang, Yan
2018-01-01
A transmission near real-time digital holographic microscope with in-line and off-axis light paths is presented, in which parallel computing based on the compute unified device architecture (CUDA) is combined with digital holographic microscopy. Other holographic microscopes have to implement reconstruction in multiple focal planes, which is time-consuming. In contrast, the reconstruction speed of the near real-time digital holographic microscope can be greatly improved with CUDA-based parallel computing, so it is especially suitable for measurements of particle fields at micrometer and nanometer scale. Simulations and experiments show that the proposed transmission digital holographic microscope can accurately measure and display the velocity of a particle field at micrometer scale, and the average velocity error is lower than 10%. With graphics processing units (GPUs), the computing time for 100 reconstruction planes (512×512 grids) is lower than 120 ms, while it is 4.9 s using the traditional CPU reconstruction method. The reconstruction speed has been raised by a factor of 40. In other words, it can handle holograms at 8.3 frames per second, and near real-time measurement and display of the particle velocity field are realized. Real-time three-dimensional reconstruction of the particle velocity field is expected to be achieved through further optimization of software and hardware. Keywords: digital holographic microscope,
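The speed-up and frame rate quoted in this abstract follow directly from the two timings it reports. The short check below reproduces the arithmetic, taking the 120 ms figure as an upper bound on the GPU time, so both derived values are lower bounds; the variable names are chosen here for illustration.

```python
cpu_time_s = 4.9    # reported CPU time for 100 reconstruction planes
gpu_time_s = 0.120  # reported upper bound on the GPU time for the same work

speedup = cpu_time_s / gpu_time_s      # about 40.8, reported as 40 times
frames_per_second = 1.0 / gpu_time_s   # about 8.3 holograms per second
```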
Han, Myounghee; Kim, Kyunghun; Jang, Sun-Joo; Cho, Han Saem; Bouma, Brett E.; Oh, Wang-Yuhl; Ryu, Sukyoung
2015-01-01
Frequency domain optical coherence tomography (FD-OCT) has become one of the important clinical tools for intracoronary imaging to diagnose and monitor coronary artery disease, which has been one of the leading causes of death. To help more accurate diagnosis and monitoring of the disease, many researchers have recently worked on visualization of various coronary microscopic features including stent struts by constructing three-dimensional (3D) volumetric rendering from series of cross-sectional intracoronary FD-OCT images. In this paper, we present the first, to our knowledge, "push-of-a-button" graphics processing unit (GPU)-accelerated framework for intracoronary OCT imaging. Our framework visualizes 3D microstructures of the vessel wall with stent struts from raw binary OCT data acquired by the system digitizer as one seamless process. The framework reports the state-of-the-art performance; from raw OCT data, it takes 4.7 seconds to provide 3D visualization of a 5-cm-long coronary artery (of size 1600 samples x 1024 A-lines x 260 frames) with stent struts and detection of malapposition automatically at the single push of a button. PMID:25880375
Mermelstein, Daniel J; Lin, Charles; Nelson, Gard; Kretsch, Rachael; McCammon, J Andrew; Walker, Ross C
2018-03-12
Alchemical free energy (AFE) calculations based on molecular dynamics (MD) simulations are key tools in both improving our understanding of a wide variety of biological processes and accelerating the design and optimization of therapeutics for numerous diseases. Computing power and theory have, however, long been insufficient to enable AFE calculations to be routinely applied in early stage drug discovery. One of the major difficulties in performing AFE calculations is the length of time required for calculations to converge to an ensemble average. CPU implementations of MD-based free energy algorithms can effectively only reach tens of nanoseconds per day for systems on the order of 50,000 atoms, even running on massively parallel supercomputers. Therefore, converged free energy calculations on large numbers of potential lead compounds are often untenable, preventing researchers from gaining crucial insight into molecular recognition, potential druggability and other crucial areas of interest. Graphics Processing Units (GPUs) can help address this. We present here a seamless GPU implementation, within the PMEMD module of the AMBER molecular dynamics package, of thermodynamic integration (TI) capable of reaching speeds of >140 ns/day for a 44,907-atom system, with accuracy equivalent to the existing CPU implementation in AMBER. The implementation described here is currently part of the AMBER 18 beta code and will be an integral part of the upcoming version 18 release of AMBER. © 2018 Wiley Periodicals, Inc.
Smooth Particle Hydrodynamics GPU-Acceleration Tool for Asteroid Fragmentation Simulation
Buruchenko, Sergey K.; Schäfer, Christoph M.; Maindl, Thomas I.
2017-10-01
The impact threat of near-Earth objects (NEOs) is a concern to the global community, as evidenced by the Chelyabinsk event (caused by a 17-m meteorite) in Russia on February 15, 2013 and a near miss by asteroid 2012 DA14 (30 m diameter) on the same day. The expected energy, from either a low-altitude air burst or direct impact, would have severe consequences, especially in populated regions. One method of mitigating this threat is the use of large kinetic-energy impactors (KEIs). The simulation of asteroid target fragmentation is a challenging task which demands efficient and accurate numerical methods with large computational power. Modern graphics processing units (GPUs) increase the performance of astrophysical and high-velocity impact computations by a factor of 10 or more. The paper presents a new implementation of the smooth particle hydrodynamics (SPH) numerical method on NVIDIA GPUs and the first astrophysical and high-velocity application of the new code. The code allows for a tremendous increase in speed of astrophysical simulations with SPH and self-gravity at low costs for new hardware. We have implemented the SPH equations to model gas, liquids, and elastic and plastic solid bodies, and added a fragmentation model for brittle materials. Self-gravity may be optionally included in the simulations.
Parallel, distributed and GPU computing technologies in single-particle electron microscopy
International Nuclear Information System (INIS)
Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger
2009-01-01
An introduction to the current paradigm shift towards concurrency in software. Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined
Holovideo: Real-time 3D range video encoding and decoding on GPU
Karpinsky, Nikolaus; Zhang, Song
2012-02-01
We present a 3D video-encoding technique called Holovideo that is capable of encoding high-resolution 3D videos into standard 2D videos, and then decoding the 2D videos back into 3D rapidly without significant loss of quality. Due to the nature of the algorithm, 2D video compression such as JPEG encoding with QuickTime Run Length Encoding (QTRLE) can be applied with little quality loss, resulting in an effective way to store 3D video at very small file sizes. We found that under a compression ratio of 134:1, Holovideo to OBJ file format, the 3D geometry quality drops at a negligible level. Several sets of 3D videos were captured using a structured light scanner, compressed using the Holovideo codec, and then uncompressed and displayed to demonstrate the effectiveness of the codec. With the use of OpenGL Shaders (GLSL), the 3D video codec can encode and decode in real time. We demonstrated that for a video size of 512×512, the decoding speed is 28 frames per second (FPS) with a laptop computer using an embedded NVIDIA GeForce 9400M graphics processing unit (GPU). Encoding can be done with this same setup at 18 FPS, making this technology suitable for applications such as interactive 3D video games and 3D video conferencing.
Improvement of radiation dose estimation due to nuclear accidents using deep neural network and GPU
Energy Technology Data Exchange (ETDEWEB)
Desterro, Filipe S.M.; Almeida, Adino A.H.; Pereira, Claudio M.N.A., E-mail: filipesantana18@gmail.com, E-mail: adino@ien.gov.br, E-mail: cmcoelho@ien.gov.br [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil)
2017-07-01
Recently, the use of mobile devices has been proposed for dose assessment during nuclear accidents. The idea is to support field teams, providing an approximate estimation of the dose distribution map in the vicinity of the nuclear power plant (NPP), without needing to be connected to the NPP systems. In order to provide such stand-alone execution, the use of artificial neural networks (ANNs) has been proposed as a substitute for the complex and time-consuming physical models executed by the atmospheric dispersion radionuclide (ADR) system. One limitation observed in such an approach is the very time-consuming training of the ANNs. Moreover, if the number of input parameters increases, the performance of standard ANNs, like the Multilayer Perceptron (MLP) with backpropagation training, is affected, leading to unreasonable training time. To improve learning, allowing better dose estimations, more complex ANN architectures are required. ANNs with many layers (many more than the typical number of layers), referred to as Deep Neural Networks (DNNs), for example, have been shown to achieve better results. On the other hand, the training of such ANNs is very slow. In order to allow the use of such DNNs with a reasonable training time, a parallel programming solution using Graphics Processing Units (GPUs) and Compute Unified Device Architecture (CUDA) is proposed. This work focuses on the study of computational technologies for improvement of the ANNs to be used in the mobile application, as well as their training algorithms. (author)
GPU Enhancement of the Trigger to Extend Physics Reach at the LHC
Halyo, V; Jindal, P; LeGresley, P; Lujan, P
2013-01-01
Significant new challenges are continuously confronting the High Energy Physics (HEP) experiments, in particular the two detectors at the Large Hadron Collider (LHC) at CERN, where nominal conditions deliver proton-proton collisions to the detectors at a rate of 40 MHz. This rate must be significantly reduced to comply with both the performance limitations of the mass storage hardware and the capabilities of the computing resources to process the collected data in a timely fashion for physics analysis. At the same time, the physics signals of interest must be retained with high efficiency. The quest for rare new physics phenomena at the LHC leads us to evaluate a Graphics Processing Unit (GPU) enhancement of the existing High-Level Trigger (HLT), made possible by the current flexibility of the trigger system, which not only provides faster and more efficient event selection, but also includes the possibility of new complex triggers that were not previously feasible. A new tracking algorithm is evaluated on a ...
Fast k-NNG construction with GPU-based quick multi-select.
Directory of Open Access Journals (Sweden)
Ivan Komarov
In this paper, we describe a new brute force algorithm for building the k-Nearest Neighbor Graph (k-NNG). The k-NNG algorithm has many applications in areas such as machine learning, bio-informatics, and clustering analysis. While there are very efficient algorithms for data of low dimensions, for high dimensional data the brute force search is the best algorithm. There are two main parts to the algorithm: the first part is finding the distances between the input vectors, which may be formulated as a matrix multiplication problem; the second is the selection of the k-NNs for each of the query vectors. For the second part, we describe a novel graphics processing unit (GPU)-based multi-select algorithm based on quick sort. Our optimization makes clever use of warp voting functions available on the latest GPUs along with user-controlled cache. Benchmarks show significant improvement over state-of-the-art implementations of the k-NN search on GPUs.
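The two parts named in the abstract (all-pairs distances, then selection of the k nearest neighbors per query) can be illustrated with a small CPU-only sketch. This is not the authors' implementation: the function name `knn_graph` is hypothetical, and `heapq.nsmallest` from the Python standard library stands in for the GPU quick multi-select, which is not reproduced here.

```python
import heapq

def knn_graph(points, k):
    """Brute-force k-NNG: for each point, the indices of its k nearest neighbours."""
    n = len(points)
    # Part 1: all-pairs squared Euclidean distances (a matrix product on the GPU).
    dist = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    graph = []
    for i in range(n):
        # Part 2: select the k smallest distances for query i, excluding i itself.
        # Ties are broken by index so the result is deterministic.
        candidates = (j for j in range(n) if j != i)
        graph.append(heapq.nsmallest(k, candidates, key=lambda j: (dist[i][j], j)))
    return graph

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, k=2)
```

Selection rather than a full sort is the point of the second part: only the k smallest entries per row are needed, so the remaining entries never have to be ordered.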
Improvement of radiation dose estimation due to nuclear accidents using deep neural network and GPU
International Nuclear Information System (INIS)
Desterro, Filipe S.M.; Almeida, Adino A.H.; Pereira, Claudio M.N.A.
2017-01-01
Recently, the use of mobile devices has been proposed for dose assessment during nuclear accidents. The idea is to support field teams by providing an approximate estimate of the dose distribution map in the vicinity of the nuclear power plant (NPP), without needing to be connected to the NPP systems. To enable such stand-alone execution, artificial neural networks (ANNs) have been proposed as a substitute for the complex and time-consuming physical models executed by the atmospheric dispersion radionuclide (ADR) system. One limitation observed in this approach is the very time-consuming training of the ANNs. Moreover, as the number of input parameters increases, the performance of standard ANNs, such as the Multilayer Perceptron (MLP) with backpropagation training, is affected, leading to unreasonable training times. To improve learning and allow better dose estimates, more complex ANN architectures are required. ANNs with many layers (many more than the typical number), referred to as Deep Neural Networks (DNNs), have been demonstrated to achieve better results. On the other hand, training such ANNs is very slow. To allow the use of such DNNs within a reasonable training time, a parallel programming solution using Graphics Processing Units (GPUs) and the Compute Unified Device Architecture (CUDA) is proposed. This work focuses on the study of computational technologies for improving the ANNs to be used in the mobile application, as well as their training algorithms. (author)
Energy Technology Data Exchange (ETDEWEB)
Song, Chenchen; Martínez, Todd J. [Department of Chemistry and the PULSE Institute, Stanford University, Stanford, California 94305 (United States); SLAC National Accelerator Laboratory, Menlo Park, California 94025 (United States)
2016-05-07
We present a tensor hypercontracted (THC) scaled opposite spin second order Møller-Plesset perturbation theory (SOS-MP2) method. By using THC, we reduce the formal scaling of SOS-MP2 with respect to molecular size from quartic to cubic. We achieve further efficiency by exploiting sparsity in the atomic orbitals and using graphical processing units (GPUs) to accelerate integral construction and matrix multiplication. The practical scaling of GPU-accelerated atomic orbital-based THC-SOS-MP2 calculations is found to be N^2.6 for reference data sets of water clusters and alanine polypeptides containing up to 1600 basis functions. The errors in correlation energy with respect to density-fitting-SOS-MP2 are less than 0.5 kcal/mol for all systems tested (up to 162 atoms).
Directory of Open Access Journals (Sweden)
Ali Dashti
This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large high-dimensional data cloud. The proposed method uses Graphics Processing Units (GPUs) and is scalable with multi-levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible [Formula: see text]-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.
Exploiting graphics processing units for computational biology and bioinformatics.
Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H
2010-09-01
Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of general-purpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems
International Nuclear Information System (INIS)
Xiao, K; Chen, D. Z; Hu, X. S; Zhou, B
2014-01-01
Purpose: It is well known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which lead to several memory efficiency issues on GPUs, such as uncoalesced writes and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves an overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition accumulates dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer in GPU memory, which achieves fully coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on the CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Compared with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained a 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF
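The three-step split in the Methods (buffered writes, transfer, CPU-side accumulation) can be mimicked in a few lines of sequential Python. This is only a sketch under stated assumptions: the function names and the list-of-(voxel, dose) ray format are hypothetical, a plain list copy stands in for the device-to-host transfer, and the pipelined overlap between streams is not modelled.

```python
def trace_rays(rays):
    """Step 1 stand-in: append every (voxel, dose) deposit to a buffer.

    Writes go to consecutive buffer slots, so no two writers ever
    update the same voxel and no atomic accumulation is needed here.
    """
    buffer = []
    for ray in rays:
        for voxel, dose in ray:
            buffer.append((voxel, dose))
    return buffer

def accumulate(volume, host_buffer):
    """Step 3 stand-in: scatter-add the buffered deposits into the dose volume."""
    for voxel, dose in host_buffer:
        volume[voxel] += dose

def process_streams(streams, n_voxels):
    volume = [0.0] * n_voxels
    for rays in streams:
        device_buffer = trace_rays(rays)    # step 1
        host_buffer = list(device_buffer)   # step 2 (copy stands in for the transfer)
        accumulate(volume, host_buffer)     # step 3
    return volume

# Two streams; each ray is a list of (voxel index, dose) deposits.
streams = [
    [[(0, 1.0), (1, 0.5)], [(1, 0.25)]],
    [[(2, 2.0), (0, 1.0)]],
]
volume = process_streams(streams, n_voxels=3)
```

The random-access scatter-add is confined to `accumulate`, which is the part the abstract assigns to the CPU; the buffer writes in `trace_rays` are strictly sequential.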
Energy Technology Data Exchange (ETDEWEB)
Dong, Tingzing Tim [University of Tennessee (UT); Tomov, Stanimire Z [ORNL; Luszczek, Piotr R [ORNL; Dongarra, Jack J [ORNL
2015-01-01
As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.
Fast GPU-based computation of spatial multigrid multiframe LMEM for PET.
Nassiri, Moulay Ali; Carrier, Jean-François; Després, Philippe
2015-09-01
Significant efforts were invested during the last decade to accelerate PET list-mode reconstructions, notably with GPU devices. However, the computation time per event is still relatively long, and the list-mode efficiency on the GPU is well below the histogram-mode efficiency. Since list-mode data are not arranged in any regular pattern, costly accesses to the GPU global memory can hardly be optimized and geometrical symmetries cannot be used. To overcome obstacles that limit the acceleration of reconstruction from list-mode on the GPU, a multigrid and multiframe approach of an expectation-maximization algorithm was developed. The reconstruction process is started during data acquisition, and calculations are executed concurrently on the GPU and the CPU, while the system matrix is computed on-the-fly. A new convergence criterion was also introduced, which is computationally more efficient on the GPU. The implementation was tested on a Tesla C2050 GPU device for a Gemini GXL PET system geometry. The results show that the proposed algorithm (multigrid and multiframe list-mode expectation-maximization, MGMF-LMEM) converges to the same solution as the LMEM algorithm more than three times faster. The execution time of the MGMF-LMEM algorithm was 1.1 s per million events on the Tesla C2050 hardware used, for a reconstructed space of 188 x 188 x 57 voxels of 2 x 2 x 3.15 mm^3. For 17- and 22-mm simulated hot lesions, the MGMF-LMEM algorithm led on the first iteration to contrast recovery coefficients (CRC) of more than 75% of the maximum CRC while achieving a minimum in the relative mean square error. Therefore, the MGMF-LMEM algorithm can be used as a one-pass method to perform real-time reconstructions for low-count acquisitions, as in list-mode gated studies. The computation time for one iteration and 60 million events was approximately 66 s.
GPU-accelerated track reconstruction in the ALICE High Level Trigger
Rohr, David; Gorbunov, Sergey; Lindenstruth, Volker; ALICE Collaboration
2017-10-01
ALICE (A Large Ion Collider Experiment) is one of the four major experiments at the Large Hadron Collider (LHC) at CERN. The High Level Trigger (HLT) is an online compute farm which reconstructs events measured by the ALICE detector in real-time. The most compute-intensive part is the reconstruction of particle trajectories, called tracking, and the most important detector for tracking is the Time Projection Chamber (TPC). The HLT uses a GPU-accelerated algorithm for TPC tracking that is based on the Cellular Automaton principle and on the Kalman filter. The GPU tracking has been running in 24/7 operation since 2012 in LHC Run 1 and 2. In order to better leverage the potential of the GPUs, and speed up the overall HLT reconstruction, we plan to bring more reconstruction steps (e.g. the tracking for other detectors) onto the GPUs. There are several tasks running so far on the CPU that could benefit from cooperation with the tracking, which is hardly feasible at the moment due to the delay of the PCI Express transfers. Moving more steps onto the GPU, and processing them on the GPU at once, will reduce PCI Express transfers and free up CPU resources. On top of that, modern GPUs and GPU programming APIs provide new features which are not yet exploited by the TPC tracking. We present our new developments for GPU reconstruction, both with a focus on the online reconstruction on GPU for the online-offline computing upgrade in ALICE during LHC Run 3, and also taking into account how the current HLT in Run 2 can profit from these improvements.
Energy Technology Data Exchange (ETDEWEB)
Kim, Kyungsang; Ye, Jong Chul, E-mail: jong.ye@kaist.ac.kr [Bio Imaging and Signal Processing Laboratory, Department of Bio and Brain Engineering, KAIST 291, Daehak-ro, Yuseong-gu, Daejeon 34141 (Korea, Republic of); Lee, Taewon; Cho, Seungryong [Medical Imaging and Radiotherapeutics Laboratory, Department of Nuclear and Quantum Engineering, KAIST 291, Daehak-ro, Yuseong-gu, Daejeon 34141 (Korea, Republic of); Seong, Younghun; Lee, Jongha; Jang, Kwang Eun [Samsung Advanced Institute of Technology, Samsung Electronics, 130, Samsung-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, 443-803 (Korea, Republic of); Choi, Jaegu; Choi, Young Wook [Korea Electrotechnology Research Institute (KERI), 111, Hanggaul-ro, Sangnok-gu, Ansan-si, Gyeonggi-do, 426-170 (Korea, Republic of); Kim, Hak Hee; Shin, Hee Jung; Cha, Joo Hee [Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, 88 Olympic-ro, 43-gil, Songpa-gu, Seoul, 138-736 (Korea, Republic of)
2015-09-15
Purpose: In digital breast tomosynthesis (DBT), scatter correction is highly desirable, as it improves image quality at low doses. Because the DBT detector panel is typically stationary during the source rotation, antiscatter grids are not generally compatible with DBT; thus, a software-based scatter correction is required. This work proposes a fully iterative scatter correction method that uses a novel fast Monte Carlo simulation (MCS) with a tissue-composition ratio estimation technique for DBT imaging. Methods: To apply MCS to scatter estimation, the material composition in each voxel should be known. To overcome the lack of prior accurate knowledge of tissue composition for DBT, a tissue-composition ratio is estimated based on the observation that the breast tissues are principally composed of adipose and glandular tissues. Using this approximation, the composition ratio can be estimated from the reconstructed attenuation coefficients, and the scatter distribution can then be estimated by MCS using the composition ratio. The scatter estimation and image reconstruction procedures can be performed iteratively until an acceptable accuracy is achieved. For practical use, (i) the authors have implemented a fast MCS using a graphics processing unit (GPU), (ii) the MCS is simplified to transport only x-rays in the energy range of 10–50 keV, modeling Rayleigh and Compton scattering and the photoelectric effect using the tissue-composition ratio of adipose and glandular tissues, and (iii) downsampling is used because the scatter distribution varies rather smoothly. Results: The authors have demonstrated that the proposed method can accurately estimate the scatter distribution, and that the contrast-to-noise ratio of the final reconstructed image is significantly improved. The authors validated the performance of the MCS by changing the tissue thickness, composition ratio, and x-ray energy. The authors confirmed that the tissue-composition ratio estimation was quite
Xie, Chunliang; Gong, Wenbing; Yang, Qi; Zhu, Zuohua; Yan, Li; Hu, Zhenxiu; Peng, Yuande
2017-11-01
White-rot fungi combined with alkaline/oxidative (A/O) pretreatments of industrial hemp woody core were proposed to improve enzymatic saccharification. In this study, hemp woody core was treated with white-rot fungi only, with A/O only, and with a combination of the two methods. The results showed that Pleurotus eryngii (P. eryngii) was the most effective fungus for pretreatment: the reducing sugar yield was 329 mg/g with a cellulase loading of 30 Filter Paper Units (FPU)/g after 21 days of treatment. In the A/O groups, treatment with 3% NaOH and 3% H2O2 gave a reducing sugar yield of 288 mg/g with a cellulase loading of 30 FPU/g. After combined pretreatment with P. eryngii and A/O, the reducing sugar yield from enzymatic hydrolysis increased 1.10-1.29-fold relative to that of the bio-treated or A/O-pretreated samples under the same conditions, suggesting that P. eryngii combined with A/O pretreatment is an effective method to improve enzymatic hydrolysis.
Processes for pretreating lignocellulosic biomass: A review
Energy Technology Data Exchange (ETDEWEB)
McMillan, J.D.
1992-11-01
This paper reviews existing and proposed pretreatment processes for biomass. The focus is on the mechanisms by which the various pretreatments act and the influence of biomass structure and composition on the efficacy of particular pretreatment techniques. This analysis is used to identify pretreatment technologies and issues that warrant further research.
Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
Energy Technology Data Exchange (ETDEWEB)
Shi, Xuanhua; Luo, Xuan; Liang, Junling; Zhao, Peng; Di, Sheng; He, Bingsheng; Jin, Hai
2018-01-01
GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, a coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of the large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weight asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on the Pareto principle (or 80-20 rule) applied to coloring algorithms, as we observed across a large number of real-world graph coloring cases. We find that a majority of vertices (about 80%) are colored with only a few colors, such that they can be read and updated with a very high degree of parallelism without violating sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP and CC). On all the tested applications and
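The idea of separating vertices by color distribution can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the Frog implementation: the greedy coloring, the function names, and the rule of keeping the `n_parallel_colors` most populated colors as conflict-free groups while routing the remaining vertices to a separate pass.

```python
from collections import Counter

def greedy_coloring(adj):
    """Give each vertex the smallest color not used by an already colored neighbour."""
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def hybrid_partition(adj, n_parallel_colors):
    """Split vertices into conflict-free color groups plus a leftover set."""
    color = greedy_coloring(adj)
    counts = Counter(color.values())
    # Most populated colors first; ties broken by color id for determinism.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    top = ranked[:n_parallel_colors]
    # Vertices sharing a color have no edge between them, so each group
    # could be updated concurrently without locks.
    parallel = {c: sorted(v for v in adj if color[v] == c) for c in top}
    # Vertices in the rarely used colors are handled in a separate pass.
    rest = sorted(v for v in adj if color[v] not in parallel)
    return color, parallel, rest

# Small undirected graph as an adjacency list (hypothetical example data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
color, parallel, rest = hybrid_partition(adj, n_parallel_colors=2)
```

On this toy graph, four of the five vertices fall into the two largest color groups and one vertex is left over, which mirrors the skewed color distribution the abstract describes.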
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
2011-07-01
In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and full grid search (FS) for finding the optimal block displacement. In this evaluation we compared the execution time of a GPU and a CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer and 1000 for a non-integer search grid. The additional speedup for the non-integer search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU-based motion estimation methods, namely the implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and the Simplified Unsymmetrical multi-Hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed modest improvement even though the computational complexity of the FS GPU implementation is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
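Full-search block matching with the SAD criterion on an integer grid can be written compactly on the CPU. The following is a sequential sketch, not the multi-GPU code evaluated in the abstract; the function names, the list-of-lists image format and the test frames are assumptions chosen for illustration.

```python
def sad(ref, cur, bx, by, dx, dy, size):
    """Summed absolute difference between a block in `cur` and a shifted block in `ref`."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def full_search(ref, cur, bx, by, size, radius):
    """Try every integer displacement within `radius`; return (best SAD, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip displacements whose reference block would leave the image.
            if by + dy < 0 or bx + dx < 0 or by + dy + size > h or bx + dx + size > w:
                continue
            cost = sad(ref, cur, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

def make_frame(px, py, side=6):
    """Black frame with a bright 2x2 patch whose top-left corner is (px, py)."""
    frame = [[0] * side for _ in range(side)]
    for y in range(py, py + 2):
        for x in range(px, px + 2):
            frame[y][x] = 9
    return frame

ref = make_frame(1, 1)   # previous frame
cur = make_frame(2, 2)   # current frame: patch moved by (+1, +1)
best = full_search(ref, cur, bx=2, by=2, size=2, radius=1)
```

Every candidate displacement is evaluated independently of the others, which is why full search maps naturally onto many GPU threads despite its high arithmetic cost.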
Kim, Sun Min; Dien, Bruce S; Singh, Vijay
2016-01-01
Production of advanced biofuels from woody and herbaceous feedstocks is moving into commercialization. Biomass needs to be pretreated to overcome the physicochemical properties that hinder enzyme accessibility and impede the conversion of plant cell walls to fermentable sugars. Pretreatment also remains one of the most costly unit operations in the process and among the most critical, because it is the source of chemicals that inhibit enzymes and microorganisms and largely determines enzyme loading and sugar yields. Pretreatments are categorized into hydrothermal (aqueous)/chemical, physical, and biological pretreatments, the mechanistic details of which are briefly outlined in this review. To leverage the synergistic effects of different pretreatment methods, conducting two or more pretreatments consecutively has gained attention. In particular, combining hydrothermal/chemical pretreatment with mechanical refining, a type of physical pretreatment, has the potential to be applied in an industrial plant. Here, the effects of the combined pretreatment (combined hydrothermal/chemical pretreatment and mechanical refining) on energy consumption, physical structure, sugar yields, and enzyme dosage are summarized.
Computing OpenSURF on OpenCL and General Purpose GPU
Directory of Open Access Journals (Sweden)
Wanglong Yan
2013-10-01
The Speeded-Up Robust Feature (SURF) algorithm is widely used for image feature detection and matching in computer vision. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper introduces how to implement an open-sourced SURF program, namely OpenSURF, on general purpose GPU by OpenCL, and discusses the optimizations in terms of the thread architectures and memory models in detail. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia's GTX660 and GTX460SE GPUs, respectively. Our OpenCL program achieved real-time performance (>25 Frames Per Second) for almost all the input images with different sizes from 320*240 to 1024*768 on NVidia's GTX660 GPU, NVidia's GTX460SE GPU and AMD's Radeon HD 6850 GPU. Our OpenCL approach on NVidia's GTX660 GPU is more than 22.8 times faster than its original CPU version on Intel's Dual-Core E5400 2.7G on average.
GPU-based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.
Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd
2016-11-07
In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92% (CPU) to 96% (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.
A GPU-Accelerated Approach for Feature Tracking in Time-Varying Imagery Datasets.
Peng, Chao; Sahani, Sandip; Rushing, John
2017-10-01
We propose a novel parallel connected component labeling (CCL) algorithm along with efficient out-of-core data management to detect and track feature regions of large time-varying imagery datasets. Our approach contributes to the big data field with parallel algorithms tailored for GPU architectures. We remove the data dependency between frames and achieve pixel-level parallelism. Due to the large size, the entire dataset cannot fit into cached memory. Frames have to be streamed through the memory hierarchy (disk to CPU main memory and then to GPU memory), partitioned, and processed as batches, where each batch is small enough to fit into the GPU. To reconnect the feature regions that are separated due to data partitioning, we present a novel batch merging algorithm to extract the region connection information across multiple batches in a parallel fashion. The information is organized in a memory-efficient structure and supports fast indexing on the GPU. Our experiment uses a commodity workstation equipped with a single GPU. The results show that our approach can efficiently process a weather dataset composed of terabytes of time-varying radar images. The advantages of our approach are demonstrated by comparing to the performance of an efficient CPU cluster implementation which is being used by the weather scientists.
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
Directory of Open Access Journals (Sweden)
Hamed Kargaran
2016-04-01
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, the combination of the middle-square method and a chaotic map along with the Xorshift PRNG has been employed. Implementation of our developed PPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs such as MATLAB, FORTRAN and the Miller-Park algorithm by employing specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
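The independent sequence method gives each parallel worker its own generator state so that no state is shared between workers. A minimal sketch follows, using the standard 32-bit Xorshift step (shifts 13, 17, 5). The per-worker seeding by a multiplicative constant is a hypothetical placeholder and is not the middle-square and chaotic-map seeding described in the abstract; the function names are likewise illustrative.

```python
MASK32 = 0xFFFFFFFF

def xorshift32(state):
    """One step of the 32-bit Xorshift generator; state must be nonzero."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state

def make_streams(n_workers, n_draws):
    """One independent sequence of uniform values in (0, 1) per worker."""
    streams = []
    for wid in range(n_workers):
        # Hypothetical seeding: spread worker ids over the 32-bit range and
        # fall back to 1 if the result were ever zero.
        state = ((wid + 1) * 2654435761) & MASK32 or 1
        draws = []
        for _ in range(n_draws):
            state = xorshift32(state)
            draws.append(state / 2**32)
        streams.append(draws)
    return streams

streams = make_streams(n_workers=4, n_draws=8)
```

On a GPU each worker would correspond to a thread holding its state in registers or shared memory; the quality of the seeding determines how well separated the sequences are, which is why the abstract devotes a dedicated scheme to it.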
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
Energy Technology Data Exchange (ETDEWEB)
Kargaran, Hamed, E-mail: h-kargaran@sbu.ac.ir; Minuchehr, Abdolhamid; Zolfaghari, Ahmad [Department of nuclear engineering, Shahid Behesti University, Tehran, 1983969411 (Iran, Islamic Republic of)
2016-04-15
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL-MODE and SHARED-MODE. To generate parallel random numbers based on the independent sequence method, the combination of the middle-square method and a chaotic map along with the Xorshift PRNG has been employed. Implementation of our developed PPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of PRNG on a single CPU core) for GLOBAL-MODE and SHARED-MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs such as MATLAB, FORTRAN and the Miller-Park algorithm by employing specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
Energy Technology Data Exchange (ETDEWEB)
Dong, Han; Sharma, Diksha; Badano, Aldo, E-mail: aldo.badano@fda.hhs.gov [Division of Imaging, Diagnostics, and Software Reliability, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland 20993 (United States)
2014-12-15
Purpose: Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridMANTIS, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webMANTIS and visualMANTIS, to facilitate the setup of computational experiments via hybridMANTIS. Methods: The visualization tools visualMANTIS and webMANTIS enable the user to control simulation properties through a user interface. In the case of webMANTIS, control via a web browser allows access through mobile devices such as smartphones or tablets. webMANTIS acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. Results: The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridMANTIS. The users can download the output images and statistics through a zip file for future reference. In addition, webMANTIS provides a visualization window that displays a few selected optical photon paths as they get transported through the detector columns and allows the user to trace the history of the optical photons. Conclusions: The visualization tools visualMANTIS and webMANTIS provide features such as on-the-fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying
Dong, Han; Sharma, Diksha; Badano, Aldo
2014-12-01
Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridmantis, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webmantis and visualmantis to facilitate the setup of computational experiments via hybridmantis. The visualization tools visualmantis and webmantis enable the user to control simulation properties through a user interface. In the case of webmantis, control via a web browser allows access through mobile devices such as smartphones or tablets. webmantis acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridmantis. The users can download the output images and statistics through a zip file for future reference. In addition, webmantis provides a visualization window that displays a few selected optical photon path as they get transported through the detector columns and allows the user to trace the history of the optical photons. The visualization tools visualmantis and webmantis provide features such as on the fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying input parameters to receiving visual
International Nuclear Information System (INIS)
Dong, Han; Sharma, Diksha; Badano, Aldo
2014-01-01
Purpose: Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridMANTIS, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webMANTIS and visualMANTIS to facilitate the setup of computational experiments via hybridMANTIS. Methods: The visualization tools visualMANTIS and webMANTIS enable the user to control simulation properties through a user interface. In the case of webMANTIS, control via a web browser allows access through mobile devices such as smartphones or tablets. webMANTIS acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. Results: The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridMANTIS. The users can download the output images and statistics through a zip file for future reference. In addition, webMANTIS provides a visualization window that displays a few selected optical photon path as they get transported through the detector columns and allows the user to trace the history of the optical photons. Conclusions: The visualization tools visualMANTIS and webMANTIS provide features such as on the fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying
Castaño-Díez, Daniel
2017-06-01
Dynamo is a package for the processing of tomographic data. As a tool for subtomogram averaging, it includes different alignment and classification strategies. Furthermore, its data-management module allows experiments to be organized in groups of tomograms, while offering specialized three-dimensional tomographic browsers that facilitate visualization, location of regions of interest, modelling and particle extraction in complex geometries. Here, a technical description of the package is presented, focusing on its diverse strategies for optimizing computing performance. Dynamo is built upon mbtools (middle layer toolbox), a general-purpose MATLAB library for object-oriented scientific programming specifically developed to underpin Dynamo but usable as an independent tool. Its structure intertwines a flexible MATLAB codebase with precompiled C++ functions that carry the burden of numerically intensive operations. The package can be delivered as a precompiled standalone ready for execution without a MATLAB license. Multicore parallelization on a single node is directly inherited from the high-level parallelization engine provided for MATLAB, automatically imparting a balanced workload among the threads in computationally intense tasks such as alignment and classification, but also in logistic-oriented tasks such as tomogram binning and particle extraction. Dynamo supports the use of graphical processing units (GPUs), yielding considerable speedup factors both for native Dynamo procedures (such as the numerically intensive subtomogram alignment) and procedures defined by the user through its MATLAB-based GPU library for three-dimensional operations. Cloud-based virtual computing environments supplied with a pre-installed version of Dynamo can be publicly accessed through the Amazon Elastic Compute Cloud (EC2), enabling users to rent GPU computing time on a pay-as-you-go basis, thus avoiding upfront investments in hardware and longterm software maintenance.
Optimizing memory-bound SYMV kernel on GPU hardware accelerators
Abdelfattah, Ahmad
2013-01-01
Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improve productivity, while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on NVIDIA Fermi GPUs. Due to its inherent memory-bound nature, this kernel is very critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step to calculate the eigenpairs. Using a novel design to address the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the similar CUBLAS 4.0 kernel, and 7-8% and 30% improvement over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library in single and double precision arithmetics, respectively. © 2013 Springer-Verlag.
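To illustrate why the symmetric matrix-vector product (SYMV) described above is memory-bound, the following is a minimal serial sketch in Python, not the authors' GPU kernel. It reads only the lower triangle of the matrix, so each off-diagonal entry is loaded once and used twice. The function name `symv_lower` and the sample data are assumptions made for this illustration.

```python
def symv_lower(a, x):
    """Compute y = A @ x for a symmetric A, reading only the lower triangle of `a`.

    Illustrative serial sketch (hypothetical helper, not the paper's kernel).
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        # The diagonal entry contributes to row i only.
        y[i] += a[i][i] * x[i]
        for j in range(i):
            # Each off-diagonal entry a[i][j] is loaded once and used twice:
            # once for row i and, by symmetry, once for row j.
            y[i] += a[i][j] * x[j]
            y[j] += a[i][j] * x[i]
    return y


# The upper triangle is deliberately left as zeros; it is never read.
a = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 2.0, 3.0]
print(symv_lower(a, x))  # [16.0, 22.0, 32.0]
```

The update of `y[j]` from inside row `i` is what makes the access pattern irregular: in a parallel setting, different rows write to the same output entries, which is one reason a GPU version needs a more careful design than a plain general matrix-vector product.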
A GPU Reaction Diffusion Soil-Microbial Model
Falconer, Ruth; Houston, Alasdair; Schmidt, Sonja; Otten, Wilfred
2014-05-01
Parallelised algorithms are frequent in bioinformatics as a consequence of the close link to informatics; in the field of soil science and ecology, however, they are less prevalent. A current challenge in soil ecology is to link habitat structure to microbial dynamics. Soil science is therefore entering the 'big data' paradigm as a consequence of integrating data pertinent to the physical soil environment, obtained via imaging, and theoretical models describing growth and development of microbial dynamics, permitting accurate analyses of spatio-temporal properties of different soil microenvironments. The microenvironment is often captured by 3D imaging (CT tomography), which yields large datasets; when used in computational studies, the physical sizes of the samples that are amenable to computation are less than 1 cm³. Today's commodity graphics cards are programmable and possess a data parallel architecture that in many cases is capable of out-performing the CPU in terms of computational rates. The programmable aspect is achieved via a low-level parallel programming language (CUDA, OpenCL and DirectX). We ported a Soil-Microbial Model onto the GPU using the DirectX Compute API. We noted a significant computational speed up as well as an increase in the physical size that can be simulated. Some of the drawbacks of such an approach were concerned with numerical precision and the steep learning curve associated with GPGPU technologies.
Linear solver performance in elastoplastic problem solution on GPU cluster
Khalevitsky, Yu. V.; Konovalov, A. V.; Burmasheva, N. V.; Partin, A. S.
2017-12-01
Applying the finite element method to severe plastic deformation problems involves solving linear equation systems. While the solution procedure is relatively hard to parallelize and computationally intensive by itself, a long series of large scale systems need to be solved for each problem. When dealing with fine computational meshes, such as in the simulations of three-dimensional metal matrix composite microvolume deformation, tens and hundreds of hours may be needed to complete the whole solution procedure, even using modern supercomputers. In general, one of the preconditioned Krylov subspace methods is used in a linear solver for such problems. The method convergence highly depends on the operator spectrum of a problem stiffness matrix. In order to choose the appropriate method, a series of computational experiments is used. Different methods may be preferable for different computational systems for the same problem. In this paper we present experimental data obtained by solving linear equation systems from an elastoplastic problem on a GPU cluster. The data can be used to substantiate the choice of the appropriate method for a linear solver to use in severe plastic deformation simulations.
International Nuclear Information System (INIS)
Setiani, Tia Dwi; Suprijadi; Haryanto, Freddy
2016-01-01
Monte Carlo (MC) is one of the most powerful techniques for simulation in x-ray imaging. The MC method can simulate radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the MC-based codes widely used for simulating radiographic images is MC-GPU, a code developed by Andreu Badal. This study aimed to investigate the computation time of x-ray imaging simulation on a GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters on the quality of radiographic images and the comparison of image quality resulting from simulation on the GPU and CPU are evaluated in this paper. The simulations were run serially on a CPU and on two GPUs with 384 cores and 2304 cores. In the GPU simulations, each core calculates one photon, so a large number of photons are calculated simultaneously. Results show that the simulations on the GPU were significantly accelerated compared to the CPU. The simulations on the 2304-core GPU ran about 64-114 times faster than on the CPU, while the simulations on the 384-core GPU ran about 20-31 times faster than on a single CPU core. Another result shows that optimum image quality was obtained with the number of histories starting from 10^8 and energies from 60 keV to 90 keV. Statistical analysis shows that the quality of the GPU and CPU images is approximately the same.
Numerical Integration with Graphical Processing Unit for QKD Simulation
2014-03-27
33 NUMERICAL INTEGRATION WITH GRAPHICAL PROCESSING UNIT FOR QKD SIMULATION Virginia R. Garrett, B.S.E.E. Captain, USAF Approved: //signed// Douglas ...17] B. Nelson, R. Kirby, and R. Haimes, “Gpu-based volume visualization from high-order finite element fields,” IEEE Transactions on Visualization and...Intel i7-3610QM CPU. 15. SUBJECT TERMS Software Engineering, GPU Programming, Numerical Methods, Quantum Key Distribution U U U UU 74 Dr. Douglas Hodson, AFIT/ENG (937) 785-3636 x4719
Significantly reducing registration time in IGRT using graphics processing units
DEFF Research Database (Denmark)
Noe, Karsten Østergaard; Denis de Senneville, Baudouin; Tanderup, Kari
2008-01-01
Purpose/Objective For online IGRT, rapid image processing is needed. Fast parallel computations using graphics processing units (GPUs) have recently been made more accessible through general purpose programming interfaces. We present a GPU implementation of the Horn and Schunck method...... respiration phases in a free breathing volunteer and 41 anatomical landmark points in each image series. The registration method used is a multi-resolution GPU implementation of the 3D Horn and Schunck algorithm. It is based on the CUDA framework from Nvidia. Results On an Intel Core 2 CPU at 2.4GHz each...... significant reduction of the registration time due to the parallelized architecture of the GPU. Considering the slice spacing we find the registration result acceptable. The accuracy is comparable to previous results for the Demons algorithm in the POPI model (Vandemeulenboucke et al, ICCR 2007...
Graphics Processing Unit Enhanced Parallel Document Flocking Clustering
Energy Technology Data Exchange (ETDEWEB)
Cui, Xiaohui [ORNL; Potok, Thomas E [ORNL; ST Charles, Jesse Lee [ORNL
2010-01-01
Analyzing and clustering documents is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. One limitation of this method of document clustering is its complexity, O(n^2). As the number of documents grows, it becomes increasingly difficult to generate results in a reasonable amount of time. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. In this paper, we have conducted research to exploit this architecture and apply its strengths to the flocking based document clustering problem. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GEFORCE GPU. Performance gains ranged from thirty-six to nearly sixty times improvement of the GPU over the CPU implementation.
Tank Focus Area pretreatment activities
International Nuclear Information System (INIS)
McGinnis, C.P.; Welch, T.D.; Manke, K.L.
1997-01-01
Plans call for the high-level wastes to be retrieved from the tanks and immobilized in a stable waste form suitable for long-term isolation. Chemistry and chemical engineering operations are required to retrieve the wastes, to condition the wastes for subsequent steps, and to reduce the costs of the waste management enterprise. Pretreatment includes those processes between retrieval and immobilization, and includes preparation of suitable feed material for immobilization and separations to partition the waste into streams that yield lower life-cycle costs. Some of the technologies being developed by the Tank Focus Area (TFA) to process these wastes are described. These technologies fall roughly into three areas: (1) solid/liquid separation (SLS), (2) sludge pretreatment, and (3) supernate pretreatment
Design Patterns for Sparse-Matrix Computations on Hybrid CPU/GPU Platforms
Directory of Open Access Journals (Sweden)
Valeria Cardellini
2014-01-01
We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs which converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix–vector multiplication on two platforms exploiting the GPU with that obtained by a CPU-only PSBLAS implementation. Our experiments exhibit encouraging results regarding the comparison between CPU and GPU executions in double precision, obtaining a speedup of up to 35.35 on NVIDIA GTX 285 with respect to AMD Athlon 7750, and up to 10.15 on NVIDIA Tesla C2050 with respect to Intel Xeon X5650.
High performance technique for database applications using a hybrid GPU/CPU platform
Zidan, Mohammed A.
2012-07-28
Many database applications, such as sequence comparing, sequence searching, and sequence matching, process large database sequences. We introduce a novel and efficient technique to improve the performance of database applications by using a hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm. The experimental results show that our hybrid GPU/CPU technique improves the average performance by a factor of 2.2, and improves the peak performance by a factor of 2.8 when compared to earlier implementations. Copyright © 2011 by ASME.
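For readers unfamiliar with the Smith-Waterman algorithm named in the abstract above, the following is a minimal serial sketch in Python of its scoring recurrence. It is not the hybrid GPU/CPU technique of the paper. The scoring values (match = 2, mismatch = -1, linear gap = -1) and the function name `smith_waterman_score` are assumptions chosen for illustration.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of s and t (linear gap penalty).

    Illustrative sketch with assumed scoring parameters.
    """
    rows, cols = len(s) + 1, len(t) + 1
    # h[i][j] is the best score of a local alignment ending at s[i-1], t[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))  # 8 (the shared "ACGT")
```

Each cell depends only on its left, upper, and upper-left neighbours, so the cells along one anti-diagonal are independent of each other. With short sequences those anti-diagonals are short, which is consistent with the low GPU efficiency on short-length sequences that the abstract mentions.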
A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions
Exl, Lukas
2017-12-01
An efficient solver for the three dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
2014-01-01
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
3D Data Denoising via Nonlocal Means Filter by Using Parallel GPU Strategies
Cuomo, Salvatore; De Michele, Pasquale; Piccialli, Francesco
2014-01-01
The Nonlocal Means (NLM) algorithm is widely considered a state-of-the-art denoising filter in many research fields. Its high computational complexity has led researchers to develop parallel programming approaches and to use massively parallel architectures such as GPUs. In recent years, GPU devices have made it possible to achieve reasonable running times by filtering 3D datasets slice-by-slice with a 2D NLM algorithm. In our approach we design and implement a fully 3D NonLocal Means parallel approach, adopting different algorithm mapping strategies on GPU architecture and multi-GPU framework, in order to demonstrate its high applicability and scalability. The experimental results we obtained encourage the usability of our approach in a large spectrum of applicative scenarios such as magnetic resonance imaging (MRI) or video sequence denoising. PMID:25045397
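The core idea of the Nonlocal Means filter mentioned above can be sketched in a few lines. The following Python example works on a 1D signal for brevity, whereas the paper addresses fully 3D data on GPUs; the function name `nlm_1d`, the border clamping, and the parameter defaults are assumptions made for this illustration only.

```python
import math


def nlm_1d(signal, patch_radius=1, search_radius=3, h=0.5):
    """Denoise a 1D signal with a basic Nonlocal Means filter.

    Illustrative sketch: each output sample is a weighted average of nearby
    samples, weighted by how similar their surrounding patches are.
    """
    n = len(signal)

    def patch(center):
        # Clamp indices at the borders so every patch has the same length.
        return [signal[min(max(center + k, 0), n - 1)]
                for k in range(-patch_radius, patch_radius + 1)]

    out = []
    for i in range(n):
        p_i = patch(i)
        weight_sum = 0.0
        acc = 0.0
        lo = max(0, i - search_radius)
        hi = min(n, i + search_radius + 1)
        for j in range(lo, hi):
            p_j = patch(j)
            # Mean squared difference between the two patches.
            d2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j)) / len(p_i)
            w = math.exp(-d2 / (h * h))
            weight_sum += w
            acc += w * signal[j]
        out.append(acc / weight_sum)
    return out


noisy = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0]
print(nlm_1d(noisy))
```

Every output sample is computed independently from the input, which is what makes the filter a natural fit for one-thread-per-voxel GPU mappings; the cost grows with the product of the search window size and the patch size, and both grow cubically with their radii in 3D.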
Hazard Analysis for the Pretreatment Engineering Platform (PEP)
Energy Technology Data Exchange (ETDEWEB)
Sullivan, Robin S.; Geeting, John GH; Lawrence, Wesley E.; Young, Jonathan
2008-07-10
The Pretreatment Engineering Platform (PEP) is designed to perform a demonstration on an engineering scale to confirm the Hanford Waste Treatment Plant Pretreatment Facility (PTF) leaching and filtration process equipment design and sludge treatment process. The system will use scaled prototypic equipment to demonstrate sludge water wash, caustic leaching, oxidative leaching, and filtration. Unit operations to be tested include pumping, solids washing, chemical reagent addition and blending, heating, cooling, leaching, filtration, and filter cleaning. In addition, the PEP will evaluate potential design changes to the ultrafiltration process system equipment to potentially enhance leaching and filtration performance as well as overall pretreatment throughput. The skid-mounted system will be installed and operated in the Processing Development Laboratory-West at Pacific Northwest National Laboratory (PNNL) in Richland, Washington.
Finite Difference Time-Domain Modelling of Metamaterials: GPU Implementation of Cylindrical Cloak
Directory of Open Access Journals (Sweden)
A. Dawood
2013-08-01
The finite difference time-domain (FDTD) technique can be used to model metamaterials by treating them as dispersive material. The Drude or Lorentz model can be incorporated into the standard FDTD algorithm for modelling negative permittivity and permeability. The FDTD algorithm is readily parallelisable and can take advantage of GPU acceleration to achieve speed-ups of 5x-50x depending on hardware setup. Metamaterial scattering problems are implemented using the dispersive FDTD technique on GPU, resulting in a performance gain of 10x-15x compared to a conventional CPU implementation.
Brain MR Image Restoration Using an Automatic Trilateral Filter With GPU-Based Acceleration.
Chang, Herng-Hua; Li, Cheng-Yuan; Gallogly, Audrey Haihong
2018-02-01
Noise reduction in brain magnetic resonance (MR) images has been a challenging and demanding task. This study develops a new trilateral filter that aims to achieve robust and efficient image restoration. Extended from the bilateral filter, the proposed algorithm contains one additional intensity similarity function, which compensates for the unique characteristics of noise in brain MR images. An entropy function adaptive to intensity variations is introduced to regulate the contributions of the weighting components. To hasten the computation, parallel computing based on the graphics processing unit (GPU) strategy is explored with emphasis on memory allocations and thread distributions. To automate the filtration, image texture feature analysis associated with machine learning is investigated. Among the 98 candidate features, the sequential forward floating selection scheme is employed to acquire the optimal texture features for regularization. Subsequently, a two-stage classifier that consists of support vector machines and artificial neural networks is established to predict the filter parameters for automation. A speedup gain of 757 was reached to process an entire MR image volume of 256 × 256 × 256 pixels, which completed within 0.5 s. Automatic restoration results revealed high accuracy with an ensemble average relative error of 0.53 ± 0.85% in terms of the peak signal-to-noise ratio. This self-regulating trilateral filter outperformed many state-of-the-art noise reduction methods both qualitatively and quantitatively. We believe that this new image restoration algorithm has potential in many brain MR image processing applications that require expedition and automation.
Mobile Devices and GPU Parallelism in Ionospheric Data Processing
Mascharka, D.; Pankratius, V.
2015-12-01
Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million datapoints in real-time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be interactively adjusted by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
GPU efficient SAR image despeckling using mixed norms
Özcan, Caner; Şen, Baha; Nar, Fatih
2014-10-01
Speckle noise, which is inherent to Synthetic Aperture Radar (SAR) imaging, obstructs various image exploitation tasks such as edge detection, segmentation, change detection, and target recognition. Therefore, speckle reduction is generally used as a first step, which has to smooth out homogeneous regions while preserving edges and point scatterers. Traditional speckle reduction methods are fast and their memory consumption is insignificant. However, they are good either at smoothing homogeneous regions or at preserving edges and point scatterers, but not both. State-of-the-art despeckling methods have been proposed to overcome this trade-off. However, they introduce another trade-off between denoising quality and resource consumption: higher denoising quality requires higher computational load and/or memory consumption. In this paper, a local pixel-based total variation (TV) approach is proposed, which combines the l2-norm and l1-norm in order to improve despeckling quality while keeping execution times reasonably short. The pixel-based approach allows an efficient computation model with relatively low memory consumption. Parallel implementations of pixel-based approaches are also more efficient than those of global TV approaches, which generally require the numerical solution of sparse linear systems. However, pixel-based approaches are frequently trapped in local minima, hence their despeckling quality is worse than that of global TV approaches. The proposed method, namely mixed norm despeckling (MND), combines the l2-norm and l1-norm in order to improve despeckling performance by alleviating the local minima problem. All steps of the MND are parallelized using OpenMP on CPU and CUDA on GPU. Speckle reduction performance, execution time and memory consumption of the proposed method are shown using synthetic images and TerraSAR-X spot mode SAR images.
GPU-based relative fuzzy connectedness image segmentation
Zhuge, Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.
2013-01-01
Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA’s Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology. PMID:23298094