Graphics processing unit (GPU) real-time infrared scene generation
Christie, Chad L.; Gouthas, Efthimios (Themie); Williams, Owen M.
2007-04-01
VIRSuite, the GPU-based suite of software tools developed at DSTO for real-time infrared scene generation, is described. The tools include the painting of scene objects with radiometrically-associated colours, translucent object generation, polar plot validation and versatile scene generation. Special features include radiometric scaling within the GPU and the presence of zoom anti-aliasing at the core of VIRSuite. Extension of the zoom anti-aliasing construct to cover target embedding and the treatment of translucent objects is described.
Software Graphics Processing Unit (sGPU) for Deep Space Applications
McCabe, Mary; Salazar, George; Steele, Glen
2015-01-01
A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggests they cannot operate in the deep space radiation environment. Investigation into a Software Graphics Processing Unit (sGPU) composed of commercial-equivalent radiation-hardened/tolerant single-board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.
permGPU: Using graphics processing units in RNA microarray association studies
Directory of Open Access Journals (Sweden)
George Stephen L
2010-06-01
Background: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. Results: We have developed a CUDA-based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. Conclusions: permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
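The key claim above is that permutation resampling is embarrassingly parallel: every permutation can be evaluated independently. The sketch below illustrates that pattern in CUDA with one thread per permutation of a two-group mean-difference statistic for a single gene. It is not permGPU's actual code; the statistic, the host-side label shuffling and all names are illustrative assumptions.

```cuda
// Hypothetical sketch of one-thread-per-permutation resampling (not permGPU).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

__global__ void permStats(const float* expr,    // n expression values (one gene)
                          const int*   labels,  // P x n permuted 0/1 labels
                          float* stat, int n, int P)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;  // one thread = one permutation
    if (p >= P) return;
    float s1 = 0.0f, s0 = 0.0f;
    int n1 = 0;
    for (int i = 0; i < n; ++i) {
        if (labels[p * n + i]) { s1 += expr[i]; ++n1; }
        else                   { s0 += expr[i]; }
    }
    stat[p] = s1 / n1 - s0 / (n - n1);  // mean difference under permutation p
}

int main()
{
    const int n = 64, P = 1 << 16;
    std::vector<float> expr(n);
    std::vector<int> base(n, 0), perms((size_t)P * n);
    for (int i = 0; i < n; ++i) expr[i] = (float)(i % 7);       // toy data
    std::fill(base.begin(), base.begin() + n / 2, 1);           // balanced two-group design
    std::mt19937 rng(42);
    for (int p = 0; p < P; ++p) {                               // permute labels on the host
        std::shuffle(base.begin(), base.end(), rng);
        std::copy(base.begin(), base.end(), perms.begin() + (size_t)p * n);
    }
    float *dExpr, *dStat; int *dLab;
    cudaMalloc(&dExpr, n * sizeof(float));
    cudaMalloc(&dLab,  perms.size() * sizeof(int));
    cudaMalloc(&dStat, P * sizeof(float));
    cudaMemcpy(dExpr, expr.data(),  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dLab,  perms.data(), perms.size() * sizeof(int), cudaMemcpyHostToDevice);
    permStats<<<(P + 255) / 256, 256>>>(dExpr, dLab, dStat, n, P);
    std::vector<float> stat(P);
    cudaMemcpy(stat.data(), dStat, P * sizeof(float), cudaMemcpyDeviceToHost);
    printf("stat[0] = %f\n", stat[0]);
    cudaFree(dExpr); cudaFree(dLab); cudaFree(dStat);
    return 0;
}
```

A real permutation p-value would then be the fraction of permuted statistics exceeding the observed one in magnitude; permGPU additionally supports six test statistics and censored traits, which this toy omits.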
2017-08-01
used for its GPU computing capability during the experiment. It has NVIDIA Tesla K40 GPU accelerators, with 32 GPU nodes consisting of 1024 cores. CUDA is a parallel computing platform and application programming interface (API) model that was created and designed by NVIDIA to give direct access to the GPU.
Le, Anh H.; Park, Young W.; Ma, Kevin; Jacobs, Colin; Liu, Brent J.
2010-03-01
Multiple Sclerosis (MS) is a progressive neurological disease affecting myelin pathways in the brain. Multiple lesions in the white matter can cause paralysis and severe motor disabilities of the affected patient. To solve the issue of inconsistency and user-dependency in manual lesion measurement of MRI, we have proposed a 3-D automated lesion quantification algorithm to enable objective and efficient lesion volume tracking. The computer-aided detection (CAD) of MS, written in MATLAB, utilizes the K-Nearest Neighbors (KNN) method to compute the probability of lesions on a per-voxel basis. Despite the highly optimized image-processing algorithms used in CAD development, MS CAD integration and evaluation in clinical workflow is technically challenging due to the high computation rates and memory bandwidth required by the recursive nature of the algorithm. In this paper, we present the development and evaluation of a computing engine in the graphical processing unit (GPU) with MATLAB for segmentation of MS lesions. The paper investigates the utilization of a high-end GPU for parallel computing of KNN in the MATLAB environment to improve algorithm performance. The integration is accomplished using NVIDIA's CUDA developmental toolkit for MATLAB. The results of this study will validate the practicality and effectiveness of the prototype MS CAD in a clinical setting. The GPU method may allow MS CAD to integrate rapidly into an electronic patient record or any disease-centric health care system.
GPU applications for data processing
Energy Technology Data Exchange (ETDEWEB)
Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); Aleksandrov, Andrey [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); INFN sezione di Napoli, I-80125 Napoli (Italy); Tioukov, Valeri [INFN sezione di Napoli, I-80125 Napoli (Italy)
2015-12-31
Modern experiments that use nuclear photoemulsion require fast and efficient data acquisition from the emulsion. The new approaches in developing scanning systems require real-time processing of large amounts of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how GPU-accelerated emulsion processing helped us to raise the scanning speed by a factor of nine.
Seismic Shot Processing on GPU
Johansen, Owe
2009-01-01
Today's petroleum industry demands an ever-increasing amount of computational resources. Seismic processing applications in use by these types of companies have generally been using large clusters of compute nodes, whose only computing resource has been the CPU. However, using Graphics Processing Units (GPU) for general-purpose programming is these days becoming increasingly popular in the high-performance computing area. In 2007, NVIDIA corporation launched their framework for develo...
Tsuchimoto, Masashi; Tanimura, Yoshitaka
2015-08-11
A system with many energy states coupled to a harmonic oscillator bath is considered. To study quantum non-Markovian system-bath dynamics numerically rigorously and nonperturbatively, we developed a computer code for the reduced hierarchy equations of motion (HEOM) for a graphics processing unit (GPU) that can treat systems with as many as 4096 energy states. The code employs a Padé spectrum decomposition (PSD) for the construction of the HEOM and exponential integrators. Dynamics of a quantum spin glass system are studied by calculating the free induction decay signal for the cases of 3 × 2 to 3 × 4 triangular lattices with antiferromagnetic interactions. We found that spins relax faster at lower temperature due to transitions through a quantum coherent state, as represented by the off-diagonal elements of the reduced density matrix, whereas it is known that in the classical case spins relax more slowly due to suppression of thermal activation. The decay of the spins is qualitatively similar regardless of the lattice size. The pathway of spin relaxation is analyzed under a sudden temperature drop condition. The Compute Unified Device Architecture (CUDA) based source code used in the present calculations is provided as Supporting Information.
Schmidt, J.; Piret, C.; Zhang, N.; Kadlec, B. J.; Liu, Y.; Yuen, D. A.; Wright, G. B.; Sevre, E. O.
2008-12-01
The faster growth curves in the speed of GPUs relative to CPUs in recent years and their rapidly gained popularity have spawned a new area of development in computational technology. There is much potential in utilizing GPUs for solving evolutionary partial differential equations and producing the attendant visualization. We are concerned with modeling tsunami waves, where computational time is of extreme essence for broadcasting warnings. In order to test the efficacy of the GPU on the set of shallow-water equations, we employed the NVIDIA 8600M GT board on a MacBook Pro. We have compared the relative speeds between the CPU and the GPU on a single processor for two types of spatial discretization based on second-order finite differences and radial basis functions (RBFs). RBFs are a more novel method based on a gridless, multi-scale, adaptive framework. Using the NVIDIA 8600M GT, we obtained a speed-up factor of 8 in favor of the GPU for the finite-difference method and a factor of 7 for the RBF scheme. We have also studied the atmospheric dynamics problem of swirling flows over a spherical surface and found a speed-up of 5.3 using the GPU. The time steps employed for the RBF method are larger than those used in finite differences, because RBF requires far fewer nodal points. Thus, in modeling the same physical time, RBF acting in concert with the GPU would be the fastest way to go.
Putnam, William M.
2011-01-01
Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breathtaking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading-edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system-resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions.
Energy Technology Data Exchange (ETDEWEB)
Choi, Sunghoon, E-mail: choi.sh@yonsei.ac.kr [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Lee, Seungwan [Department of Radiological Science, College of Medical Science, Konyang University, 158 Gwanjeodong-ro, Daejeon, 308-812 (Korea, Republic of); Lee, Haenghwa [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Lee, Donghoon; Choi, Seungyeon [Department of Radiation Convergence Engineering, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Shin, Jungwook [LISTEM Corporation, 94 Donghwagongdan-ro, Munmak-eup, Wonju (Korea, Republic of); Seo, Chang-Woo [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Kim, Hee-Joung, E-mail: hjk1@yonsei.ac.kr [Department of Radiological Science, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of); Department of Radiation Convergence Engineering, College of Health Science, Yonsei University, 1 Yonseidae-gil, Wonju, Gangwon-do 220-710 (Korea, Republic of)
2017-03-11
Digital tomosynthesis offers the advantage of low radiation doses compared to conventional computed tomography (CT) by utilizing small numbers of projections (~80) acquired over a limited angular range. It produces 3D volumetric data, although there are artifacts due to incomplete sampling. Based upon these characteristics, we developed a prototype digital tomosynthesis R/F system for applications in chest imaging. Our prototype chest digital tomosynthesis (CDT) R/F system contains an X-ray tube with a high-power R/F pulse generator, flat-panel detector, R/F table, electromechanical radiographic subsystems including a precise motor controller, and a reconstruction server. For image reconstruction, users select between analytic and iterative reconstruction methods. Our reconstructed images of Catphan700 and LUNGMAN phantoms clearly and rapidly described the internal structures of phantoms using graphics processing unit (GPU) programming. Contrast-to-noise ratio (CNR) values of the CTP682 module of Catphan700 were higher in images using a simultaneous algebraic reconstruction technique (SART) than in those using filtered back-projection (FBP) for all materials by factors of 2.60, 3.78, 5.50, 2.30, 3.70, and 2.52 for air, lung foam, low density polyethylene (LDPE), Delrin® (acetal homopolymer resin), bone 50% (hydroxyapatite), and Teflon, respectively. Total elapsed times for producing 3D volume were 2.92 s and 86.29 s on average for FBP and SART (20 iterations), respectively. The times required for reconstruction were clinically feasible. Moreover, the total radiation dose from our system (5.68 mGy) was lower than that of a conventional chest CT scan. Consequently, our prototype tomosynthesis R/F system represents an important advance in digital tomosynthesis applications.
Fourtakas, G.; Rogers, B. D.
2016-06-01
A two-phase numerical model using Smoothed Particle Hydrodynamics (SPH) is applied to two-phase liquid-sediment flows. The absence of a mesh in SPH is ideal for interfacial and highly non-linear flows with changing fragmentation of the interface, mixing and resuspension. The rheology of sediment induced under rapid flows undergoes several states which are only partially described by previous research in SPH. This paper attempts to bridge the gap between geotechnics and non-Newtonian and Newtonian flows by proposing a model that combines the yielding, shear and suspension layers which are needed to predict accurately the global erosion phenomena from a hydrodynamics perspective. The numerical SPH scheme is based on the explicit treatment of both phases using Newtonian and non-Newtonian Bingham-type Herschel-Bulkley-Papanastasiou constitutive models. This is supplemented by the Drucker-Prager yield criterion to predict the onset of yielding of the sediment surface and a concentration suspension model. The multi-phase model has been compared with experimental and 2-D reference numerical models for scour following a dry-bed dam break, yielding satisfactory results and improvements over well-known SPH multi-phase models. With 3-D simulations requiring a large number of particles, the code is accelerated with a graphics processing unit (GPU) in the open-source DualSPHysics code. The implementation and optimisation of the code achieved a speed-up of 58× over an optimised single-thread serial code. A 3-D dam break over a non-cohesive erodible bed simulation with over 4 million particles yields close agreement with experimental scour and water surface profiles.
GPU: the biggest key processor for AI and parallel processing
Baji, Toru
2017-07-01
Two types of processors exist in the market. One is the conventional CPU and the other is the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavy parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs adopted general-purpose cores, it was noticed that this architecture could be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, the Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs came to be used widely in workstations and supercomputers. Recently, two key technologies are highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massively parallel operations to train many-layer neural networks. With a CPU alone, it was impossible to finish the training in a practical time; the latest multi-GPU system with the P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization and path planning, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, one of the biggest commercial devices requiring state-of-the-art fabrication technology, is introduced, together with an overview of key GPU-demanding applications such as the ones described above.
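As a concrete illustration of the CUDA programming model the paper credits for the GPU's spread into workstations and supercomputers, here is a minimal, self-contained kernel (a SAXPY, not taken from the paper) in which each of roughly a million threads handles one array element:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// y = a*x + y, one element per thread: the massively parallel style CUDA enables.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory keeps the demo short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);  // ~4096 blocks of 256 threads
    cudaDeviceSynchronize();
    printf("y[0] = %.1f (expect 5.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```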
Medical image processing on the GPU - past, present and future.
Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M
2013-12-01
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
Zhmurov, A; Dima, R I; Kholodov, Y; Barsegov, V
2010-11-01
Theoretical exploration of fundamental biological processes involving the forced unraveling of multimeric proteins, the sliding motion in protein fibers and the mechanical deformation of biomolecular assemblies under physiological force loads is challenging even for distributed computing systems. Using a Cα-based coarse-grained self-organized polymer (SOP) model, we implemented Langevin simulations of proteins on graphics processing units (the SOP-GPU program). We assessed the computational performance of an end-to-end application of the program, where all the steps of the algorithm are running on a GPU, by profiling the simulation time and memory usage for a number of test systems. The ~90-fold computational speedup on a GPU, compared with an optimized central processing unit program, enabled us to follow the dynamics on the centisecond timescale, and to obtain the force-extension profiles using experimental pulling speeds (v_f = 1-10 μm/s) employed in atomic force microscopy and in optical tweezers-based dynamic force spectroscopy. We found that the mechanical molecular response critically depends on the conditions of force application and that the kinetics and pathways for unfolding change drastically even upon a modest 10-fold increase in v_f. This implies that, to resolve accurately the free energy landscape and to relate the results of single-molecule experiments in vitro and in silico, molecular simulations should be carried out under the experimentally relevant force loads. This can be accomplished in reasonable wall-clock time for biomolecules as large as 10^5 residues using the SOP-GPU package.
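The core of such Langevin simulations is an update step in which every particle moves under its force plus Gaussian thermal noise, and each particle is independent within a step. Below is a hedged sketch of that pattern (an overdamped Langevin update, one thread per residue) using the cuRAND device API; it is not the SOP-GPU code, and the force array, parameter values and names are assumptions.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

__global__ void initRng(curandState* st, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &st[i]);  // independent stream per particle
}

// Overdamped Langevin step: dr = (F/gamma) dt + sqrt(2 kT dt / gamma) * N(0,1)
__global__ void langevinStep(float3* r, const float3* f, curandState* st,
                             float dt, float gamma, float kT, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState s = st[i];
    float c     = dt / gamma;
    float sigma = sqrtf(2.0f * kT * dt / gamma);
    r[i].x += c * f[i].x + sigma * curand_normal(&s);
    r[i].y += c * f[i].y + sigma * curand_normal(&s);
    r[i].z += c * f[i].z + sigma * curand_normal(&s);
    st[i] = s;  // persist RNG state for the next step
}

int main()
{
    const int n = 4096;                      // e.g., residues of a coarse-grained protein
    float3 *r, *f; curandState* st;
    cudaMallocManaged(&r, n * sizeof(float3));
    cudaMallocManaged(&f, n * sizeof(float3));
    cudaMalloc(&st, n * sizeof(curandState));
    for (int i = 0; i < n; ++i) {
        r[i] = make_float3(0.0f, 0.0f, 0.0f);
        f[i] = make_float3(1.0f, 0.0f, 0.0f);  // toy constant force
    }
    initRng<<<(n + 255) / 256, 256>>>(st, 1234ULL, n);
    for (int step = 0; step < 1000; ++step)  // a real code recomputes forces each step
        langevinStep<<<(n + 255) / 256, 256>>>(r, f, st, 0.01f, 1.0f, 0.6f, n);
    cudaDeviceSynchronize();
    printf("r[0].x = %f\n", r[0].x);
    cudaFree(r); cudaFree(f); cudaFree(st);
    return 0;
}
```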
Graphics Processing Units for HEP trigger systems
Energy Technology Data Exchange (ETDEWEB)
Ammendola, R. [INFN Sezione di Roma “Tor Vergata”, Via della Ricerca Scientifica 1, 00133 Roma (Italy); Bauce, M. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Biagioni, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Chiozzi, S.; Cotta Ramusino, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Fantechi, R. [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); CERN, Geneve (Switzerland); Fiorini, M. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Giagu, S. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Gianoli, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Lamanna, G., E-mail: gianluca.lamanna@cern.ch [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); INFN Laboratori Nazionali di Frascati, Via Enrico Fermi 40, 00044 Frascati (Roma) (Italy); Lonardo, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Messina, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); and others
2016-07-11
General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPUs for a synchronous low-level trigger, focusing on the CERN NA62 experiment trigger system. The use of GPUs in higher-level trigger systems is also briefly considered.
R-GPU : A reconfigurable GPU architecture
van den Braak, G.J.; Corporaal, H.
2016-01-01
Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for massively parallel applications. The compute power arises from the GPU's Single Instruction/Multiple Threads (SIMT) architecture.
Micromagnetic simulations using Graphics Processing Units
International Nuclear Information System (INIS)
Lopez-Diaz, L; Aurelio, D; Torres, L; Martinez, E; Hernandez-Lopez, M A; Gomez, J; Alejos, O; Carpentieri, M; Finocchio, G; Consolo, G
2012-01-01
The methodology for adapting a standard micromagnetic code to run on graphics processing units (GPUs) and exploit the potential for parallel calculations of this platform is discussed. GPMagnet, a general purpose finite-difference GPU-based micromagnetic tool, is used as an example. Speed-up factors of two orders of magnitude can be achieved with GPMagnet with respect to a serial code. This allows for running extensive simulations, nearly inaccessible with a standard micromagnetic solver, at reasonable computational times. (topical review)
GPU based numerical simulation of core shooting process
Directory of Open Access Journals (Sweden)
Yi-zhong Zhang
2017-11-01
Core shooting is the most widely used technique to make sand cores and it plays an important role in the quality of sand cores. Although numerical simulation can hopefully optimize the core shooting process, research on numerical simulation of the core shooting process is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed, achieving good agreement with in-situ experiments. To meet the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve the calculation efficiency. The parallel algorithm, based on the Compute Unified Device Architecture (CUDA) platform, can significantly decrease computing time through the multi-threaded GPU; its design and optimization are discussed. The accuracy of the calculations was ensured by comparison with in-situ experimental results photographed by a high-speed camera, and the program was further validated with a transparent core-box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation results remained quite consistent with the experimental data. The GPU parallelization method thus successfully solves the problem of low computational efficiency of the 3D sand shooting simulation program, and the developed GPU program is appropriate for engineering applications.
Accelerating Molecular Dynamic Simulation on Graphics Processing Units
Friedrichs, Mark S.; Eastman, Peter; Vaidyanathan, Vishal; Houston, Mike; Legrand, Scott; Beberg, Adam L.; Ensign, Daniel L.; Bruns, Christopher M.; Pande, Vijay S.
2009-01-01
We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. PMID:19191337
Impact of memory bottleneck on the performance of graphics processing units
Son, Dong Oh; Choi, Hong Jun; Kim, Jong Myon; Kim, Cheol Hong
2015-12-01
Recent graphics processing units (GPUs) can process general-purpose applications as well as graphics applications with the help of various user-friendly application programming interfaces (APIs) supported by GPU vendors. Unfortunately, utilizing the hardware resources in the GPU efficiently is a challenging problem, since the GPU architecture is totally different from the traditional CPU architecture. To solve this problem, many studies have focused on techniques for improving system performance using GPUs. In this work, we analyze GPU performance while varying GPU parameters such as the number of cores and the clock frequency. According to our simulations, GPU performance can be improved by 125.8% and 16.2% on average as the number of cores and the clock frequency increase, respectively. However, the performance saturates when memory bottleneck problems occur due to huge data requests to the memory. The performance of GPUs can be improved further if the memory bottleneck is reduced by changing GPU parameters dynamically.
High-speed optical coherence tomography signal processing on GPU
International Nuclear Information System (INIS)
Li Xiqi; Shi Guohua; Zhang Yudong
2011-01-01
The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. This method not only obtains a better signal-to-noise ratio (SNR) but also achieves a faster signal processing time for SD-OCT than the widely used zero-padding interpolation method. Furthermore, the re-sampled data are obtained by convolving the acquired data with the coefficients in the time domain, so many interpolations can be performed concurrently, making this interpolation method well suited for parallel computing. Ultra-high-speed optical coherence tomography signal processing can therefore be realized by using a graphics processing unit (GPU) with the Compute Unified Device Architecture (CUDA). This paper introduces the signal processing steps of SD-OCT on the GPU. An experiment was performed to acquire a frame of SD-OCT data (400 A-lines × 2048 pixels per A-line) and process the data on the GPU in real time. The results show that the processing can be finished in 6.208 milliseconds, which is 37 times faster than on a central processing unit (CPU).
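The parallelism exploited here comes from the fact that each re-sampled point is a short convolution of acquired samples with precomputed coefficients, independent of all other points. A minimal kernel expressing that idea (one thread per output sample per A-line) might look as follows; the data layout, tap count and names are illustrative rather than the paper's implementation, and the host setup (coefficient computation, memory transfers, the subsequent per-line FFT) is omitted.

```cuda
#include <cuda_runtime.h>

// Time-domain interpolation as a short per-sample convolution.
// acq:   nAlines x nPix acquired spectra (row-major)
// coef:  nOut x K interpolation taps, one row per output sample
// start: index of the first acquired sample each output draws from
__global__ void resampleAlines(const float* acq, const float* coef,
                               const int* start, float* out,
                               int nAlines, int nPix, int nOut, int K)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // output sample within an A-line
    int a = blockIdx.y;                             // which A-line
    if (k >= nOut || a >= nAlines) return;
    const float* line = acq + (size_t)a * nPix;
    float s = 0.0f;
    for (int j = 0; j < K; ++j) {
        int src = start[k] + j;
        if (src >= 0 && src < nPix) s += line[src] * coef[k * K + j];
    }
    out[(size_t)a * nOut + k] = s;  // ready for the per-line FFT that follows in SD-OCT
}
```

A launch such as `resampleAlines<<<dim3((nOut + 255) / 256, nAlines), 256>>>(...)` covers a whole frame in one kernel call, which is what makes the per-frame millisecond timings reported above plausible.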
Data Sorting Using Graphics Processing Units
Directory of Open Access Journals (Sweden)
M. J. Mišić
2012-06-01
Graphics processing units (GPUs) have been increasingly used for general-purpose computation in recent years. GPU-accelerated applications are found in both scientific and commercial domains. Sorting is considered one of the most important operations in many applications, so its efficient implementation is essential for overall application performance. This paper represents an effort to analyze and evaluate implementations of representative sorting algorithms on graphics processing units. Three sorting algorithms (Quicksort, Merge sort, and Radix sort) were evaluated on the Compute Unified Device Architecture (CUDA) platform that is used to execute applications on NVIDIA graphics processing units. The algorithms were tested and evaluated using an automated test environment with input datasets of different characteristics. Finally, the results of this analysis are briefly discussed.
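For comparison with hand-written implementations like those evaluated in the paper, note that the CUDA toolkit ships with Thrust, whose thrust::sort dispatches to a highly tuned GPU sort (typically a radix sort for primitive types). A minimal usage example, independent of the paper's test harness:

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const int n = 1 << 20;
    thrust::host_vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = rand();  // unsorted input
    thrust::device_vector<int> d = h;           // copy to the GPU
    thrust::sort(d.begin(), d.end());           // parallel sort on the device
    thrust::copy(d.begin(), d.end(), h.begin());
    printf("first = %d, last = %d\n", (int)h[0], (int)h[n - 1]);
    return 0;
}
```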
Graphics processing units accelerated semiclassical initial value representation molecular dynamics
Energy Technology Data Exchange (ETDEWEB)
Tamascelli, Dario; Dambrosio, Francesco Saverio [Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, 20133 Milano (Italy); Conte, Riccardo [Department of Chemistry and Cherry L. Emerson Center for Scientific Computation, Emory University, Atlanta, Georgia 30322 (United States); Ceotto, Michele, E-mail: michele.ceotto@unimi.it [Dipartimento di Chimica, Università degli Studi di Milano, via Golgi 19, 20133 Milano (Italy)
2014-05-07
This paper presents a Graphics Processing Units (GPUs) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the GPU implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (NVIDIA Tesla C2075 and Kepler K20) versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W) and the critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant, and semiclassical GPU calculations are shown to be environmentally friendly.
Graphics processing unit based computation for NDE applications
Nahas, C. A.; Rajagopal, Prabhu; Balasubramaniam, Krishnan; Krishnamurthy, C. V.
2012-05-01
Advances in parallel processing in recent years are helping to reduce the cost of numerical simulation. Breakthroughs in Graphical Processing Unit (GPU) based computation now offer the prospect of further drastic improvements. The introduction of the 'compute unified device architecture' (CUDA) by NVIDIA (the global technology company based in Santa Clara, California, USA) has made programming GPUs for general-purpose computing accessible to the average programmer. Here we use CUDA to develop parallel finite-difference schemes as applicable to two problems of interest to the NDE community, namely heat diffusion and elastic wave propagation. The implementations are in two dimensions. The performance improvement of the GPU implementation over the serial CPU implementation is then discussed.
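Of the two problems mentioned, heat diffusion gives the simplest finite-difference scheme: an explicit 5-point stencil in 2-D where every grid point updates independently, which is why it maps so naturally to one CUDA thread per node. The following self-contained sketch (not the authors' code; grid size and coefficients are arbitrary) shows the pattern:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Explicit FTCS update for u_t = alpha * (u_xx + u_yy): one thread per grid node.
__global__ void heatStep(const float* u, float* un, int nx, int ny, float r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // fixed boundary
    int c = j * nx + i;
    un[c] = u[c] + r * (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx] - 4.0f * u[c]);
}

int main()
{
    const int nx = 512, ny = 512;
    const float r = 0.2f;                 // r = alpha*dt/h^2, must be <= 0.25 for stability
    float *u, *un;
    cudaMallocManaged(&u,  nx * ny * sizeof(float));
    cudaMallocManaged(&un, nx * ny * sizeof(float));
    for (int k = 0; k < nx * ny; ++k) u[k] = un[k] = 0.0f;
    u[(ny / 2) * nx + nx / 2] = 100.0f;   // hot spot in the middle
    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    for (int step = 0; step < 1000; ++step) {
        heatStep<<<grid, block>>>(u, un, nx, ny, r);
        float* tmp = u; u = un; un = tmp; // ping-pong buffers between steps
    }
    cudaDeviceSynchronize();
    printf("u(center) = %f\n", u[(ny / 2) * nx + nx / 2]);
    cudaFree(u); cudaFree(un);
    return 0;
}
```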
Exploiting graphics processing units for computational biology and bioinformatics.
Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H
2010-09-01
Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of general-purpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
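The article's running example, the all-pairs distance calculation, reduces to an independent Euclidean distance per (i, j) pair, so a 2-D grid of threads covers the distance matrix directly. A simplified version (without the coalescing and shared-memory optimizations the authors discuss, and with illustrative names) might read:

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

// D[i][j] = Euclidean distance between instances i and j (n instances, d features).
__global__ void allPairs(const float* X, float* D, int n, int d)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;
    float s = 0.0f;
    for (int k = 0; k < d; ++k) {
        float diff = X[i * d + k] - X[j * d + k];
        s += diff * diff;
    }
    D[i * n + j] = sqrtf(s);
}

int main()
{
    const int n = 1024, d = 32;
    float *X, *D;
    cudaMallocManaged(&X, n * d * sizeof(float));
    cudaMallocManaged(&D, (size_t)n * n * sizeof(float));
    for (int k = 0; k < n * d; ++k) X[k] = (float)(k % 13);  // toy feature matrix
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    allPairs<<<grid, block>>>(X, D, n, d);
    cudaDeviceSynchronize();
    printf("D[0][1] = %f\n", D[1]);
    cudaFree(X); cudaFree(D);
    return 0;
}
```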
High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU
Directory of Open Access Journals (Sweden)
STOJANOVIC, N.
2014-11-01
In this paper, the high-performance processing of massive geospatial data on a many-core GPU (Graphic Processing Unit) is presented. We use the CUDA (Compute Unified Device Architecture) programming framework to implement parallel processing of common Geographic Information Systems (GIS) algorithms, such as viewshed analysis and map-matching. Experimental evaluation indicates an improvement in performance with respect to CPU-based solutions and shows the feasibility of using GPUs and CUDA for parallel implementation of GIS algorithms over large-scale geospatial datasets.
Graphics Processing Unit Accelerated Hirsch-Fye Quantum Monte Carlo
Moore, Conrad; Abu Asal, Sameer; Rajagoplan, Kaushik; Poliakoff, David; Caprino, Joseph; Tomko, Karen; Thakur, Bhupender; Yang, Shuxiang; Moreno, Juana; Jarrell, Mark
2012-02-01
In Dynamical Mean Field Theory and its cluster extensions, such as the Dynamic Cluster Algorithm, the bottleneck of the algorithm is solving the self-consistency equations with an impurity solver. Hirsch-Fye Quantum Monte Carlo is one of the most commonly used impurity and cluster solvers. This work implements optimizations of the algorithm, such as enabling large data re-use, suitable for the Graphics Processing Unit (GPU) architecture. The GPU's sheer number of concurrent parallel computations and large bandwidth to many shared memories take advantage of the inherent parallelism in the Green function update and measurement routines, and can substantially improve the efficiency of the Hirsch-Fye impurity solver.
Accelerating VASP electronic structure calculations using graphic processing units
Hacene, Mohamed; Anciaux-Sedrakian, Ani; Rozanska, Xavier; Klahr, Diego; Guignon, Thomas; Fleurat-Lessard, Paul
2012-01-01
We present a way to improve the performance of the electronic structure Vienna Ab initio Simulation Package (VASP) program. We show that high-performance computers equipped with graphics processing units (GPUs) as accelerators may drastically reduce the computation time when the most demanding sections of the code are offloaded to the graphics chips. The procedure consists of (i) profiling the performance of the code to isolate the time-consuming parts, (ii) rewriting these so that the algorithms become better suited for the chosen graphics accelerator, and (iii) optimizing memory traffic between the host computer and the GPU accelerator. We chose to accelerate VASP with NVIDIA GPUs using CUDA. We compare the GPU and original versions of VASP by evaluating the Davidson and RMM-DIIS algorithms on chemical systems of up to 1100 atoms. In these tests, the total time is reduced by a factor between 3 and 8 when running on n (CPU core + GPU) pairs compared to n CPU cores only, without any accuracy loss.
Graphics Processing Unit Enhanced Parallel Document Flocking Clustering
Energy Technology Data Exchange (ETDEWEB)
Cui, Xiaohui [ORNL; Potok, Thomas E [ORNL; ST Charles, Jesse Lee [ORNL
2010-01-01
Analyzing and clustering documents is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. One limitation of this method of document clustering is its complexity O(n²). As the number of documents grows, it becomes increasingly difficult to generate results in a reasonable amount of time. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. In this paper, we have conducted research to exploit this architecture and apply its strengths to the flocking based document clustering problem. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GeForce GPU. Performance gains ranged from thirty-six to nearly sixty times improvement of the GPU over the CPU implementation.
MASSIVELY PARALLEL LATENT SEMANTIC ANALYSES USING A GRAPHICS PROCESSING UNIT
Energy Technology Data Exchange (ETDEWEB)
Cavanagh, J.; Cui, S.
2009-01-01
Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. A graphics processing unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor or central processing unit (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a PC cluster. Due to the GPU's application-specific architecture, harnessing the GPU's computational prowess for LSA is a great challenge. We presented a parallel LSA implementation on the GPU, using NVIDIA® Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms software. The performance of this implementation is compared to a traditional LSA implementation on a CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000×1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits of the GPU for matrices divisible by 16. It should be noted that the overall speeds for the CPU version did not vary from the norm when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.
High-throughput sequence alignment using Graphics Processing Units
Directory of Open Access Journals (Sweden)
Trapnell Cole
2007-12-01
Background: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion: MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to deal with 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, optimization of the shader design and load sharing between the vertex and fragment shaders was performed. The beamformed data were captured from a tissue-mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance was evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing the PSNR against a MATLAB gold standard with the same signal path; the CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method showed no significant difference from the MATLAB processing (i.e., a PSNR of 11.31). The mobile GPU implementation achieved frame rates of 57.6 Hz. The total execution time was 17.4 ms, which was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on a smartphone.
Belleman, R.G.; Bédorf, J.; Portegies Zwart, S.F.
2008-01-01
We present the results of gravitational direct N-body simulations using the graphics processing unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the N-body problem is implemented in "Compute Unified Device Architecture" (CUDA) using the GPU to
GAMER: A GRAPHIC PROCESSING UNIT ACCELERATED ADAPTIVE-MESH-REFINEMENT CODE FOR ASTROPHYSICS
International Nuclear Information System (INIS)
Schive, H.-Y.; Tsai, Y.-C.; Chiueh Tzihong
2010-01-01
We present the newly developed code, GPU-accelerated Adaptive-MEsh-Refinement code (GAMER), which adopts a novel approach in improving the performance of adaptive-mesh-refinement (AMR) astrophysical simulations by a large factor with the use of the graphic processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing total variation diminishing scheme for the hydrodynamic solver and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented in GPU, by which hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between the CPU and GPU is carefully reduced by utilizing the capability of asynchronous memory copies in GPU, and the computing time of the ghost-zone values for each patch is diminished by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster system. We measure the performance of the code by performing purely baryonic cosmological simulations in different hardware implementations, in which detailed timing analyses provide comparison between the computations with and without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using one GPU with 4096³ effective resolution and 16 GPUs with 8192³ effective resolution, respectively.
High-Performance Pseudo-Random Number Generation on Graphics Processing Units
Nandapalan, Nimalan; Brent, Richard P.; Murray, Lawrence M.; Rendell, Alistair
2011-01-01
This work considers the deployment of pseudo-random number generators (PRNGs) on graphics processing units (GPUs), developing an approach based on the xorgens generator to rapidly produce pseudo-random numbers of high statistical quality. The chosen algorithm has configurable state size and period, making it ideal for tuning to the GPU architecture. We present a comparison of both speed and statistical quality with other common parallel, GPU-based PRNGs, demonstrating favourable performance o...
Optimized Laplacian image sharpening algorithm based on graphic processing unit
Ma, Tinghuai; Li, Lu; Ji, Sai; Wang, Xin; Tian, Yuan; Al-Dhelaan, Abdullah; Al-Rodhaan, Mznah
2014-12-01
In classical Laplacian image sharpening, all pixels are processed one by one, which leads to a large amount of computation. Traditional Laplacian sharpening performed on a CPU is considerably time-consuming, especially for large pictures. In this paper, we propose a parallel implementation of Laplacian sharpening based on the Compute Unified Device Architecture (CUDA), a computing platform for Graphics Processing Units (GPUs), and analyze the impact of picture size on performance as well as the relationship between data transfer time and parallel computing time. Further, according to the features of the different memory types, an improved scheme is developed which exploits shared memory in the GPU instead of global memory and further increases the efficiency. Experimental results prove that the two novel algorithms outperform the traditional sequential method based on OpenCV in terms of computing speed.
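The per-pixel independence that makes Laplacian sharpening parallelizable is visible in a minimal CUDA kernel: each thread reads a pixel's four neighbours, forms the Laplacian response, and writes one sharpened pixel. This simplified sketch uses global memory only; the paper's improved scheme would stage the neighbourhood in shared memory instead. Names and the clamping convention are illustrative, and the host code (image transfer and a launch over a grid of 16×16 blocks covering the image) is omitted.

```cuda
#include <cuda_runtime.h>

// Sharpened = original + Laplacian response, with the 4-neighbour kernel
// [ 0 -1  0; -1  4 -1; 0 -1  0 ]. One thread per interior pixel.
__global__ void laplacianSharpen(const unsigned char* in, unsigned char* out,
                                 int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;  // skip the border
    int c = y * w + x;
    int lap = 4 * in[c] - in[c - 1] - in[c + 1] - in[c - w] - in[c + w];
    int v = in[c] + lap;                                       // add edge response back in
    out[c] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v)); // clamp to 8 bits
}
```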
Partial wave analysis using graphics processing units
Energy Technology Data Exchange (ETDEWEB)
Berger, Niklaus; Liu Beijiang; Wang Jike, E-mail: nberger@ihep.ac.c [Institute of High Energy Physics, Chinese Academy of Sciences, 19B Yuquan Lu, Shijingshan, 100049 Beijing (China)
2010-04-01
Partial wave analysis is an important tool for determining resonance properties in hadron spectroscopy. For large data samples however, the un-binned likelihood fits employed are computationally very expensive. At the Beijing Spectrometer (BES) III experiment, an increase in statistics compared to earlier experiments of up to two orders of magnitude is expected. In order to allow for a timely analysis of these datasets, additional computing power with short turnover times has to be made available. It turns out that graphics processing units (GPUs) originally developed for 3D computer games have an architecture of massively parallel single instruction multiple data floating point units that is almost ideally suited for the algorithms employed in partial wave analysis. We have implemented a framework for tensor manipulation and partial wave fits called GPUPWA. The user writes a program in pure C++ whilst the GPUPWA classes handle computations on the GPU, memory transfers, caching and other technical details. In conjunction with a recent graphics processor, the framework provides a speed-up of the partial wave fit by more than two orders of magnitude compared to legacy FORTRAN code.
Pseudo-random number generators for Monte Carlo simulations on ATI Graphics Processing Units
Demchik, Vadim
2011-03-01
Basic uniform pseudo-random number generators are implemented on ATI Graphics Processing Units (GPUs). The performance results of the implemented generators (multiplicative linear congruential (GGL), XOR-shift (XOR128), RANECU, RANMAR, RANLUX and Mersenne Twister (MT19937)) on CPU and GPU are discussed. The obtained speed-up factor is hundreds of times in comparison with the CPU. The RANLUX generator is found to be the most appropriate for use on the GPU in Monte Carlo simulations. A brief review of the pseudo-random number generators used in modern software packages for Monte Carlo simulations in high-energy physics is presented.
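Of the generators listed, XOR128 is the easiest to express on a GPU: each thread keeps a tiny 128-bit state and advances it with a few shifts and XORs. The sketch below uses Marsaglia's xor128 step with one stream per thread; it is written in CUDA for consistency with the rest of this collection, whereas the paper itself targets ATI hardware, and the per-thread seeding shown is a naive illustration (poor seeding is exactly the kind of statistical-quality issue such papers evaluate).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Marsaglia xor128: state (x, y, z, w), period 2^128 - 1.
__device__ unsigned int xor128(uint4& s)
{
    unsigned int t = s.x ^ (s.x << 11);
    s.x = s.y; s.y = s.z; s.z = s.w;
    s.w = s.w ^ (s.w >> 19) ^ (t ^ (t >> 8));
    return s.w;
}

__global__ void fillRandom(unsigned int* out, int perThread, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;
    // Naive decorrelation of per-thread seeds -- for illustration only.
    uint4 s = make_uint4(123456789u ^ tid, 362436069u + 69069u * tid,
                         521288629u ^ (tid << 16), 88675123u + tid);
    for (int i = 0; i < perThread; ++i)
        out[(size_t)tid * perThread + i] = xor128(s);
}

int main()
{
    const int nThreads = 1 << 14, perThread = 64;
    unsigned int* d;
    cudaMallocManaged(&d, (size_t)nThreads * perThread * sizeof(unsigned int));
    fillRandom<<<(nThreads + 255) / 256, 256>>>(d, perThread, nThreads);
    cudaDeviceSynchronize();
    printf("first draw: %u\n", d[0]);
    cudaFree(d);
    return 0;
}
```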
Accelerated Numerical Processing API Based on GPU Technology, Phase II
National Aeronautics and Space Administration — The recent performance increases in graphics processing units (GPUs) have made graphics cards an attractive platform for implementing computationally intense...
Accelerated Numerical Processing API Based on GPU Technology, Phase I
National Aeronautics and Space Administration — The recent performance increases in graphics processing units (GPUs) have made graphics cards an attractive platform for implementing computationally intense...
GPU-Based FFT Computation for Multi-Gigabit WirelessHD Baseband Processing
Directory of Open Access Journals (Sweden)
Nicholas Hinitt
2010-01-01
The next generation Graphics Processing Units (GPUs) are being considered for non-graphics applications. Millimeter wave (60 GHz) wireless networks that are capable of multi-gigabit per second (Gbps) transfer rates require a significant baseband throughput. In this work, we consider the baseband of WirelessHD, a 60 GHz communications system, which can provide a data rate of up to 3.8 Gbps over a short range wireless link. Thus, we explore the feasibility of achieving gigabit baseband throughput using GPUs. One of the most computationally intensive functions commonly used in baseband communications, the Fast Fourier Transform (FFT) algorithm, is implemented on an NVIDIA GPU using their general-purpose computing platform called the Compute Unified Device Architecture (CUDA). The paper first investigates the implementation of an FFT algorithm using the GPU hardware and exploiting the computational capability available. It then outlines the limitations discovered and the methods used to overcome these challenges. Finally, a new algorithm to compute the FFT is proposed, which reduces interprocessor communication. It is further optimized by improving memory access, enabling the processing rate to exceed 4 Gbps and achieving a processing time of less than 200 ns for a 512-point FFT using a two-GPU solution.
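For context, the standard way to obtain batched 512-point FFTs on an NVIDIA GPU today is the cuFFT library rather than a hand-written kernel; the paper hand-optimizes its own algorithm, so the following is only a baseline sketch of the same workload, not the paper's method (the batch size is arbitrary; link with -lcufft):

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <cstdio>

int main()
{
    const int nfft = 512, batch = 4096;        // 4096 symbol blocks per launch
    cufftComplex* d;
    cudaMalloc(&d, sizeof(cufftComplex) * nfft * batch);
    cudaMemset(d, 0, sizeof(cufftComplex) * nfft * batch);  // stand-in for real samples

    cufftHandle plan;
    cufftPlan1d(&plan, nfft, CUFFT_C2C, batch); // one plan covers all transforms
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);    // in-place batched 512-point FFT
    cudaDeviceSynchronize();

    printf("executed %d x %d-point FFTs\n", batch, nfft);
    cufftDestroy(plan);
    cudaFree(d);
    return 0;
}
```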
Mobile Devices and GPU Parallelism in Ionospheric Data Processing
Mascharka, D.; Pankratius, V.
2015-12-01
Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million data points in real time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be adjusted interactively by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
Graphical Language for Data Processing
Alphonso, Keith
2011-01-01
A graphical language for processing data allows processing elements to be connected with virtual wires that represent data flows between processing modules. The processing of complex data, such as lidar data, requires many different algorithms to be applied. The purpose of this innovation is to automate the processing of such complex data without the need for complex scripting and programming languages. The system consists of a set of user-interface components that allow the user to drag and drop various algorithmic and processing components onto a process graph. By working graphically, the user can completely visualize the process flow and create complex diagrams. This innovation supports the nesting of graphs, such that a graph can be included in another graph as a single step for processing. In addition to the user-interface components, the system includes a set of .NET classes that represent the graph internally. These classes provide the internal system representation of the graphical user interface. The system includes a graph execution component that reads the internal representation of the graph (as described above) and executes that graph. The execution of the graph follows the interpreted model of execution in that each node is traversed and executed from the original internal representation. In addition, there are components that allow external code elements, such as algorithms, to be easily integrated into the system, thus making the system infinitely expandable.
International Nuclear Information System (INIS)
Dudnik, V.A.; Kudryavtsev, V.I.; Us, S.A.; Shestakov, M.V.
2015-01-01
A comparative analysis has been made of the capabilities of the hardware and software tools of the two most widely used modern graphics processor architectures (AMD and NVIDIA). Special features and differences of the GPU architectures are exemplified by fragments of GPGPU programs. The time consumed by program development has been estimated. Some advice is given as to the optimum choice of GPU type for speeding up the processing of scientific research results. Recommendations are formulated for the use of software tools that reduce the GPGPU application programming time for the given types of graphics processors.
Stochastic Analysis of a Queue Length Model Using a Graphics Processing Unit
Czech Academy of Sciences Publication Activity Database
Přikryl, Jan; Kocijan, J.
2012-01-01
Vol. 5, No. 2 (2012), pp. 55-62. ISSN 1802-971X. R&D Projects: GA MŠk(CZ) MEB091015. Institutional support: RVO:67985556. Keywords: graphics processing unit * GPU * Monte Carlo simulation * computer simulation * modeling. Subject RIV: BC - Control Systems Theory. http://library.utia.cas.cz/separaty/2012/AS/prikryl-stochastic analysis of a queue length model using a graphics processing unit.pdf
GPU Computing Gems Emerald Edition
Hwu, Wen-mei W
2011-01-01
".the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010 Graphics processing units (GPUs) can do much more than render graphics. Scientists and researchers increasingly look to GPUs to improve the efficiency and performance of computationally-intensive experiments across a range of disciplines. GPU Computing Gems: Emerald Edition brings their techniques to you, showcasing GPU-based solutions including: Black hole simulations with CUDA GPU-accelerated computation and interactive display of
GPU Computing For Particle Tracking
International Nuclear Information System (INIS)
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna; Sun, Changchun; James, Susan; Qin, Yong
2011-01-01
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize an accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program was developed for dynamic aperture calculation. Performance, issues, and challenges of introducing the GPU are also discussed. General Purpose computation on Graphics Processing Units (GPGPU) brings massive parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to optimize existing applications well. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time-consuming part of accelerator modeling and simulation, can benefit from the GPU because it is embarrassingly parallel and fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multiprocessors (MPs) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using the GPU to speed up the dynamic aperture calculation by having each thread track a particle.
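The one-thread-per-particle idea described above can be illustrated with a minimal CUDA sketch. The lattice physics here (a linear one-turn map plus a thin sextupole kick and a simple loss aperture) is a hypothetical stand-in for the Tracy++ element-by-element tracking, not the actual TracyGPU code.

```cuda
// One thread tracks one particle through nTurns of a toy lattice:
// a linear one-turn map followed by a thin sextupole kick. part.x/part.y
// hold (x, x'); part.w records the loss turn, or -1 if the particle survives.
__global__ void trackParticles(float4* part, int nPart, int nTurns,
                               float m11, float m12, float m21, float m22,
                               float k2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPart) return;
    float x = part[i].x, xp = part[i].y;
    for (int t = 0; t < nTurns; ++t) {
        float xn  = m11 * x + m12 * xp;          // linear transport
        float xpn = m21 * x + m22 * xp;
        xpn += k2 * xn * xn;                     // thin sextupole kick
        x = xn; xp = xpn;
        if (fabsf(x) > 0.1f) {                   // crude loss aperture
            part[i].x = x; part[i].y = xp; part[i].w = (float)t;
            return;
        }
    }
    part[i].x = x; part[i].y = xp;
    part[i].w = -1.0f;                           // survived: inside aperture
}
```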
Accelerating cardiac bidomain simulations using graphics processing units.
Neic, A; Liebmann, M; Hoetzl, E; Mitchell, L; Vigmond, E J; Haase, G; Plank, G
2012-08-01
Anatomically realistic and biophysically detailed multiscale computer models of the heart are playing an increasingly important role in advancing our understanding of integrated cardiac function in health and disease. Such detailed simulations, however, are computationally very demanding, which is a limiting factor for a wider adoption of in-silico modeling. While current trends in high-performance computing (HPC) hardware promise to alleviate this problem, exploiting the potential of such architectures remains challenging, since strongly scalable algorithms are required to reduce execution times. Alternatively, acceleration technologies such as graphics processing units (GPUs) are being considered. While the potential of GPUs has been demonstrated in various applications, the benefits in the context of bidomain simulations, where large sparse linear systems have to be solved in parallel with advanced numerical techniques, are less clear. In this study, the feasibility of multi-GPU bidomain simulations is demonstrated by running strong scalability benchmarks using a state-of-the-art model of rabbit ventricles. The model is spatially discretized using the finite element method (FEM) on fully unstructured grids. The GPU code is directly derived from a large pre-existing code, the Cardiac Arrhythmia Research Package (CARP), with very minor perturbation of the code base. Overall, bidomain simulations were sped up by a factor of 11.8 to 16.3 in benchmarks running on 6-20 GPUs compared to the same number of CPU cores. To match the fastest GPU simulation, which engaged 20 GPUs, 476 CPU cores were required on a national supercomputing facility.
Flocking-based Document Clustering on the Graphics Processing Unit
Energy Technology Data Exchange (ETDEWEB)
Cui, Xiaohui [ORNL; Potok, Thomas E [ORNL; Patton, Robert M [ORNL; ST Charles, Jesse Lee [ORNL
2008-01-01
Analyzing and grouping documents by content is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. Each bird represents a single document and flies toward other documents that are similar to it. One limitation of this method of document clustering is its complexity of O(n²). As the number of documents grows, it becomes increasingly difficult to receive results in a reasonable amount of time. However, flocking behavior, along with most naturally inspired algorithms such as ant colony optimization and particle swarm optimization, is highly parallel and has found increased performance on expensive cluster computers. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. Some applications see a huge increase in performance on this new platform. The cost of these high-performance devices is also marginal when compared with the price of cluster machines. In this paper, we have conducted research to exploit this architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to all naturally inspired algorithms. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GeForce 8800. Additionally, we developed a similar but sequential implementation of the same algorithm to be run on a desktop CPU. We tested the performance of each on groups of news articles ranging in size from 200 to 3000 documents. The results of these tests were very significant. Performance gains ranged from three to nearly five times over the CPU implementation. This dramatic improvement in runtime makes the GPU a potentially revolutionary platform for document clustering algorithms.
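A minimal CUDA sketch of the O(n²) flocking step is given below: one thread updates one document-bird by scanning all others. The similarity matrix, force weighting, and integration scheme are simplified assumptions for illustration, not the paper's exact rules.

```cuda
// Each thread updates one document-bird from all others (the O(n^2) step
// the paper parallelizes). sim[] holds precomputed pairwise document
// similarities in [0,1]; the weighting below is illustrative only.
__global__ void flockStep(float2* pos, float2* vel, const float* sim,
                          int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float fx = 0.f, fy = 0.f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y;
        float d2 = dx * dx + dy * dy + 1e-6f;
        // similar documents attract (cohesion), dissimilar ones repel
        float w = (sim[i * n + j] - 0.5f) / d2;
        fx += w * dx; fy += w * dy;
    }
    vel[i].x += dt * fx;  vel[i].y += dt * fy;
    pos[i].x += dt * vel[i].x;  pos[i].y += dt * vel[i].y;
}
```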
GPU Boosted CNN Simulator Library for Graphical Flow-Based Programmability
Directory of Open Access Journals (Sweden)
Balázs Gergely Soós
2009-01-01
Full Text Available A graphical environment for CNN algorithm development is presented. The new generation of graphics cards with many general-purpose processing units introduces massively parallel computing into the PC environment. Universal Machine on Flows (UMF)-like notation, highlighting image flows and operations, is a useful tool to describe image processing algorithms. This documentation step can be turned into modeling using our framework, backed by MATLAB Simulink and the power of a video card. This relatively cheap extension enables convenient and fast analysis of CNN dynamics and complex algorithms. Comparison with other PC solutions is also presented. For single template execution, our approach yields run times 40x faster than that of the widely used Candy simulator. In the case of simpler algorithms, real-time execution is also possible.
Spectra processing with computer graphics
International Nuclear Information System (INIS)
Kruse, H.
1979-01-01
A program for processing gamma-ray spectra in rock analysis is described. The peak search was performed by applying a cross-correlation function. The experimental data were approximated by an analytical function represented by the sum of a polynomial and a multiple-peak function. The latter is a Gaussian, joined on the low-energy side by an exponential. A modified Gauss-Newton algorithm is applied to fit the data to this function. The processing of values derived from a lunar sample demonstrates the effect of different choices of polynomial order for approximating the background over various fitting intervals. Observations on applications of interactive graphics are presented. 3 figures, 1 table
Implementation of RLS-based Adaptive Filterson nVIDIA GeForce Graphics Processing Unit
Hirano, Akihiro; Nakayama, Kenji
2011-01-01
This paper presents an efficient implementation of RLS-based adaptive filters with a large number of taps on an nVIDIA GeForce graphics processing unit (GPU) using the CUDA software development environment. Modifying the order and the combination of calculations reduces the number of accesses to slow off-chip memory. Assigning tasks to multiple threads also takes memory access order into account. For a 4096-tap case, the GPU program is almost three times faster than a CPU program.
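For reference, the per-sample recursions of the standard RLS algorithm (textbook form, with forgetting factor λ, weight vector w, input vector x, desired signal d, and inverse correlation matrix P) are the computations whose ordering and memory placement the paper optimizes:

```latex
\begin{aligned}
\mathbf{k}(n) &= \frac{\mathbf{P}(n-1)\,\mathbf{x}(n)}
                     {\lambda + \mathbf{x}^{T}(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)},\\
e(n) &= d(n) - \mathbf{w}^{T}(n-1)\,\mathbf{x}(n),\\
\mathbf{w}(n) &= \mathbf{w}(n-1) + \mathbf{k}(n)\,e(n),\\
\mathbf{P}(n) &= \lambda^{-1}\!\left[\mathbf{P}(n-1)
                 - \mathbf{k}(n)\,\mathbf{x}^{T}(n)\,\mathbf{P}(n-1)\right].
\end{aligned}
```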
An Application of Graphics Processing Units to Geosimulation of Collective Crowd Behaviour
Directory of Open Access Journals (Sweden)
Cjoskāns Jānis
2017-12-01
Full Text Available The goal of the paper is to assess the ways for computational performance and efficiency improvement of collective crowd behaviour simulation by using parallel computing methods implemented on graphics processing unit (GPU. To perform an experimental evaluation of benefits of parallel computing, a new GPU-based simulator prototype is proposed and the runtime performance is analysed. Based on practical examples of pedestrian dynamics geosimulation, the obtained performance measurements are compared to several other available multiagent simulation tools to determine the efficiency of the proposed simulator, as well as to provide generic guidelines for the efficiency improvements of the parallel simulation of collective crowd behaviour.
Energy Technology Data Exchange (ETDEWEB)
Smith, M.
2009-11-15
Accelware has developed graphics processing unit (GPU) technology that is transforming the petroleum industry. The benefits of the technology are its small footprint, low wattage, and high speed. The software brings supercomputing speed to the desktop by leveraging the massively parallel processing capacity of the very latest GPU technology. This article discussed the GPU technology and its emergence as a powerful supercomputing tool. Accelware's partnering with California-based NVIDIA was also outlined. The advantages of the technology were also discussed, including its smaller footprint: Accelware's hardware takes up a fraction of the space and uses up to 70 per cent less power than a traditional central processing unit. By combining Accelware's core knowledge in making complex algorithms run in parallel with an in-house team of seismic industry experts, the company provides software solutions for seismic data processors that access the massively parallel processing capabilities of GPUs. 1 fig.
High-speed, multi-input, multi-output control using GPU processing in the HBT-EP tokamak
Energy Technology Data Exchange (ETDEWEB)
Rath, N., E-mail: Nikolaus@rath.org [Columbia University, Rm 200 Mudd, 500 W 120th St, New York, NY - 10027 (United States); Bialek, J.; Byrne, P.J.; DeBono, B.; Levesque, J.P.; Li, B.; Mauel, M.E.; Maurer, D.A.; Navratil, G.A.; Shiraki, D. [Columbia University, Rm 200 Mudd, 500 W 120th St, New York, NY - 10027 (United States)
2012-12-15
Highlights: • We present a GPU-based system for magnetic control of perturbed equilibria. • Cycle times are below 5 μs and I/O latencies below 10 μs for 96 inputs and 64 outputs. • A new architecture removes host RAM and CPU from the control cycle. • GPU and DA/AD modules operate independently and communicate via PCIe peer-to-peer connections. • The Linux host system does not require real-time extensions. - Abstract: We report on the design of a new plasma control system for the HBT-EP tokamak that utilizes a graphical processing unit (GPU) to magnetically control the 3D perturbed equilibrium state [1] of the plasma. The control system achieves cycle times of 5 μs and I/O latencies below 10 μs for up to 96 inputs and 64 outputs. The number of state variables is of the same order. To handle the resulting computational complexity under the given time constraints, the control algorithms are designed for massively parallel processing. The necessary hardware resources are provided by an NVIDIA Tesla M2050 GPU, offering a total of 448 computing cores running at 1.3 GHz each. A new control architecture allows control input from magnetic diagnostics to be pushed directly into GPU memory by a D-TACQ ACQ196 digitizer, and control output to be pulled directly from GPU memory by two D-TACQ AO32 analog output modules. By using peer-to-peer PCI Express connections, this technique completely eliminates the use of host RAM and the central processing unit (CPU) from the control cycle, permitting single-digit microsecond latencies on a standard Linux host system without any real-time extensions.
Real-time radar signal processing using GPGPU (general-purpose graphic processing unit)
Kong, Fanxing; Zhang, Yan Rockee; Cai, Jingxiao; Palmer, Robert D.
2016-05-01
This study introduces a practical approach to developing a real-time signal processing chain for general phased-array radar on NVIDIA GPUs (Graphical Processing Units) using CUDA (Compute Unified Device Architecture) libraries such as cuBLAS and cuFFT, which are adopted from open-source libraries and optimized for NVIDIA GPUs. The processed results are rigorously verified against those from the CPUs. Performance, benchmarked by computation time for various input data cube sizes, is compared across GPUs and CPUs. Through the analysis, it is demonstrated that GPGPU (general-purpose GPU) real-time processing of array radar data is possible with relatively low-cost commercial GPUs.
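As a minimal sketch of one stage in such a chain, the snippet below runs a batched FFT over a radar data cube with cuFFT; the stage name, sizes, and in-place layout are illustrative assumptions, not taken from the paper.

```cuda
#include <cufft.h>

// Batched FFT over a radar data cube: one transform per range line.
// nFft is the transform length, nBatch the number of range lines.
void rangeFft(cufftComplex* d_cube, int nFft, int nBatch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nFft, CUFFT_C2C, nBatch);        // nBatch FFTs of length nFft
    cufftExecC2C(plan, d_cube, d_cube, CUFFT_FORWARD);  // in-place forward transform
    cufftDestroy(plan);
}
```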
FAST CALCULATION OF THE LOMB-SCARGLE PERIODOGRAM USING GRAPHICS PROCESSING UNITS
International Nuclear Information System (INIS)
Townsend, R. H. D.
2010-01-01
I introduce a new code for fast calculation of the Lomb-Scargle periodogram that leverages the computing power of graphics processing units (GPUs). After establishing a background to the newly emergent field of GPU computing, I discuss the code design and narrate key parts of its source. Benchmarking calculations indicate no significant differences in accuracy compared to an equivalent CPU-based code. However, the differences in performance are pronounced; running on a low-end GPU, the code can match eight CPU cores, and on a high-end GPU it is faster by a factor approaching 30. Applications of the code include analysis of long photometric time series obtained by ongoing satellite missions and upcoming ground-based monitoring facilities, and Monte Carlo simulation of periodogram statistical properties.
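A minimal sketch of the GPU mapping, assuming one thread per trial frequency and mean-subtracted data (the variance normalization is omitted for brevity), could look as follows; this is an illustration, not the published code:

```cuda
// One thread computes the Lomb-Scargle power at one trial frequency.
// t[]: observation times, y[]: mean-subtracted values, omega[]: frequencies.
__global__ void lombScargle(const float* t, const float* y, int n,
                            const float* omega, float* power, int nFreq)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nFreq) return;
    float w = omega[k];
    // time offset tau: tan(2*w*tau) = sum(sin 2wt) / sum(cos 2wt)
    float s2 = 0.f, c2 = 0.f;
    for (int i = 0; i < n; ++i) {
        s2 += sinf(2.f * w * t[i]);
        c2 += cosf(2.f * w * t[i]);
    }
    float tau = 0.5f * atan2f(s2, c2) / w;
    float yc = 0.f, ys = 0.f, cc = 0.f, ss = 0.f;
    for (int i = 0; i < n; ++i) {
        float a = w * (t[i] - tau);
        float c = cosf(a), s = sinf(a);
        yc += y[i] * c;  ys += y[i] * s;
        cc += c * c;     ss += s * s;
    }
    power[k] = 0.5f * (yc * yc / cc + ys * ys / ss);
}
```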
Monte Carlo method for neutron transport calculations in graphics processing units (GPUs)
International Nuclear Information System (INIS)
Pellegrino, Esteban
2011-01-01
Monte Carlo simulation is well suited to solving the Boltzmann neutron transport equation in inhomogeneous media and for complicated geometries. However, routine applications require the computation time to be reduced to hours or even minutes on a desktop PC. Interest in adopting Graphics Processing Units (GPUs) for Monte Carlo acceleration is growing rapidly, owing to the massive parallelism provided by the latest GPU technologies, which is the most promising solution to the challenge of performing full-size reactor core analysis on a routine basis. In this study, Monte Carlo codes for a fixed-source neutron transport problem were developed for GPU environments in order to evaluate issues associated with computational speedup using GPUs. The results obtained in this work suggest that a speedup of several orders of magnitude is possible using state-of-the-art GPU technologies. (author)
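As an illustration of the per-history parallelism, the hedged sketch below follows one neutron per thread through a homogeneous 1D slab using cuRAND; the geometry, cross sections, and tally are simplified assumptions, not the codes developed in the study.

```cuda
#include <curand_kernel.h>

// One thread per neutron history in a homogeneous slab on [0, L]:
// sample free flights, then absorb or isotropically scatter.
__global__ void transport(unsigned long long seed, int nHist,
                          float sigT, float sigA, float L,
                          unsigned int* leakedRight)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= nHist) return;
    curandState rng;
    curand_init(seed, id, 0, &rng);
    float x = 0.f, mu = 1.f;                            // birth at left face
    while (true) {
        float d = -logf(curand_uniform(&rng)) / sigT;   // sampled free flight
        x += mu * d;
        if (x < 0.f) break;                             // leaked out the left
        if (x > L) { atomicAdd(leakedRight, 1u); break; } // leakage tally
        if (curand_uniform(&rng) < sigA / sigT) break;  // absorbed
        mu = 2.f * curand_uniform(&rng) - 1.f;          // isotropic scatter
    }
}
```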
Real-time track-less Cherenkov ring fitting trigger system based on Graphics Processing Units
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Cotta Ramusino, A.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Gianoli, A.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-12-01
The parallel computing power of commercial Graphics Processing Units (GPUs) is exploited to perform real-time ring fitting at the lowest trigger level using information coming from the Ring Imaging Cherenkov (RICH) detector of the NA62 experiment at CERN. To this purpose, direct GPU communication with a custom FPGA-based board has been used to reduce the data transmission latency. The GPU-based trigger system is currently integrated in the experimental setup of the RICH detector of the NA62 experiment, in order to reconstruct ring-shaped hit patterns. The ring-fitting algorithm running on GPU is fed with raw RICH data only, with no information coming from other detectors, and is able to provide more complex trigger primitives with respect to the simple photodetector hit multiplicity, resulting in a higher selection efficiency. The performance of the system for multi-ring Cherenkov online reconstruction obtained during the NA62 physics run is presented.
Ramirez, Andres; Rahnemoonfar, Maryam
2017-04-01
A hyperspectral image is a multidimensional data structure rich in information, consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms results in long computation times. To overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyze a hyperspectral image through the usage of parallel hardware and a parallel programming model that is simpler to handle than other low-level parallel programming models. Additionally, Hadoop was used as an open-source implementation of the MapReduce parallel programming model. This research compared classification accuracy and timing results between the Hadoop and GPU system and tested it against the following test cases: a combined CPU and GPU test case, a CPU-only test case, and a test case where no dimensionality reduction was applied.
Graphics processing units in bioinformatics, computational biology and systems biology.
Nobile, Marco S; Cazzaniga, Paolo; Tangherloni, Andrea; Besozzi, Daniela
2017-09-01
Several studies in Bioinformatics, Computational Biology and Systems Biology rely on the definition of physico-chemical or mathematical models of biological systems at different scales and levels of complexity, ranging from the interaction of atoms in single molecules up to genome-wide interaction networks. Traditional computational methods and software tools developed in these research fields share a common trait: they can be computationally demanding on Central Processing Units (CPUs), therefore limiting their applicability in many circumstances. To overcome this issue, general-purpose Graphics Processing Units (GPUs) are gaining increasing attention from the scientific community, as they can considerably reduce the running time required by standard CPU-based software and allow more intensive investigations of biological systems. In this review, we present a collection of GPU tools recently developed to perform computational analyses in life science disciplines, emphasizing the advantages and the drawbacks in the use of these parallel architectures. The complete list of GPU-powered tools here reviewed is available at http://bit.ly/gputools. © The Author 2016. Published by Oxford University Press.
Graphic Design in Libraries: A Conceptual Process
Ruiz, Miguel
2014-01-01
Providing successful library services requires efficient and effective communication with users; therefore, it is important that content creators who develop visual materials understand key components of design and, specifically, develop a holistic graphic design process. Graphic design, as a form of visual communication, is the process of…
Wilson, J Adam; Williams, Justin C
2009-01-01
The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
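The spatial-filter step described above is a plain matrix-matrix multiplication, which maps directly onto a single cuBLAS GEMM call. The sketch below is a minimal illustration under assumed column-major layouts and hypothetical names, not the authors' implementation:

```cuda
#include <cublas_v2.h>

// Spatial filtering as a GEMM: Y = F * X, where X is (nChan x nSamp)
// raw data and F is (nSpatial x nChan), all column-major on the GPU.
void spatialFilter(cublasHandle_t h, const float* dF, const float* dX,
                   float* dY, int nSpatial, int nChan, int nSamp)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                nSpatial, nSamp, nChan,      // output is (nSpatial x nSamp)
                &alpha, dF, nSpatial,
                        dX, nChan,
                &beta,  dY, nSpatial);
}
```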
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
2011-07-01
In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and full grid search (FS) for finding the optimal block displacement. In this evaluation we compared the execution time of GPU and CPU implementations for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer and 1000 for a non-integer search grid. The additional speedup for the non-integer search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with the number of cards is achievable. In addition we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU-based motion estimation methods, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the simplified unsymmetrical multi-hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed modest improvement even though the computational complexity of the FS GPU implementation is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
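A minimal CUDA sketch of the SAD criterion with full search is shown below: one thread evaluates one integer displacement for a single block. The names, the 16×16 block size, and the assumption that the search window stays inside the image are illustrative simplifications:

```cuda
// SAD full-search block matching: one thread evaluates one candidate
// displacement (dx, dy) in [-R, R]^2 for a BSxBS block anchored at (bx, by).
// The host scans sad[] for the argmin to obtain the motion vector.
#define BS 16
__global__ void sadSearch(const unsigned char* ref, const unsigned char* cur,
                          int width, int bx, int by, int R, float* sad)
{
    int dx = (int)(blockIdx.x * blockDim.x + threadIdx.x) - R;
    int dy = (int)(blockIdx.y * blockDim.y + threadIdx.y) - R;
    if (dx > R || dy > R) return;                // grid may overshoot the window
    float acc = 0.f;
    for (int y = 0; y < BS; ++y)
        for (int x = 0; x < BS; ++x)
            acc += fabsf((float)cur[(by + y) * width + (bx + x)] -
                         (float)ref[(by + dy + y) * width + (bx + dx + x)]);
    sad[(dy + R) * (2 * R + 1) + (dx + R)] = acc;
}
```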
Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units.
Cawkwell, M J; Sanville, E J; Mniszewski, S M; Niklasson, Anders M N
2012-11-13
The self-consistent solution of a Schrödinger-like equation for the density matrix is a critical and computationally demanding step in quantum-based models of interatomic bonding. This step was tackled historically via the diagonalization of the Hamiltonian. We have investigated the performance and accuracy of the second-order spectral projection (SP2) algorithm for the computation of the density matrix via a recursive expansion of the Fermi operator in a series of generalized matrix-matrix multiplications. We demonstrate that owing to its simplicity, the SP2 algorithm [Niklasson, A. M. N. Phys. Rev. B 2002, 66, 155115] is exceptionally well suited to implementation on graphics processing units (GPUs). The performance in double and single precision arithmetic of a hybrid GPU/central processing unit (CPU) and a full GPU implementation of the SP2 algorithm exceeds that of a CPU-only implementation of the SP2 algorithm and traditional matrix diagonalization when the dimensions of the matrices exceed about 2000 × 2000. Padding schemes for arrays allocated in the GPU memory that optimize the performance of the CUBLAS implementations of the level 3 BLAS DGEMM and SGEMM subroutines for generalized matrix-matrix multiplications are described in detail. The analysis of the relative performance of the hybrid CPU/GPU and full GPU implementations indicates that the transfer of arrays between the GPU and CPU constitutes only a small fraction of the total computation time. The errors measured in the self-consistent density matrices computed using the SP2 algorithm are generally smaller than those measured in matrices computed via diagonalization. Furthermore, the errors in the density matrices computed using the SP2 algorithm do not exhibit any dependence on system size, whereas the errors increase linearly with the number of orbitals when diagonalization is employed.
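For reference, the SP2 recursion evaluated by those GEMMs can be written as follows (one common formulation; here H is the Hamiltonian with spectral bounds ε_min and ε_max, N_occ is the number of occupied orbitals, and each iteration chooses the branch that moves the trace toward N_occ until X converges to the density matrix ρ):

```latex
X_0 = \frac{\varepsilon_{\max} I - H}{\varepsilon_{\max} - \varepsilon_{\min}},
\qquad
X_{n+1} =
\begin{cases}
X_n^{2}, & \operatorname{Tr}(X_n) > N_{\mathrm{occ}},\\
2X_n - X_n^{2}, & \text{otherwise},
\end{cases}
\qquad
X_n \longrightarrow \rho .
```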
Acceleration of Linear Finite-Difference Poisson-Boltzmann Methods on Graphics Processing Units.
Qi, Ruxi; Botello-Smith, Wesley M; Luo, Ray
2017-07-11
Electrostatic interactions play crucial roles in biophysical processes such as protein folding and molecular recognition. Poisson-Boltzmann equation (PBE)-based models have emerged as widely used tools for modeling these important processes. Though great efforts have been put into developing efficient PBE numerical models, challenges still remain due to the high dimensionality of typical biomolecular systems. In this study, we implemented and analyzed commonly used linear PBE solvers on the ever-improving graphics processing units (GPU) for biomolecular simulations, including both standard and preconditioned conjugate gradient (CG) solvers with several alternative preconditioners. Our implementation utilizes the standard NVIDIA CUDA libraries cuSPARSE, cuBLAS, and CUSP. Extensive tests show that good numerical accuracy can be achieved given that single precision is often used for numerical applications on GPU platforms. The optimal GPU performance was observed with the Jacobi-preconditioned CG solver, with a significant speedup over the standard CG solver on CPU in our diversified test cases. Our analysis further shows that different matrix storage formats also considerably affect the efficiency of different linear PBE solvers on GPU, with the diagonal format best suited for our standard finite-difference linear systems. Further efficiency may be possible with matrix-free operations and integrated grid stencil setup specifically tailored for the banded matrices in PBE-specific linear systems.
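Within a preconditioned CG iteration, applying the Jacobi preconditioner is itself an embarrassingly parallel elementwise division; a minimal sketch is below (the rest of the iteration, sparse matrix-vector products and dot products, would use cuSPARSE and cuBLAS as the paper describes). Names are illustrative:

```cuda
// Jacobi (diagonal) preconditioner application inside PCG: z = r ./ diag(A).
// For a finite-difference PBE system the diagonal is stored as a vector.
__global__ void jacobiApply(const float* diag, const float* r,
                            float* z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = r[i] / diag[i];
}
```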
Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
Energy Technology Data Exchange (ETDEWEB)
Shi, Xuanhua; Luo, Xuan; Liang, Junling; Zhao, Peng; Di, Sheng; He, Bingsheng; Jin, Hai
2018-01-01
GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, a coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of the large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weight asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on the Pareto principle (or 80-20 rule) applied to coloring algorithms, as we observed through a large number of real-world graph coloring cases. We find that a majority of vertices (about 80%) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP and CC). On all the tested applications and
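The color-by-color processing can be sketched as one kernel launch per color class, as below; since no two vertices of the same color are adjacent, the update needs no locks or atomics. The CSR layout and the PageRank-like update rule are illustrative assumptions, not Frog's actual kernels:

```cuda
// Process all vertices of one color class in parallel: within a color
// class no two vertices are adjacent, so updates are conflict-free.
__global__ void updateColor(const int* vertsOfColor, int nVerts,
                            const int* rowPtr, const int* colIdx,
                            const float* valIn, float* valOut)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nVerts) return;
    int v = vertsOfColor[t];
    float acc = 0.f;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)  // CSR neighbor scan
        acc += valIn[colIdx[e]];
    valOut[v] = 0.15f + 0.85f * acc;                 // simplified PageRank-like rule
}
```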
Smoldyn on graphics processing units: massively parallel Brownian dynamics simulations.
Dematté, Lorenzo
2012-01-01
Space is a very important aspect in the simulation of biochemical systems; recently, the need for simulation algorithms able to cope with space has become more and more compelling. Complex and detailed models of biochemical systems need to deal with the movement of single molecules and particles, taking into consideration localized fluctuations, transportation phenomena, and diffusion. A common drawback of spatial models lies in their complexity: models can become very large, and their simulation can be time consuming, especially if we want to capture the system's behavior in a reliable way using stochastic methods in conjunction with a high spatial resolution. In order to deliver on the promise of systems biology, namely the ability to understand a system as a whole, we need to scale up the size of the models we are able to simulate, moving from sequential to parallel simulation algorithms. In this paper, we analyze Smoldyn, a widely used algorithm for stochastic simulation of chemical reactions with spatial resolution and single-molecule detail, and we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs). The implementation executes the most computationally demanding steps (computation of diffusion, unimolecular and bimolecular reactions, as well as the most common cases of molecule-surface interaction) on the GPU, computing them in parallel for each molecule of the system. The implementation offers good speed-ups and real-time, high-quality graphics output.
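The per-molecule diffusion step maps naturally onto one thread per molecule. A hedged sketch using cuRAND is shown below; it covers only free Brownian motion, not Smoldyn's reaction and surface-interaction logic, and the names are illustrative:

```cuda
#include <curand_kernel.h>

// Brownian step for every molecule in parallel: each coordinate moves by a
// Gaussian displacement with std dev sqrt(2*D*dt). The per-thread RNG states
// are assumed to have been initialized once with curand_init.
__global__ void diffuse(float3* pos, const float* D, curandState* rng,
                        int nMol, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nMol) return;
    curandState s = rng[i];                 // load RNG state to registers
    float sigma = sqrtf(2.f * D[i] * dt);   // species-dependent step size
    pos[i].x += sigma * curand_normal(&s);
    pos[i].y += sigma * curand_normal(&s);
    pos[i].z += sigma * curand_normal(&s);
    rng[i] = s;                             // store updated RNG state
}
```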
GPU Pro 4 advanced rendering techniques
Engel, Wolfgang
2013-01-01
GPU Pro4: Advanced Rendering Techniques presents ready-to-use ideas and procedures that can help solve many of your day-to-day graphics programming challenges. Focusing on interactive media and games, the book covers up-to-date methods producing real-time graphics. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Sebastien St-Laurent have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book begins with discussions on the abi
Monte Carlo MP2 on Many Graphical Processing Units.
Doran, Alexander E; Hirata, So
2016-10-11
In the Monte Carlo second-order many-body perturbation (MC-MP2) method, the long sum-of-product matrix expression of the MP2 energy, whose literal evaluation may be poorly scalable, is recast into a single high-dimensional integral of functions of electron pair coordinates, which is evaluated by the scalable method of Monte Carlo integration. The sampling efficiency is further accelerated by the redundant-walker algorithm, which allows a maximal reuse of electron pairs. Here, a multitude of graphical processing units (GPUs) offers a uniquely ideal platform to expose multilevel parallelism: fine-grain data-parallelism for the redundant-walker algorithm in which millions of threads compute and share orbital amplitudes on each GPU; coarse-grain instruction-parallelism for near-independent Monte Carlo integrations on many GPUs with few and infrequent interprocessor communications. While the efficiency boost by the redundant-walker algorithm on central processing units (CPUs) grows linearly with the number of electron pairs and tends to saturate when the latter exceeds the number of orbitals, on a GPU it grows quadratically before it increases linearly and then eventually saturates at a much larger number of pairs. This is because the orbital constructions are nearly perfectly parallelized on a GPU and thus completed in a near-constant time regardless of the number of pairs. In consequence, an MC-MP2/cc-pVDZ calculation of a benzene dimer is 2700 times faster on 256 GPUs (using 2048 electron pairs) than on two CPUs, each with 8 cores (which can use only up to 256 pairs effectively). We also numerically determine that the cost to achieve a given relative statistical uncertainty in an MC-MP2 energy increases as O(n³) or better with system size n, which may be compared with the O(n⁵) scaling of the conventional implementation of deterministic MP2. We thus establish the scalability of MC-MP2 with both system and computer sizes.
International Nuclear Information System (INIS)
Nieto, J.; Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M.; Vega, J.; Castro, R.
2016-01-01
Highlights: • To test an image acquisition and processing system for Camera Link devices, based on an FPGA and compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to an NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology for including GPU processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies being used in real-time image processing are the Field Programmable Gate Array (FPGA) and the Graphical Processor Unit (GPU), owing to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated into data acquisition systems where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, for developing image acquisition and processing systems based on FPGAs and GPUs that are compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology) and an NVIDIA Tesla GPU series card has been developed and tested. The proposed architecture has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving data directly between the different hardware elements (FPGA DAQ-GPU-CPU), avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. A special effort has been put into providing a development methodology that, while maintaining the highest possible abstraction from the low-level implementation details, allows obtaining solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.
Molecular Monte Carlo Simulations Using Graphics Processing Units: To Waste Recycle or Not?
Kim, Jihan; Rodgers, Jocelyn M; Athènes, Manuel; Smit, Berend
2011-10-11
In the waste recycling Monte Carlo (WRMC) algorithm [1], multiple trial states may be simultaneously generated and utilized during Monte Carlo moves to improve the statistical accuracy of the simulations, suggesting that such an algorithm may be well suited to parallel implementation on graphics processing units (GPUs). In this paper, we implement two waste recycling Monte Carlo algorithms in CUDA (Compute Unified Device Architecture) using uniformly distributed random trial states and trial states based on displacement random-walk steps, and we test the methods on a methane-zeolite MFI framework system to evaluate their utility. We discuss the specific implementation details of the waste recycling GPU algorithm and compare the methods to other parallel algorithms optimized for the framework system. We analyze the relationship between the statistical accuracy of our simulations and the CUDA block size to determine the efficient allocation of the GPU hardware resources. We make comparisons between the GPU and the serial CPU Monte Carlo implementations to assess speedup over conventional microprocessors. Finally, we apply our optimized GPU algorithms to the important problem of determining free energy landscapes, in this case for molecular motion through the zeolite LTA.
Mei, Gang; Xu, Liangliang; Xu, Nengxiong
2017-09-01
This paper focuses on designing and implementing parallel adaptive inverse distance weighting (AIDW) interpolation algorithms by using the graphics processing unit (GPU). The AIDW is an improved version of the standard IDW, which can adaptively determine the power parameter according to the data points' spatial distribution pattern and achieve more accurate predictions than those predicted by IDW. In this paper, we first present two versions of the GPU-accelerated AIDW, i.e. the naive version without profiting from the shared memory and the tiled version taking advantage of the shared memory. We also implement the naive version and the tiled version using two data layouts, structure of arrays and array of aligned structures, on both single and double precision. We then evaluate the performance of parallel AIDW by comparing it with its corresponding serial algorithm on three different machines equipped with the GPUs GT730M, M5000 and K40c. The experimental results indicate that: (i) there is no significant difference in the computational efficiency when different data layouts are employed; (ii) the tiled version is always slightly faster than the naive version; and (iii) on single precision the achieved speed-up can be up to 763 (on the GPU M5000), while on double precision the obtained highest speed-up is 197 (on the GPU K40c). To benefit the community, all source code and testing data related to the presented parallel AIDW algorithm are publicly available.
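For orientation, a minimal (naive, global-memory) kernel for standard IDW is sketched below, with one thread per prediction point; in AIDW the constant exponent p would be replaced by an adaptively chosen per-point value. This is an illustration under assumed names, not the published code:

```cuda
// Standard inverse distance weighting: one thread interpolates one
// prediction point from all data points with weights w_i = d_i^(-p).
__global__ void idw(const float2* dataXY, const float* dataZ, int nData,
                    const float2* queryXY, float* queryZ, int nQuery,
                    float p)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nQuery) return;
    float num = 0.f, den = 0.f;
    for (int i = 0; i < nData; ++i) {
        float dx = queryXY[q].x - dataXY[i].x;
        float dy = queryXY[q].y - dataXY[i].y;
        float d  = sqrtf(dx * dx + dy * dy) + 1e-12f;  // avoid divide-by-zero
        float w  = powf(d, -p);                        // inverse-distance weight
        num += w * dataZ[i];
        den += w;
    }
    queryZ[q] = num / den;
}
```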
Initial Assessment of Parallelization of Monte Carlo Calculation using Graphics Processing Units
International Nuclear Information System (INIS)
Choi, Sung Hoon; Joo, Han Gyu
2009-01-01
Monte Carlo (MC) simulation is an effective tool for calculating neutron transport in complex geometry. However, because Monte Carlo simulates each neutron behavior one by one, it takes a very long computing time if enough neutrons are used for high precision of calculation. Accordingly, methods that reduce the computing time are required. A Monte Carlo code is well suited to parallel calculation since it simulates the behavior of each neutron independently, and thus parallel computation is natural. The parallelization of Monte Carlo codes, however, was previously done using multiple CPUs. Driven by the global demand for high-quality 3D graphics, the Graphics Processing Unit (GPU) has developed into a highly parallel, multi-core processor. This parallel processing capability of GPUs can be made available to engineering computing once a suitable interface is provided. Recently, NVIDIA introduced CUDA™, a general-purpose parallel computing architecture. CUDA is a software environment that allows developers to manage the GPU using C/C++ or other languages. In this work, a GPU-based Monte Carlo code is developed and an initial assessment of its parallel performance is investigated.
Directory of Open Access Journals (Sweden)
J. Adam Wilson
2009-07-01
Full Text Available The clock speeds of modern computer processors have nearly plateaued in the past five years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card (GPU) was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally-intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a CPU-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
Acceleration of PIC simulation with GPU
International Nuclear Information System (INIS)
Suzuki, Junya; Shimazu, Hironori; Fukazawa, Keiichiro; Den, Mitsue
2011-01-01
Particle-in-cell (PIC) is a simulation technique for plasma physics. The large number of particles in high-resolution plasma simulations increases the required volume of computation, making it vital to increase computation speed. In this study, we attempt to accelerate computation on graphics processing units (GPUs) using KEMPO, a PIC simulation code package. We perform two benchmark tests, with small and large grid sizes. In these tests, we run the KEMPO1 code using a CPU only, both a CPU and a GPU, and a GPU only. The results show that performance using only a GPU was twice that of using a CPU alone, while execution time when using both a CPU and a GPU was comparable to the CPU-only tests because of the significant communication bottleneck between the CPU and GPU. (author)
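The particle push at the heart of a PIC step is naturally data-parallel. The sketch below shows a simplified 1D electrostatic leapfrog push with linear field gathering (the charge-scatter step and KEMPO's electromagnetic machinery are omitted); names, normalized units, and the assumption that the field array carries one periodic guard node (nGrid + 1 entries) are illustrative:

```cuda
// Leapfrog electrostatic particle push: gather E at each particle from a
// 1D grid by linear weighting, then advance velocity and position.
__global__ void pushParticles(float* x, float* v, const float* E,
                              int nPart, int nGrid, float dx, float qm,
                              float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPart) return;
    float xi = x[i] / dx;
    int   j  = (int)xi;
    float f  = xi - (float)j;                     // weight to the right node
    float Ei = (1.f - f) * E[j] + f * E[j + 1];   // E assumed to have nGrid+1 nodes
    v[i] += qm * Ei * dt;                         // velocity update
    x[i] += v[i] * dt;                            // position update
    float Lx = nGrid * dx;                        // periodic boundary wrap
    if (x[i] < 0.f) x[i] += Lx; else if (x[i] >= Lx) x[i] -= Lx;
}
```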
Fales, B Scott; Levine, Benjamin G
2015-10-13
Methods based on a full configuration interaction (FCI) expansion in an active space of orbitals are widely used for modeling chemical phenomena such as bond breaking, multiply excited states, and conical intersections in small-to-medium-sized molecules, but these phenomena occur in systems of all sizes. To scale such calculations up to the nanoscale, we have developed an implementation of FCI in which electron repulsion integral transformation and several of the more expensive steps in σ vector formation are performed on graphical processing unit (GPU) hardware. When applied to a 1.7 × 1.4 × 1.4 nm silicon nanoparticle (Si72H64) described with the polarized, all-electron 6-31G** basis set, our implementation can solve for the ground state of the 16-active-electron/16-active-orbital CASCI Hamiltonian (more than 100,000,000 configurations) in 39 min on a single NVidia K40 GPU.
AN APPROACH TO EFFICIENT FEM SIMULATIONS ON GRAPHICS PROCESSING UNITS USING CUDA
Directory of Open Access Journals (Sweden)
Björn Nutti
2014-04-01
Full Text Available The paper presents a highly efficient way of simulating the dynamic behavior of deformable objects by means of the finite element method (FEM), with computations performed on Graphics Processing Units (GPU). The presented implementation reduces bottlenecks related to memory accesses by grouping the necessary data per node pair, in contrast to the classical per-element grouping. This strategy avoids memory access patterns that are unsuitable for the GPU memory architecture. Furthermore, the presented implementation takes advantage of the underlying sparse-block-matrix structure, and it is demonstrated how to avoid potential bottlenecks in the algorithm. To achieve plausible deformational behavior for large local rotations, the objects are modeled by means of a simplified co-rotational FEM formulation.
GPU Acceleration of DSP for Communication Receivers.
Gunther, Jake; Gunther, Hyrum; Moon, Todd
2017-09-01
Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data rate communication receivers including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on NVIDIA GeForce GTX 1080 graphics card.
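As one small example from such a chain, a numerically controlled oscillator and mixer can be fused into a single elementwise kernel; the sketch below translates a channel at carrier fc to baseband (the subsequent low-pass filtering and decimation stages are omitted). Names and the real-valued input are illustrative assumptions:

```cuda
// Numerically controlled oscillator + mixer: translate a channel at fc to
// baseband by multiplying real samples with exp(-j*2*pi*fc*n/fs).
__global__ void ncoMix(const float* in, float2* out, int n,
                       float fc, float fs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float ph = -2.f * 3.14159265f * fc * (float)i / fs;
    float s, c;
    sincosf(ph, &s, &c);                    // one fused sin/cos per sample
    out[i] = make_float2(in[i] * c, in[i] * s);  // complex baseband sample
}
```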
Harvesting graphics power for MD simulations
van Meel, J.A.; Arnold, A.; Frenkel, D.; Portegies Zwart, S.F.; Belleman, R.G.
2008-01-01
We discuss an implementation of molecular dynamics (MD) simulations on a graphic processing unit (GPU) in the NVIDIA CUDA language. We tested our code on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms suitable for short-ranged and long-ranged interactions, and a
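A minimal CUDA sketch of the force evaluation at the core of such an MD code is given below, using an all-pairs Lennard-Jones loop; a production short-ranged code would restrict the inner loop to a neighbor list. This is an illustration under assumed names and units, not the authors' kernels:

```cuda
// All-pairs Lennard-Jones force accumulation, one thread per particle.
// eps is the well depth; sig2 is sigma squared.
__global__ void ljForces(const float4* pos, float4* frc, int n,
                         float eps, float sig2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float fx = 0.f, fy = 0.f, fz = 0.f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float s2 = sig2 / r2;
        float s6 = s2 * s2 * s2;
        float f  = 24.f * eps * s6 * (2.f * s6 - 1.f) / r2;  // |F| / r
        fx += f * dx; fy += f * dy; fz += f * dz;
    }
    frc[i] = make_float4(fx, fy, fz, 0.f);
}
```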
Distributed GPU Computing in GIScience
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strive to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing amount of datasets from different domains, such as social media, earth observation and environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as clusters or supercomputers are costly. In the past several years, with GPU-based technology matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared with the traditional microprocessor, the modern GPU, as a compelling alternative, offers outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) for each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual supercomputer to support CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as graphics-targeted devices, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies Reference: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE
Real-time computation of parameter fitting and image reconstruction using graphical processing units
Locans, Uldis; Adelmann, Andreas; Suter, Andreas; Fischer, Jannis; Lustermann, Werner; Dissertori, Günther; Wang, Qiulin
2017-06-01
In recent years, graphical processing units (GPUs) have become a powerful tool in scientific computing. Their potential to speed up highly parallel applications brings the power of high-performance computing to a wider range of users. However, programming these devices and integrating their use into existing applications is still a challenging task. In this paper we examined the potential of GPUs for two different applications. The first application, created at the Paul Scherrer Institut (PSI), is used for parameter fitting during data analysis of μSR (muon spin rotation, relaxation and resonance) experiments. The second application, developed at ETH, is used for PET (Positron Emission Tomography) image reconstruction and analysis. Applications currently in use were examined to identify the parts of the algorithms in need of optimization, and efficient GPU kernels were created to speed up the previously identified parts. Benchmarking tests were performed to measure the achieved speedup. During this work, we focused on single-GPU systems to show that real-time data analysis of these problems can be achieved without the need for large computing clusters. The results show that the currently used application for parameter fitting, which uses OpenMP to parallelize calculations over multiple CPU cores, can be accelerated around 40 times through the use of a GPU. The speedup may vary depending on the size and complexity of the problem. For PET image analysis, the obtained speedups of the GPU version were more than 40 times those of a single-core CPU implementation. The achieved results show that it is possible to improve the execution time by orders of magnitude.
Mobile Ultrasound Plane Wave Beamforming on iPhone or iPad using Metal-based GPU Processing
Hewener, Holger J.; Tretbar, Steffen H.
Mobile and cost-effective ultrasound devices are being used in point-of-care scenarios and the trauma room. To reduce the costs of such devices, we have already presented the possibilities of consumer devices like the Apple iPad for full signal processing of raw data for ultrasound image generation. Using technologies like plane wave imaging to generate a full image with only one excitation/reception event, the acquisition times and power consumption of ultrasound imaging can be reduced for low-power mobile devices based on consumer electronics, realizing the transition from FPGA- or ASIC-based beamforming to more flexible software beamforming. The massively parallel beamforming processing can be done with the Apple framework "Metal" for advanced graphics and general-purpose GPU processing on the iOS platform. We were able to integrate the beamforming reconstruction into our mobile ultrasound processing application with imaging rates up to 70 Hz on iPad Air 2 hardware.
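For illustration, the delay-and-sum operation at the core of plane-wave beamforming is sketched below in CUDA (the paper itself uses Metal compute shaders on iOS); a zero-degree plane wave, nearest-neighbor sampling, and the absence of apodization are simplifying assumptions:

```cuda
// Plane-wave delay-and-sum: one thread reconstructs one image pixel by
// summing delayed RF samples across all receive elements.
__global__ void dasBeamform(const float* rf, int nSamp, int nElem,
                            float pitch, float fs, float c,
                            float* img, int nx, int nz, float dxy)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int pz = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= nx || pz >= nz) return;
    float x = (px - 0.5f * nx) * dxy, z = pz * dxy;   // pixel position
    float acc = 0.f;
    for (int e = 0; e < nElem; ++e) {
        float xe  = (e - 0.5f * nElem) * pitch;       // element position
        float tTx = z / c;                            // plane-wave transmit delay
        float tRx = sqrtf((x - xe) * (x - xe) + z * z) / c;  // receive delay
        int s = (int)((tTx + tRx) * fs);              // nearest RF sample
        if (s < nSamp) acc += rf[e * nSamp + s];
    }
    img[pz * nx + px] = acc;
}
```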
Energy Technology Data Exchange (ETDEWEB)
Jie, Liang [School of Information Science and Engineering, Hunan University, Changshang, 410082 (China); Li, KenLi, E-mail: lkl@hnu.edu.cn [School of Information Science and Engineering, Hunan University, Changshang, 410082 (China); National Supercomputing Center in Changsha, 410082 (China); Shi, Lin [School of Information Science and Engineering, Hunan University, Changshang, 410082 (China); Liu, RangSu [School of Physics and Micro Electronic, Hunan University, Changshang, 410082 (China); Mei, Jing [School of Information Science and Engineering, Hunan University, Changshang, 410082 (China)
2014-01-15
Molecular dynamics simulation is a powerful tool for simulating and analyzing complex physical processes and phenomena at the atomic level, predicting the natural time evolution of a system of atoms. Precise simulation of physical processes imposes strong requirements on both simulation size and computing timescale, so finding available computing resources is crucial to accelerating the computation. Recently, general-purpose GPUs (GPGPU) have been adopted as a tremendous computational resource due to their high floating-point arithmetic performance, wide memory bandwidth and enhanced programmability. Targeting the most time-consuming components of MD simulations in studies of liquid metal solidification processes, this paper presents a fine-grained spatial decomposition method that accelerates the neighbor-list update and the interaction force calculation by taking advantage of modern graphics processing units (GPU), enlarging the simulation to a system involving 10 000 000 atoms. In addition, a number of evaluations and tests, ranging from executions on different precision-enabled CUDA versions, over various types of GPU (NVIDIA 480GTX, 580GTX and M2050), to CPU clusters with different numbers of CPU cores, are discussed. The experimental results demonstrate that GPU-based calculations are typically 9-11 times faster than the corresponding sequential execution and approximately 1.5-2 times faster than 16-core CPU cluster implementations. On the basis of the simulated results, comparisons between the theoretical and experimental results show good agreement, with more complete and larger cluster structures observed in the actual macroscopic materials. Moreover, different nucleation and evolution mechanisms of nano-clusters and nano-crystals formed in the processes of metal solidification are observed with large
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.
Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto
2015-02-13
In the last few years, the non-negative matrix factorization (NMF) technique has gained great interest in the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by graphics processing units (GPUs). Driven by the ever-growing demands of the video-games industry, the graphics cards provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on the GPU. NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with little available memory, large input matrices are blockwise transferred from the system's main memory to the GPU's memory and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). On a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are receiving more and more attention due to their outstanding performance compared to traditional processors, and their relatively low price makes them a highly cost-effective alternative to conventional clusters. In the life sciences, this represents an excellent opportunity to facilitate the analysis of large-scale datasets.
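The blockwise strategy described above — streaming slabs of a matrix too large for device memory through a fixed-size buffer — might be sketched as follows; the slab kernel is a placeholder and all names are assumptions, not NMF-mGPU's actual code:

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Placeholder kernel standing in for one NMF update step on a slab of rows.
__global__ void processSlab(float* slab, int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) slab[idx] = fmaxf(slab[idx], 0.f); // enforce non-negativity
}

// Stream a host matrix through a fixed-size device buffer, slab by slab,
// so matrices larger than GPU memory can still be processed.
void processBlockwise(float* hostM, int nRows, int nCols, int slabRows) {
    float* dSlab;
    cudaMalloc((void**)&dSlab, (size_t)slabRows * nCols * sizeof(float));
    for (int r0 = 0; r0 < nRows; r0 += slabRows) {
        int rows = std::min(slabRows, nRows - r0);
        size_t bytes = (size_t)rows * nCols * sizeof(float);
        cudaMemcpy(dSlab, hostM + (size_t)r0 * nCols, bytes, cudaMemcpyHostToDevice);
        int n = rows * nCols;
        processSlab<<<(n + 255) / 256, 256>>>(dSlab, rows, nCols);
        cudaMemcpy(hostM + (size_t)r0 * nCols, dSlab, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dSlab);
}
```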
Energy Technology Data Exchange (ETDEWEB)
Rath, N., E-mail: Nikolaus@rath.org; Levesque, J. P.; Mauel, M. E.; Navratil, G. A.; Peng, Q. [Department of Applied Physics and Applied Mathematics, Columbia University, 500 W 120th St, New York, New York 10027 (United States); Kato, S. [Department of Information Engineering, Nagoya University, Nagoya (Japan)
2014-04-15
Fast digital signal processing (DSP) has many applications. Typical hardware options for performing DSP are field-programmable gate arrays (FPGAs), application-specific integrated DSP chips, or general-purpose personal computer systems. This paper presents a novel DSP platform that has been developed for feedback control on the HBT-EP tokamak device. The system runs all signal processing exclusively on a Graphics Processing Unit (GPU) to achieve real-time performance with latencies below 8 μs. Signals are transferred into and out of the GPU using PCI Express peer-to-peer direct-memory-access transfers without involvement of the central processing unit or host memory. Tests were performed on the feedback control system of the HBT-EP tokamak using forty 16-bit floating point inputs and outputs each and a sampling rate of up to 250 kHz. Signals were digitized by a D-TACQ ACQ196 module, processing was done on an NVIDIA GTX 580 GPU programmed in CUDA, and analog output was generated by D-TACQ AO32CPCI modules.
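The actual system streams samples by PCIe peer-to-peer DMA directly between digitizer and GPU, a vendor-specific path. As a generic stand-in for low-latency GPU processing of one control cycle, this hedged sketch uses pinned host memory and a CUDA stream (the trivial gain kernel and all names are assumptions):

```cuda
#include <cuda_runtime.h>

// Placeholder per-channel processing stage standing in for the real
// feedback-control filter chain.
__global__ void gainStage(const float* in, float* out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

int main() {
    const int nCh = 40;  // forty channels, as in the paper
    float *hIn, *hOut, *dIn, *dOut;
    cudaHostAlloc((void**)&hIn, nCh * sizeof(float), cudaHostAllocDefault);   // pinned
    cudaHostAlloc((void**)&hOut, nCh * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaMalloc((void**)&dIn, nCh * sizeof(float));
    cudaMalloc((void**)&dOut, nCh * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    // One control cycle: copy in, process, copy out, all on one stream.
    cudaMemcpyAsync(dIn, hIn, nCh * sizeof(float), cudaMemcpyHostToDevice, s);
    gainStage<<<1, nCh, 0, s>>>(dIn, dOut, nCh, 0.5f);
    cudaMemcpyAsync(hOut, dOut, nCh * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return 0;
}
```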
International Nuclear Information System (INIS)
Rath, N.; Levesque, J. P.; Mauel, M. E.; Navratil, G. A.; Peng, Q.; Kato, S.
2014-01-01
Fast digital signal processing (DSP) has many applications. Typical hardware options for performing DSP are field-programmable gate arrays (FPGAs), application-specific integrated DSP chips, or general-purpose personal computer systems. This paper presents a novel DSP platform that has been developed for feedback control on the HBT-EP tokamak device. The system runs all signal processing exclusively on a Graphics Processing Unit (GPU) to achieve real-time performance with latencies below 8 μs. Signals are transferred into and out of the GPU using PCI Express peer-to-peer direct-memory-access transfers without involvement of the central processing unit or host memory. Tests were performed on the feedback control system of the HBT-EP tokamak using forty 16-bit floating point inputs and outputs each and a sampling rate of up to 250 kHz. Signals were digitized by a D-TACQ ACQ196 module, processing was done on an NVIDIA GTX 580 GPU programmed in CUDA, and analog output was generated by D-TACQ AO32CPCI modules.
Functional graphical languages for process control
International Nuclear Information System (INIS)
1996-01-01
A wide variety of safety systems are in use today in the process industries. Most of these systems rely on control software using procedural programming languages. This study investigates the use of functional graphical languages for controls in the process industry. Different vendor proprietary software and languages are investigated and evaluation criteria are outlined based on ability to meet regulatory requirements, reference sites involving applications with similar safety concerns, QA/QC procedures, community of users, type and user-friendliness of the man-machine interface, performance of operational code, and degree of flexibility. (author) 16 refs., 4 tabs
Directory of Open Access Journals (Sweden)
Mathew G. Pelletier
2008-02-01
One of the main hurdles standing in the way of optimal cleaning of cotton lint is the lack of sensing systems that can react fast enough to provide the control system with real-time information on the level of trash contamination of the cotton lint. This research examines the use of programmable graphics processing units (GPUs) as an alternative to the PC's traditional use of the central processing unit (CPU). The use of the GPU as an alternative computation platform allowed the machine-vision system to gain a significant improvement in processing time. By improving the processing time, this research seeks to address the lack of availability of rapid trash-sensing systems and thus alleviate a situation in which current systems view the cotton lint either well before, or after, the cotton is cleaned. This extended lag/lead time currently imposed on cotton trash cleaning control systems is what is responsible for system operators using a very large dead-band safety buffer to ensure that the cotton lint is not under-cleaned. Unfortunately, the use of a large dead-band buffer results in the majority of the cotton lint being over-cleaned, which in turn causes lint fiber damage as well as significant losses of valuable lint due to the excessive use of cleaning machinery. This research estimates that upwards of a 30% reduction in lint loss could be gained through the use of a trash sensor tightly coupled to the cleaning machinery control systems. It seeks to improve processing times through the development of a new algorithm for cotton trash sensing that allows for implementation on a highly parallel architecture. Additionally, by moving the new parallel algorithm onto an alternative computing platform, the graphics processing unit "GPU", for processing of the cotton trash images, a speed-up of over 6.5 times over optimized code running on the PC's central processing unit was obtained.
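As a hedged illustration of the per-pixel parallelism such a machine-vision stage exploits (the 2008 work used GPU shader hardware of its era, and its algorithm is more sophisticated), a minimal CUDA kernel classifying dark pixels as candidate trash might look like this; the threshold and image layout are assumptions:

```cuda
#include <cuda_runtime.h>

// Illustrative per-pixel classification: dark pixels are flagged as
// candidate trash against the bright cotton lint background.
__global__ void flagTrash(const unsigned char* gray, unsigned char* mask,
                          int width, int height, unsigned char threshold)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int i = y * width + x;
    mask[i] = (gray[i] < threshold) ? 255 : 0;  // 255 marks candidate trash
}
```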
Multidisciplinary Simulation Acceleration using Multiple Shared-Memory Graphical Processing Units
Kemal, Jonathan Yashar
For purposes of optimizing and analyzing turbomachinery and other designs, the unsteady Favre-averaged flow-field differential equations for an ideal compressible gas can be solved in conjunction with the heat conduction equation. We solve all equations using the finite-volume multiple-grid numerical technique, with the dual time-step scheme used for unsteady simulations. Our numerical solver code targets CUDA-capable Graphical Processing Units (GPUs) produced by NVIDIA. Making use of MPI, our solver can run across networked compute nodes, where each MPI process can use either a GPU or a Central Processing Unit (CPU) core for primary solver calculations, as sketched below. We use NVIDIA Tesla C2050/C2070 GPUs based on the Fermi architecture, and compare our resulting performance against Intel Xeon X5690 CPUs. Solver routines converted to CUDA typically run about 10 times faster on a GPU for sufficiently dense computational grids. We used a conjugate cylinder computational grid and ran a turbulent steady flow simulation using 4 increasingly dense computational grids. Our densest computational grid is divided into 13 blocks each containing 1033x1033 grid points, for a total of 13.87 million grid points or 1.07 million grid points per domain block. To obtain overall speedups, we compare the execution time of the solver's iteration loop, including all resource-intensive GPU-related memory copies. Comparing the performance of 8 GPUs to that of 8 CPUs, we obtain an overall speedup of about 6.0 when using our densest computational grid. This amounts to an 8-GPU simulation running about 39.5 times faster than a single-CPU simulation.
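A minimal sketch of the MPI/GPU pairing described in the thesis, where each MPI process drives either a GPU or a CPU core; binding by round-robin local rank is an assumption, not necessarily the solver's actual policy:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Each MPI process binds to one GPU on its node, falling back to the
// CPU path when no device is available.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices > 0) {
        cudaSetDevice(rank % nDevices);  // round-robin ranks onto GPUs
        printf("rank %d -> GPU %d of %d\n", rank, rank % nDevices, nDevices);
    } else {
        printf("rank %d -> CPU fallback\n", rank);  // CPU-core solver path
    }

    MPI_Finalize();
    return 0;
}
```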
Energy- and cost-efficient lattice-QCD computations using graphics processing units
Energy Technology Data Exchange (ETDEWEB)
Bach, Matthias
2014-07-01
Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large-scale experiments like the Large Hadron Collider (LHC) at CERN and, in the future, at the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD for high energies. Studies from first principles are possible via a discretization onto an Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus - despite the complexity of LQCD applications - it led to the development of several specialized compute platforms and influenced the development of others. However, in recent years General-Purpose computation on Graphics Processing Units (GPGPU) came up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a mass-market product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL-based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D kernel for a single GPU, achieving 120 GFLOPS. D - the most compute-intensive kernel in LQCD simulations - is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system.
Energy- and cost-efficient lattice-QCD computations using graphics processing units
International Nuclear Information System (INIS)
Bach, Matthias
2014-01-01
Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large-scale experiments like the Large Hadron Collider (LHC) at CERN and, in the future, at the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD for high energies. Studies from first principles are possible via a discretization onto an Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus - despite the complexity of LQCD applications - it led to the development of several specialized compute platforms and influenced the development of others. However, in recent years General-Purpose computation on Graphics Processing Units (GPGPU) came up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a mass-market product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL-based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D kernel for a single GPU, achieving 120 GFLOPS. D - the most compute-intensive kernel in LQCD simulations - is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system.
Energy Technology Data Exchange (ETDEWEB)
Almeida, Adino Americo Heimlich
2009-07-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general purposes, their application has been extended to other fields outside the scope of computer graphics. The main objective of this work is to evaluate the impact of using the GPU in two typical problems of the Nuclear area: neutron transport simulation using the Monte Carlo method, and solving the heat equation in a two-dimensional domain by the finite-difference method. To achieve this, we developed parallel algorithms for GPU and CPU for the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU on a computer with two quad-core processors, without loss of precision. (author)
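For the second problem, a minimal sketch of an explicit finite-difference update for the two-dimensional heat equation, one thread per interior grid point; grid names and the update constant are illustrative, not the author's code:

```cuda
#include <cuda_runtime.h>

// Explicit finite-difference step for the 2-D heat equation on a
// uniform grid; boundary points are left untouched (fixed boundary).
__global__ void heatStep(const float* u, float* uNew,
                         int nx, int ny, float alpha)  // alpha = k*dt/h^2
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    int c = j * nx + i;
    uNew[c] = u[c] + alpha * (u[c - 1] + u[c + 1]
                            + u[c - nx] + u[c + nx] - 4.f * u[c]);
}
```

For stability of this explicit scheme, alpha must stay below 0.25; a host loop would swap the u and uNew buffers between steps.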
High-Throughput Characterization of Porous Materials Using Graphics Processing Units
Energy Technology Data Exchange (ETDEWEB)
Kim, Jihan; Martin, Richard L.; Rübel, Oliver; Haranczyk, Maciej; Smit, Berend
2012-05-08
We have developed a high-throughput graphics processing units (GPU) code that can characterize a large database of crystalline porous materials. In our algorithm, the GPU is utilized to accelerate energy grid calculations where the grid values represent interactions (i.e., Lennard-Jones + Coulomb potentials) between gas molecules (i.e., CH$_{4}$ and CO$_{2}$) and the material's framework atoms. Using a parallel flood fill CPU algorithm, inaccessible regions inside the framework structures are identified and blocked based on their energy profiles. Finally, we compute the Henry coefficients and heats of adsorption through statistical Widom insertion Monte Carlo moves in the domain restricted to the accessible space. The code offers significant speedup over a single-core CPU code and allows us to characterize a set of porous materials at least an order of magnitude larger than those considered in earlier studies. For structures selected by such a prescreening algorithm, full adsorption isotherms can be calculated by conducting multiple grand canonical Monte Carlo simulations concurrently within the GPU.
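The energy-grid step lends itself to one thread per grid point summing interactions over all framework atoms; a hedged sketch follows, with placeholder parameters and units (not the authors' production code):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per grid point: sum Lennard-Jones + Coulomb interactions
// between a probe gas molecule and all framework atoms. Single LJ
// parameters and folded-in Coulomb units are simplifying assumptions.
__global__ void energyGrid(const float4* atoms,  // xyz + partial charge
                           int nAtoms, float* grid,
                           int nx, int ny, int nz, float h,
                           float eps, float sigma, float qProbe)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nx * ny * nz) return;
    int i = idx % nx, j = (idx / nx) % ny, k = idx / (nx * ny);
    float gx = i * h, gy = j * h, gz = k * h;  // grid-point coordinates

    float e = 0.f;
    for (int a = 0; a < nAtoms; ++a) {
        float4 p = atoms[a];
        float dx = gx - p.x, dy = gy - p.y, dz = gz - p.z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-12f;
        float s2 = sigma * sigma / r2;
        float s6 = s2 * s2 * s2;
        e += 4.f * eps * (s6 * s6 - s6)   // Lennard-Jones
           + qProbe * p.w * rsqrtf(r2);   // Coulomb (constants folded in)
    }
    grid[idx] = e;
}
```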
Accelerating large-scale protein structure alignments with graphics processing units
Directory of Open Access Journals (Sweden)
Pang Bin
2012-02-01
Abstract Background Large-scale protein structure alignment, an indispensable tool in structural bioinformatics, poses a tremendous challenge to computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings We present ppsAlign, a parallel protein structure alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign can take many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card, and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues of protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massively parallel computing power of GPUs.
Monte Carlo methods for neutron transport on graphics processing units using Cuda - 015
International Nuclear Information System (INIS)
Nelson, A.G.; Ivanov, K.N.
2010-01-01
This work examined the feasibility of utilizing Graphics Processing Units (GPUs) to accelerate Monte Carlo neutron transport simulations. First, a clean-sheet MC code was written in C++ for an x86 CPU and later ported to run on GPUs using NVIDIA's CUDA programming language. After further optimization, the GPU ran 21 times faster than the CPU code when using single-precision floating point math. This can be further increased with no additional effort if accuracy is sacrificed for speed: using a compiler flag, the speedup was increased to 22x. Further, if double-precision floating point math is desired for neutron tracking through the geometry, a speedup of 11x was obtained. The GPUs have proven to be useful in this study, but the current generation does have limitations: the maximum memory currently available on a single GPU is only 4 GB; the GPU RAM does not provide error-checking and correction; and the optimization required for large speedups can lead to confusing code. (authors)
International Nuclear Information System (INIS)
Setiani, Tia Dwi; Suprijadi; Haryanto, Freddy
2016-01-01
Monte Carlo (MC) is one of the most powerful techniques for simulation in x-ray imaging. The MC method can simulate radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the MC-based codes widely used for radiographic image simulation is MC-GPU, a code developed by Andreu Badal. This study investigates the computation time of x-ray imaging simulation on a GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters on the quality of the radiographic images, and a comparison of the image quality resulting from simulation on the GPU and on the CPU, are evaluated in this paper. The simulations were run on a CPU in serial, and on two GPUs with 384 and 2304 cores. In the GPU simulations each core calculates one photon, so a large number of photons are calculated simultaneously. The results show that simulation times on the GPU were significantly shorter than on the CPU: the simulations on the 2304-core GPU were about 64-114 times faster than on the CPU, while those on the 384-core GPU were about 20-31 times faster than on a single CPU core. Another result shows that optimum image quality was obtained with 10^8 or more photon histories and energies from 60 keV to 90 keV. Analyzed by a statistical approach, the quality of the GPU and CPU images is essentially the same.
Energy Technology Data Exchange (ETDEWEB)
Setiani, Tia Dwi, E-mail: tiadwisetiani@gmail.com [Computational Science, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132 (Indonesia); Suprijadi [Computational Science, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132 (Indonesia); Nuclear Physics and Biophysics Research Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132 (Indonesia); Haryanto, Freddy [Nuclear Physics and Biophysics Research Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132 (Indonesia)
2016-03-11
Monte Carlo (MC) is one of the most powerful techniques for simulation in x-ray imaging. The MC method can simulate radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the MC-based codes widely used for radiographic image simulation is MC-GPU, a code developed by Andreu Badal. This study investigates the computation time of x-ray imaging simulation on a GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters on the quality of the radiographic images, and a comparison of the image quality resulting from simulation on the GPU and on the CPU, are evaluated in this paper. The simulations were run on a CPU in serial, and on two GPUs with 384 and 2304 cores. In the GPU simulations each core calculates one photon, so a large number of photons are calculated simultaneously. The results show that simulation times on the GPU were significantly shorter than on the CPU: the simulations on the 2304-core GPU were about 64-114 times faster than on the CPU, while those on the 384-core GPU were about 20-31 times faster than on a single CPU core. Another result shows that optimum image quality was obtained with 10{sup 8} or more photon histories and energies from 60 keV to 90 keV. Analyzed by a statistical approach, the quality of the GPU and CPU images is essentially the same.
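The "each core calculates one photon" mapping can be sketched with one CURAND state per thread; the toy slab-attenuation physics below illustrates the thread mapping only, not MC-GPU's transport model:

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// One thread tracks one photon through a homogeneous slab using a
// toy exponential-attenuation model; transmitted photons are counted.
__global__ void photonKernel(unsigned long long seed, int nPhotons,
                             float mu, float thickness,
                             unsigned int* transmitted)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPhotons) return;
    curandState st;
    curand_init(seed, i, 0, &st);  // independent stream per photon

    // Sample a free path length; count photons crossing the slab.
    float path = -logf(curand_uniform(&st)) / mu;
    if (path > thickness) atomicAdd(transmitted, 1u);
}
```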
Option pricing with COS method on Graphics Processing Units
B. Zhang (Bo); C.W. Oosterlee (Kees)
2009-01-01
In this paper, acceleration on the GPU of option pricing by the COS method is demonstrated. In particular, both European and Bermudan options are discussed in detail. For Bermudan options, we consider both the Black-Scholes model and Levy processes of infinite activity.
High performance GPU processing for inversion using uniform grid searches
Venetis, Ioannis E.; Saltogianni, Vasso; Stiros, Stathis; Gallopoulos, Efstratios
2017-04-01
Many geophysical problems are described by redundant, highly non-linear systems of ordinary equations whose constant terms derive from measurements and hence represent stochastic variables. Solution (inversion) of such problems is based on numerical optimization methods, using Monte Carlo sampling or, in cases of two or even three "free" unknown variables, exhaustive searches. Recently the TOPological INVersion (TOPINV) algorithm, a grid-search-based technique in R^n space, has been proposed. TOPINV is not based on the minimization of a cost function and involves only forward computations, hence avoiding computational errors. The basic concept is to transform observation equations into inequalities on the basis of an optimization parameter k and of their standard errors, and through repeated "scans" of n-dimensional search grids for decreasing values of k to identify the optimal clusters of gridpoints which satisfy the observation inequalities and by definition contain the "true" solution. Stochastic optimal solutions and their variance-covariance matrices are then computed as first and second statistical moments. Such exhaustive uniform searches produce an excessive computational load and are extremely time-consuming on common CPU-based computers. An alternative is to use a computing platform based on a GPU, which is nowadays affordable to the research community and provides much higher computing performance. Implementing TOPINV in the CUDA programming language allows investigation of the speedup attainable on such a high-performance platform. Based on synthetic data, we compared the execution times required for two typical geophysical problems, modeling magma sources and seismic faults, described with up to 18 unknown variables, on both CPU/FORTRAN and GPU/CUDA platforms, for several sizes of search grids (up to 10^12 gridpoints) and numbers of unknown variables.
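The grid scan parallelizes naturally as one thread per candidate gridpoint testing all observation inequalities; in the hedged sketch below, the forward model is a toy placeholder for TOPINV's problem-specific magma-source or fault models:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Toy linear forward model standing in for the real, problem-specific one.
__device__ float forwardModel(const float* x, int nVars, int m) {
    float s = 0.f;
    for (int v = 0; v < nVars; ++v) s += (v + 1) * x[v];
    return s * (m + 1);
}

// One thread per gridpoint: accept the candidate model vector if it
// satisfies every observation inequality |f_m(x) - d_m| <= k * sigma_m.
__global__ void scanGrid(const float* gridPts,  // [nPts * nVars] candidates
                         const float* d, const float* sigma, int nObs,
                         int nPts, int nVars, float k, int* accepted)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPts) return;
    const float* x = gridPts + p * nVars;
    bool ok = true;
    for (int m = 0; m < nObs && ok; ++m)
        ok = fabsf(forwardModel(x, nVars, m) - d[m]) <= k * sigma[m];
    accepted[p] = ok ? 1 : 0;  // accepted cluster brackets the "true" solution
}
```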
CUDA/GPU Technology : Parallel Programming For High Performance Scientific Computing
YUHENDRA; KUZE, Hiroaki; JOSAPHAT, Tetuko Sri Sumantyo
2009-01-01
Graphics processing units (GPUs), originally designed for computer video cards, have emerged as the most powerful chips in high-performance workstations. In terms of high-performance computation capabilities, graphics processing units (GPUs) deliver far more powerful performance than conventional CPUs by means of parallel processing. In 2007, the birth of the Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in general-purpose GPU applications.
BarraCUDA - a fast short read sequence aligner using graphics processing units
Directory of Open Access Journals (Sweden)
Klus Petr
2012-01-01
Abstract Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General-purpose computing on graphics processing units (GPGPU) extracts computing power from the hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy-efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computation-intensive alignment component of BWA to the GPU to take advantage of its massive parallelism. As a result, BarraCUDA offers an order-of-magnitude boost in alignment throughput compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of using multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of GPU parallelism to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline such that the wider scientific community can benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net
BarraCUDA - a fast short read sequence aligner using graphics processing units
LENUS (Irish Health Repository)
Klus, Petr
2012-01-13
Abstract Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General-purpose computing on graphics processing units (GPGPU) extracts computing power from the hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy-efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computation-intensive alignment component of BWA to the GPU to take advantage of its massive parallelism. As a result, BarraCUDA offers an order-of-magnitude boost in alignment throughput compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of using multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of GPU parallelism to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline such that the wider scientific community can benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net
Application of GPU to computational multiphase fluid dynamics
International Nuclear Information System (INIS)
Nagatake, T; Kunugi, T
2010-01-01
The MARS (Multi-interfaces Advection and Reconstruction Solver) [1] is one of the surface volume tracking methods for multi-phase flows. Nowadays, the performance of the GPU (Graphics Processing Unit) is much higher than that of the CPU (Central Processing Unit). In this study, the GPU was applied to the MARS in order to accelerate the computation of multi-phase flows (GPU-MARS), and the performance of the GPU-MARS is discussed. For the interface tracking method applied to a one-directional advection problem, the GPU (a single GTX 280) was around 4 times faster than the CPU (Xeon 5040, 4 threads parallelized). For the Poisson solver using the algorithm developed in this study, the GPU was around 30 times faster than the CPU. Finally, it is confirmed that the GPU greatly accelerates the fluid flow computation (GPU-MARS) compared to the CPU. However, it is also found that double-precision computation on the GPU is required when very high precision is demanded.
International Nuclear Information System (INIS)
Paćko, P; Bielak, T; Staszewski, W J; Uhl, T; Spencer, A B; Worden, K
2012-01-01
This paper demonstrates new parallel computation technology and an implementation for Lamb wave propagation modelling in complex structures. A graphics processing unit (GPU) and the compute unified device architecture (CUDA), available in low-cost graphics cards in standard PCs, are used for Lamb wave propagation numerical simulations. The local interaction simulation approach (LISA) wave propagation algorithm has been implemented as an example. Other algorithms suitable for parallel discretization can also be used in practice. The method is illustrated using examples related to damage detection. The results demonstrate good accuracy and effective computational performance for very large models. The wave propagation modelling presented in the paper can be used in many practical applications of science and engineering. (paper)
MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL.
Hua, Guan-Jie; Hung, Che-Lun; Lin, Chun-Yuan; Wu, Fu-Che; Chan, Yu-Wei; Tang, Chuan Yi
2017-01-01
A phylogenetic tree is a visual diagram of the relationships between a set of biological species, which scientists use to analyze many characteristics of the species. Distance-matrix methods, such as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa, but they suffer from poor computational performance. Although several new methods using high-performance hardware and frameworks have been proposed, the issue persists. In this work, a novel parallel UPGMA approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from an extremely large set of sequences. The experimental results show that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphics cards achieves approximately 3-fold to 7-fold speedup over implementations of UPGMA on a modern CPU and a single GPU, respectively.
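The dominant cost in distance-matrix methods is filling the pairwise matrix, which maps naturally to one GPU thread per (i, j) pair; in this sketch a Euclidean distance over feature vectors stands in for a genetic distance (an assumption), and the multi-GPU/NCCL layer is omitted:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per (i, j) pair fills the symmetric distance matrix;
// the upper triangle is computed and mirrored.
__global__ void pairwiseDist(const float* feat,  // [nTaxa * dim] features
                             float* dist,        // [nTaxa * nTaxa] output
                             int nTaxa, int dim)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nTaxa || j >= nTaxa || j <= i) return;

    float s = 0.f;
    for (int k = 0; k < dim; ++k) {
        float diff = feat[i * dim + k] - feat[j * dim + k];
        s += diff * diff;
    }
    float dval = sqrtf(s);
    dist[i * nTaxa + j] = dval;
    dist[j * nTaxa + i] = dval;  // symmetric matrix
}
```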
Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments
Directory of Open Access Journals (Sweden)
Jyh-Da Wei
2017-08-01
High-end graphics processing units (GPUs), such as the NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied to high-performance computing over the past decade. These desktop GPU cards must be installed in personal computers or servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been applied in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and that work also proved that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio, by comparing the STK platform with desktop CPUs and GPUs. In this work, an embedded GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results showed that speedup ratios of 5.5 and 4.8 times were achieved for ClustalW v2.0.11 and ClustalWtk, respectively, when comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.
Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments.
Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu
2017-01-01
High-end graphics processing units (GPUs), such as the NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied to high-performance computing over the past decade. These desktop GPU cards must be installed in personal computers or servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been applied in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and that work also proved that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio, by comparing the STK platform with desktop CPUs and GPUs. In this work, an embedded GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results showed that speedup ratios of 5.5 and 4.8 times were achieved for ClustalW v2.0.11 and ClustalWtk, respectively, when comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.
FLOCKING-BASED DOCUMENT CLUSTERING ON THE GRAPHICS PROCESSING UNIT [Book Chapter]
Energy Technology Data Exchange (ETDEWEB)
Charles, J S; Patton, R M; Potok, T E; Cui, X
2008-01-01
Analyzing and grouping documents by content is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. Each bird represents a single document and flies toward other documents that are similar to it. One limitation of this method of document clustering is its complexity O(n^2). As the number of documents grows, it becomes increasingly difficult to receive results in a reasonable amount of time. However, flocking behavior, along with most naturally inspired algorithms such as ant colony optimization and particle swarm optimization, is highly parallel and has experienced improved performance on expensive cluster computers. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly parallel and semi-parallel problems much faster than the traditional sequential processor. Some applications see a huge increase in performance on this new platform. The cost of these high-performance devices is also marginal when compared with the price of cluster machines. In this paper, we have conducted research to exploit this architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to all naturally inspired algorithms. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GeForce 8800. Additionally, we developed a similar but sequential implementation of the same algorithm to be run on a desktop CPU. We tested the performance of each on groups of news articles ranging in size from 200 to 3,000 documents. The results of these tests were very significant. Performance gains ranged from three to nearly five times improvement of the GPU over the CPU implementation. This dramatic improvement in runtime makes the GPU a potentially revolutionary platform for document clustering algorithms.
GPU based acceleration of first principles calculation
International Nuclear Information System (INIS)
Tomono, H; Tsumuraya, K; Aoki, M; Iitaka, T
2010-01-01
We present Graphics Processing Unit (GPU) accelerated simulations of first-principles electronic structure calculations. The FFT, which is the most time-consuming part, is accelerated by about a factor of 10. As a result, the total computation time of a first-principles calculation is reduced to 15 percent of that of the CPU.
GPU Accelerated Surgical Simulators for Complex Morphology
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
…a spring-mass system in order to simulate a complex organ such as the heart. Computations are accelerated by taking advantage of modern graphics processing units (GPUs). Two GPU implementations are presented. They vary in their generality of spring connections and in the speedup factor they achieve…
Acceleration of the OpenFOAM-based MHD solver using graphics processing units
International Nuclear Information System (INIS)
He, Qingyun; Chen, Hongli; Feng, Jingchao
2015-01-01
Highlights: • A 3D PISO-MHD solver was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. • A consistent and conservative scheme is used in the code, validated by three basic benchmarks in rectangular and round ducts. • Parallelized CPU and GPU acceleration were compared relative to a single CPU core for MHD and non-MHD problems. • Different preconditioners for the MHD solver were compared, and the results showed that the AMG method is better for these calculations. - Abstract: The pressure-implicit with splitting of operators (PISO) magnetohydrodynamics (MHD) solver for the coupled Navier–Stokes and Maxwell equations was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. The solver is developed on the open-source code OpenFOAM, based on a consistent and conservative scheme which is suitable for simulating MHD flow under the strong magnetic fields of fusion liquid-metal blankets with structured or unstructured meshes. We verified the validity of the implementation on several standard cases, including benchmark I (Shercliff and Hunt's cases), benchmark II (fully developed circular-pipe MHD flow) and benchmark III (the KIT experimental case). Computational performance of the GPU implementation was examined by comparing its double-precision run times with those of essentially the same algorithms and meshes on the CPU. The results showed that a GPU (GTX 770) can outperform a server-class 4-core, 8-thread CPU (Intel Core i7-4770k) by a factor of at least 2.
Acceleration of the OpenFOAM-based MHD solver using graphics processing units
Energy Technology Data Exchange (ETDEWEB)
He, Qingyun; Chen, Hongli, E-mail: hlchen1@ustc.edu.cn; Feng, Jingchao
2015-12-15
Highlights: • A 3D PISO-MHD solver was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. • A consistent and conservative scheme is used in the code, validated by three basic benchmarks in rectangular and round ducts. • Parallelized CPU and GPU acceleration were compared relative to a single CPU core for MHD and non-MHD problems. • Different preconditioners for the MHD solver were compared, and the results showed that the AMG method is better for these calculations. - Abstract: The pressure-implicit with splitting of operators (PISO) magnetohydrodynamics (MHD) solver for the coupled Navier–Stokes and Maxwell equations was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. The solver is developed on the open-source code OpenFOAM, based on a consistent and conservative scheme which is suitable for simulating MHD flow under the strong magnetic fields of fusion liquid-metal blankets with structured or unstructured meshes. We verified the validity of the implementation on several standard cases, including benchmark I (Shercliff and Hunt's cases), benchmark II (fully developed circular-pipe MHD flow) and benchmark III (the KIT experimental case). Computational performance of the GPU implementation was examined by comparing its double-precision run times with those of essentially the same algorithms and meshes on the CPU. The results showed that a GPU (GTX 770) can outperform a server-class 4-core, 8-thread CPU (Intel Core i7-4770k) by a factor of at least 2.
Watanabe, Yuuki; Takahashi, Yuhei; Numazawa, Hiroshi
2014-02-01
We demonstrate intensity-based optical coherence tomography (OCT) angiography using the squared difference of two sequential frames with bulk-tissue-motion (BTM) correction. This motion correction was performed by minimization of the sum of the pixel values using axial- and lateral-pixel-shifted structural OCT images. We extract the BTM-corrected image from a total of 25 calculated OCT angiographic images. Image processing was accelerated by a graphics processing unit (GPU) with many stream processors to optimize the parallel processing procedure. The GPU processing rate was faster than that of a line scan camera (46.9 kHz). Our OCT system provides the means of displaying structural OCT images and BTM-corrected OCT angiographic images in real time.
Development of parallel GPU based algorithms for problems in nuclear area
International Nuclear Information System (INIS)
Almeida, Adino Americo Heimlich
2009-01-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general purposes, their application has been extended to other fields outside the scope of computer graphics. The main objective of this work is to evaluate the impact of using the GPU in two typical problems of the Nuclear area: neutron transport simulation using the Monte Carlo method, and solving the heat equation in a two-dimensional domain by the finite-difference method. To achieve this, we developed parallel algorithms for GPU and CPU for the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU on a computer with two quad-core processors, without loss of precision. (author)
GPU-based high performance Monte Carlo simulation in neutron transport
Energy Technology Data Exchange (ETDEWEB)
Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil). Lab. de Inteligencia Artificial Aplicada], e-mail: cmnap@ien.gov.br
2009-07-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
GPU-based high performance Monte Carlo simulation in neutron transport
International Nuclear Information System (INIS)
Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A.
2009-01-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
Validation of GPU based TomoTherapy dose calculation engine.
Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond
2012-04-01
The graphics processing unit (GPU) based TomoTherapy convolution/superposition (C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architectural difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes make the GPU dose slightly different from the CPU-cluster dose. Before commercial release of the GPU dose engine, its accuracy had to be validated. Thirty-eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalence between the two. Gamma indices (Γ) were used for the equivalence evaluation. The GPU dose was further verified with absolute point dose measurements with an ion chamber and with film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantoms and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine: the majority of cases had over 99.99% of voxels passing Γ(1%, 1 mm). The GPU dose engine also showed a similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster-based dose engine without degradation in dose accuracy.
Monk, J.; Zhu, Y.; Koons, P. O.; Segee, B. E.
2009-12-01
With the introduction of the G8X series of cards by NVIDIA, an architecture called CUDA was released; virtually all subsequent video cards have had CUDA support. With this new architecture NVIDIA provided extensions for C/C++ that create an Application Programming Interface (API) allowing code to be executed on the GPU. Since then the concept of GPGPU (general-purpose graphics processing unit) computing has been growing: the GPU is very good at algebra and at running things in parallel, so that power should be put to use for other applications. This is highly appealing in the area of geodynamic modeling, as multiple parallel solutions of the same differential equations at different points in space lead to a large speedup in simulation speed. Another benefit of CUDA is a programmatic method of transferring large amounts of data between the computer's main memory and the dedicated GPU memory located on the video card. In addition to being able to compute and render on the video card, the CUDA framework allows for a large speedup in situations, such as with a tiled display wall, where the rendered pixels are to be displayed in a different location from where they are rendered. A CUDA extension for VirtualGL was developed allowing faster read-back at high resolutions. This paper examines several aspects of rendering OpenGL graphics on large displays using VirtualGL and VNC, and demonstrates how performance can be significantly improved when rendering on a tiled monitor wall. We present a CUDA-enhanced version of VirtualGL as well as the advantages of having multiple VNC servers, and discuss restrictions caused by read-back and blitting rates and how they are affected by different sizes of virtual displays being rendered.
International Nuclear Information System (INIS)
Badal, Andreu; Badano, Aldo
2009-01-01
Purpose: It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
Energy Technology Data Exchange (ETDEWEB)
Badal, Andreu; Badano, Aldo [Division of Imaging and Applied Mathematics, OSEL, CDRH, U.S. Food and Drug Administration, Silver Spring, Maryland 20993-0002 (United States)
2009-11-15
Purpose: It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
Badal, Andreu; Badano, Aldo
2009-11-01
It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: the use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA™ programming model (NVIDIA Corporation, Santa Clara, CA). An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed-up factor was obtained using a GPU compared to a single-core CPU. The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
ALEX LAIER BORDIGNON
2006-01-01
In this work, we show how to simulate a fluid in two dimensions in a domain with arbitrary boundaries. Our work is based on the stable fluids scheme developed by Jos Stam. The implementation is done on the GPU (Graphics Processing Unit), allowing interactive speeds for manipulating the fluid. We make use of the Cg (C for Graphics) language, developed by NVIDIA. Our main contributions are the treatment of multiple boundaries, the…
GPU in Physics Computation: Case Geant4 Navigation
Seiskari, Otto; Niemi, Tapio
2012-01-01
General-purpose computing on graphics processing units (GPU) is a potential method of speeding up scientific computation with low cost and high energy efficiency. We experimented with the particle physics simulation toolkit Geant4, used at CERN, to benchmark its geometry navigation functionality on a GPU. The goal was to find out whether Geant4 physics simulations could benefit from GPU acceleration and how difficult it is to modify Geant4 code to run on a GPU. We ported selected parts of Geant4 code to C99 & CUDA and implemented a simple gamma physics simulation utilizing this code to measure efficiency. The performance of the program was tested by running it on two different platforms: an NVIDIA GeForce GTX 470 GPU and a 12-core AMD CPU system. Our conclusion was that GPUs can be a competitive alternative to multi-core computers, but porting existing software in an efficient way is challenging.
GPU-based high-performance computing for radiation therapy
International Nuclear Information System (INIS)
Jia, Xun; Jiang, Steve B; Ziegenhein, Peter
2014-01-01
Recent developments in radiation therapy demand high computational power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid development, and a tremendous amount of study has been conducted in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we first give a brief introduction to the GPU hardware structure and programming model. We then review the current applications of the GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of the GPU with other platforms is also presented. (topical review)
Ren, Shanshan; Bertels, Koen; Al-Ars, Zaid
2018-01-01
GATK HaplotypeCaller (HC) is a popular variant caller, widely used to identify variants in complex genomes. However, owing to its high variant-detection accuracy, it suffers from long execution times. In GATK HC, the pair-HMMs forward algorithm accounts for a large percentage of the total execution time. This article proposes to accelerate the pair-HMMs forward algorithm on graphics processing units (GPUs) to improve the performance of GATK HC. It presents several GPU-based implementations of the pair-HMMs forward algorithm and analyzes their performance bottlenecks on an NVIDIA Tesla K40 card with various data sets. Based on these results and the characteristics of GATK HC, we are able to identify the GPU-based implementations with the highest performance for the various analyzed data sets. Experimental results show that the GPU-based implementations of the pair-HMMs forward algorithm achieve a speedup of up to 5.47× over existing GPU-based implementations.
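As a hedged sketch of the simplest GPU mapping — one thread per read/haplotype pair rolling the forward recurrence over two rows — the kernel below uses a heavily simplified single-state recurrence with fixed probabilities in place of GATK's three-state model with per-base qualities (every name and constant is an assumption):

```cuda
#include <cuda_runtime.h>

#define MAX_HAP 128  // illustrative cap on haplotype length

// One thread evaluates one read/haplotype pair; the DP is rolled over
// two rows kept in thread-local memory. Free start/end positions in
// the haplotype mimic the pair-HMM's global-local alignment style.
__global__ void pairHmmForward(const char* reads, const int* readLen,
                               const char* haps, const int* hapLen,
                               int nPairs, int stride, float* likelihood)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPairs) return;
    const char* r = reads + p * stride;
    const char* h = haps + p * stride;
    int rl = readLen[p], hl = hapLen[p];
    if (hl > MAX_HAP) return;  // sketch only handles bounded haplotypes

    const float pMatch = 0.9f, pMiss = 0.1f / 3.f, pGap = 0.1f;
    float prev[MAX_HAP + 1], cur[MAX_HAP + 1];
    for (int j = 0; j <= hl; ++j) prev[j] = 1.f / hl;  // free start

    for (int i = 1; i <= rl; ++i) {
        cur[0] = 0.f;
        for (int j = 1; j <= hl; ++j) {
            float emit = (r[i - 1] == h[j - 1]) ? pMatch : pMiss;
            cur[j] = emit * prev[j - 1] + pGap * (prev[j] + cur[j - 1]);
        }
        for (int j = 0; j <= hl; ++j) prev[j] = cur[j];
    }
    float sum = 0.f;
    for (int j = 1; j <= hl; ++j) sum += prev[j];  // free end
    likelihood[p] = sum;
}
```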
Identification of Learning Processes by Means of Computer Graphics.
Sorensen, Birgitte Holm
1993-01-01
Describes a development project for the use of computer graphics and video in connection with an inservice training course for primary education teachers in Denmark. Topics addressed include research approaches to computers; computer graphics in learning processes; activities relating to computer graphics; the role of the teacher; and student…
International Nuclear Information System (INIS)
Kudela, Henryk; Kosior, Andrzej
2014-01-01
Understanding the dynamics and the mutual interaction among various types of vortical motions is a key ingredient in clarifying and controlling fluid motion. In the paper several different cases related to vortex tube interactions are presented. Due to the very long computation times on a single processor, the vortex-in-cell (VIC) method is implemented on the multicore architecture of a graphics processing unit (GPU). Numerical results for the leapfrogging of two vortex rings in inviscid and viscous fluid are presented as test cases for the new multi-GPU implementation of the VIC method. The influence of the Reynolds number on the reconnection process is shown for two examples: antiparallel vortex tubes and orthogonally offset vortex tubes. Our aim is to show the great potential of the VIC method for the solution of three-dimensional flow problems and that the VIC method is very well suited to parallel computation. (paper)
GRAPHICAL MODELS OF THE AIRCRAFT MAINTENANCE PROCESS
Directory of Open Access Journals (Sweden)
Stanislav Vladimirovich Daletskiy
2017-01-01
Full Text Available Aircraft maintenance is realized as a rapid sequence of organizational and technical maintenance states, whose research and analysis are carried out by statistical methods. The maintenance process comprises aircraft technical states, connected with the objective patterns of change in the technical qualities of the aircraft as a maintenance object, and organizational states, which determine the subjective organization and planning of aircraft use. The objective maintenance process is realized in the Maintenance and Repair System, which does not include maintenance organization and planning and is a set of related elements: aircraft, Maintenance and Repair measures, executors, and documentation that sets the rules of their interaction for maintaining aircraft reliability and readiness for flight. The aircraft organizational and technical states are considered; their characteristics, and heuristic estimates of the connections in the nodes and arcs of the graphs of aircraft organizational states during regular maintenance and at technical state failure, are given. It is shown that in real conditions of aircraft maintenance, planned control of the aircraft technical state, and maintenance control through it, is defined only by the Maintenance and Repair conditions for a given Maintenance and Repair type and form structure, and correspondingly by the principles of assigning Maintenance and Repair work types for execution, that is, by the maintenance and reconstruction strategies for the aircraft and all its units. The realization of the planned Maintenance and Repair process constitutes one of the constant maintenance components. The proposed graphical models allow quantitative correlations between graph nodes to be revealed and maintenance processes to be improved by statistical research methods, which reduces manning, timetable, and expenses for providing safe civil aviation aircraft maintenance.
The Metropolis Monte Carlo method with CUDA enabled Graphic Processing Units
Energy Technology Data Exchange (ETDEWEB)
Hall, Clifford [Computational Materials Science Center, George Mason University, 4400 University Dr., Fairfax, VA 22030 (United States); School of Physics, Astronomy, and Computational Sciences, George Mason University, 4400 University Dr., Fairfax, VA 22030 (United States); Ji, Weixiao [Computational Materials Science Center, George Mason University, 4400 University Dr., Fairfax, VA 22030 (United States); Blaisten-Barojas, Estela, E-mail: blaisten@gmu.edu [Computational Materials Science Center, George Mason University, 4400 University Dr., Fairfax, VA 22030 (United States); School of Physics, Astronomy, and Computational Sciences, George Mason University, 4400 University Dr., Fairfax, VA 22030 (United States)
2014-02-01
We present a CPU–GPU system for runtime acceleration of large molecular simulations using GPU computation and memory swaps. The memory architecture of the GPU can be used both as container for simulation data stored on the graphics card and as floating-point code target, providing an effective means for the manipulation of atomistic or molecular data on the GPU. To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform atomistic and molecular simulations are essential. Our system implements a versatile molecular engine, including inter-molecule interactions and orientational variables for performing the Metropolis Monte Carlo (MMC) algorithm, which is one type of Markov chain Monte Carlo. By combining memory objects with floating-point code fragments we have implemented an MMC parallel engine that entirely avoids the communication time of molecular data at runtime. Our runtime acceleration system is a forerunner of a new class of CPU–GPU algorithms exploiting memory concepts combined with threading for avoiding bus bandwidth and communication. The testbed molecular system used here is a condensed phase system of oligopyrrole chains. A benchmark shows a size scaling speedup of 60 for systems with 210,000 pyrrole monomers. Our implementation can easily be combined with MPI to connect in parallel several CPU–GPU duets. -- Highlights: •We parallelize the Metropolis Monte Carlo (MMC) algorithm on one CPU–GPU duet. •The Adaptive Tempering Monte Carlo employs MMC and profits from this CPU–GPU implementation. •Our benchmark shows a size scaling-up speedup of 62 for systems with 225,000 particles. •The testbed involves a polymeric system of oligopyrroles in the condensed phase. •The CPU–GPU parallelization includes dipole–dipole and Mie–Jones classic potentials.
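As a minimal illustration of the Metropolis Monte Carlo kernel structure (not the authors' molecular engine, which handles inter-molecular interactions and orientational variables), the following CUDA sketch runs independent Metropolis walkers in a one-dimensional harmonic potential, one per thread; states are assumed to be cuRAND states initialized elsewhere.

    #include <curand_kernel.h>

    // Hedged sketch: independent Metropolis walkers, potential U(x) = x^2/2.
    __global__ void metropolisStep(float* x, curandState* st,
                                   float beta, int nSteps)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        curandState s = st[i];
        float xi = x[i];
        for (int k = 0; k < nSteps; ++k) {
            float trial = xi + 0.5f * (curand_uniform(&s) - 0.5f); // proposal
            float dE = 0.5f * (trial * trial - xi * xi);
            if (dE <= 0.f || curand_uniform(&s) < expf(-beta * dE))
                xi = trial;                                        // accept
        }
        x[i]  = xi;
        st[i] = s;
    }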
GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform
Directory of Open Access Journals (Sweden)
Ronglin Jiang
2014-01-01
Full Text Available This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations, with parallelization via the Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA Tesla K20m GPUs and 64 Intel Xeon E5-2670 CPUs are used to carry out the pure-CPU, pure-GPU, and CPU + GPU tests. Relative to the pure-CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure-GPU calculations for the same problems, the CPU + GPU calculations show a 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very limited. However, this code can enlarge the maximum problem size by 25% without reducing the performance of a traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with those of MoM. Results show good agreement between them.
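The heart of any FDTD code is a stencil update, which maps directly to one GPU thread per cell. Below is a minimal CUDA sketch of a 2D TMz electric-field update; it is our illustration, not the paper's Fortran/CUDA code, and coefficient handling and boundary conditions are omitted.

    // Minimal sketch: Ez updated from the discrete curl of H, one thread
    // per grid cell; cezh folds dt and material constants together.
    __global__ void updateEz(float* Ez, const float* Hx, const float* Hy,
                             int nx, int ny, float cezh)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || j <= 0 || i >= nx || j >= ny) return;
        int id = j * nx + i;
        Ez[id] += cezh * ((Hy[id] - Hy[id - 1]) - (Hx[id] - Hx[id - nx]));
    }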
Survey of using GPU CUDA programming model in medical image analysis
Directory of Open Access Journals (Sweden)
T. Kalaiselvi
2017-01-01
Full Text Available With the technological development of the medical industry, the data to be processed is expanding rapidly and computation times are increasing due to many factors, such as 3D and 4D treatment planning, the increasing sophistication of MRI pulse sequences, and the growing complexity of algorithms. The graphics processing unit (GPU) addresses these problems with features such as high computational throughput, high memory bandwidth, support for floating-point arithmetic, and low cost. Compute unified device architecture (CUDA) is a popular GPU programming model introduced by NVIDIA for parallel computing. This review paper briefly discusses the need for GPU CUDA computing in medical image analysis. The GPU performance of existing algorithms is analyzed and the computational gains are discussed. A few open issues, hardware configurations, and optimization principles of existing methods are discussed. The survey summarizes optimization techniques applied to medical imaging algorithms on the GPU. Finally, limitations and the future scope of GPU programming are discussed.
Personal Supercomputing for Monte Carlo Simulation Using a GPU
Energy Technology Data Exchange (ETDEWEB)
Oh, Jae-Yong; Koo, Yang-Hyun; Lee, Byung-Ho [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)
2008-05-15
Since the usability, accessibility, and maintainability of a personal computer (PC) are very good, a PC is a useful computer simulation tool for researchers. With the improved performance of a PC's CPU, it has enough calculation power to simulate small-scale systems. However, for a large system or long time scales, a cluster computer or supercomputer is needed. Recently, great changes have occurred in the PC calculation environment. A graphics processing unit (GPU) on a graphics card, formerly used only to calculate display data, has a calculation capability superior to a PC's CPU; its performance matches that of a supercomputer of the year 2000. Although it has such great calculation potential, it is not easy to program a simulation code for the GPU, owing to the difficult programming techniques required to convert a calculation matrix into a 3D rendering image using graphics APIs. In 2006, NVIDIA provided a Software Development Kit (SDK) as the programming environment for NVIDIA graphics cards, called the Compute Unified Device Architecture (CUDA). It makes programming on the GPU easy without knowledge of the graphics APIs. This paper describes the basic architectures of NVIDIA's GPU and CUDA, and carries out a performance benchmark for Monte Carlo simulation.
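As a minimal example of the kind of Monte Carlo benchmark such a setup enables (our toy illustration, not the paper's benchmark code), the following CUDA kernel estimates π with cuRAND, one sample stream per thread.

    #include <curand_kernel.h>

    // Toy Monte Carlo: count samples falling inside the unit quarter circle.
    __global__ void mcPi(unsigned long long* hits, int samplesPerThread,
                         unsigned long long seed)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState s;
        curand_init(seed, tid, 0, &s);           // one RNG stream per thread
        unsigned long long in = 0;
        for (int k = 0; k < samplesPerThread; ++k) {
            float x = curand_uniform(&s), y = curand_uniform(&s);
            if (x * x + y * y <= 1.0f) ++in;
        }
        atomicAdd(hits, in);                     // merge per-thread counts
    }

On the host, pi is then estimated as 4 * hits / (threads * samplesPerThread).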
Quick plasma equilibrium reconstruction based on GPU
International Nuclear Information System (INIS)
Xiao Bingjia; Huang, Y.; Luo, Z.P.; Yuan, Q.P.; Lao, L.
2014-01-01
A parallel code named P-EFIT, which can complete an equilibrium reconstruction iteration in 250 μs, is described. It is built on the CUDA™ architecture using the graphics processing unit (GPU). The optimization of medium-scale matrix multiplication on the GPU is described, together with an algorithm that solves block tri-diagonal linear systems efficiently in parallel. Benchmark tests were conducted: static tests prove the accuracy of P-EFIT, and simulation tests prove the feasibility of using P-EFIT for real-time reconstruction on 65×65 computation grids. (author)
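The paper's block tri-diagonal solver is not reproduced in the abstract; as a simpler illustration of solving many independent tridiagonal systems in parallel, the CUDA sketch below assigns one thread per system and applies the classical Thomas algorithm. The scalar (non-block) form and all names are our assumptions.

    // Hedged sketch: nsys independent tridiagonal systems of size n, stored
    // contiguously; a, b, c are the sub-, main and super-diagonals, d the
    // right-hand side (overwritten), cp a scratch array, x the solution.
    __global__ void thomasBatch(const float* a, const float* b, const float* c,
                                float* d, float* x, float* cp, int n, int nsys)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nsys) return;
        int o = s * n;                       // offset of this system
        cp[o] = c[o] / b[o];
        d[o]  = d[o] / b[o];
        for (int i = 1; i < n; ++i) {        // forward sweep
            float m = 1.0f / (b[o+i] - a[o+i] * cp[o+i-1]);
            cp[o+i] = c[o+i] * m;
            d[o+i]  = (d[o+i] - a[o+i] * d[o+i-1]) * m;
        }
        x[o+n-1] = d[o+n-1];
        for (int i = n - 2; i >= 0; --i)     // back substitution
            x[o+i] = d[o+i] - cp[o+i] * x[o+i+1];
    }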
GPU Pro 5 advanced rendering techniques
Engel, Wolfgang
2014-01-01
In GPU Pro5: Advanced Rendering Techniques, section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Marius Bjorge have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book covers rendering, lighting, effects in image space, mobile devices, 3D engine design, and compute. It explores rasterization of liquids, ray tracing of art assets that would otherwise be used in a rasterized engine, physically based area lights, volumetric light
Processing-in-Memory Enabled Graphics Processors for 3D Rendering
Energy Technology Data Exchange (ETDEWEB)
Xie, Chenhao; Song, Shuaiwen; Wang, Jing; Zhang, Weigong; Fu, Xin
2017-02-06
The performance of 3D rendering on the graphics processing unit (GPU), which converts 3D vector streams into 2D frames with 3D image effects, significantly impacts users' gaming experience on modern computer systems. Due to the high texture throughput in 3D rendering, main memory bandwidth becomes a critical obstacle to improving overall rendering performance. 3D-stacked memory systems such as the Hybrid Memory Cube (HMC) provide opportunities to significantly overcome the memory wall by directly connecting logic controllers to DRAM dies. Based on the observation that texel fetches significantly impact off-chip memory traffic, we propose two architectural designs to enable processing-in-memory-based GPUs for efficient 3D rendering.
Hallock, Michael J; Stone, John E; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida
2014-05-01
Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long-time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load-balancing approach can be generalized to other lattice-based problems.
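The peer-to-peer transfers mentioned above rest on a small set of CUDA runtime calls. A hedged host-side sketch follows (device IDs, buffer names, and sizes are illustrative, not the authors' code): a halo region is copied directly from GPU 1 to GPU 0, bypassing host memory.

    #include <cuda_runtime.h>

    // haloSrc lives on GPU 1, haloDst on GPU 0 (names are illustrative).
    void exchangeHalo(float* haloDst, const float* haloSrc,
                      size_t haloBytes, cudaStream_t stream)
    {
        int canAccess = 0;
        cudaSetDevice(0);
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        if (canAccess)
            cudaDeviceEnablePeerAccess(1, 0);  // one-time mapping of GPU 1
        cudaMemcpyPeerAsync(haloDst, 0, haloSrc, 1, haloBytes, stream);
        cudaStreamSynchronize(stream);
    }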
International Nuclear Information System (INIS)
Bellezzo, Murillo
2014-01-01
As the most accurate method to estimate absorbed dose in radiotherapy, the Monte Carlo method (MCM) has been widely used in radiotherapy treatment planning. Nevertheless, its efficiency can be improved for clinical routine applications. In this thesis, the CUBMC code is presented, a GPU-based MC photon-transport algorithm for dose calculation under the Compute Unified Device Architecture (CUDA) platform. The simulation of physical events is based on the algorithm used in PENELOPE, and the cross-section table used is the one generated by the MATERIAL routine, also present in the PENELOPE code. Photons are transported in voxel-based geometries with different compositions. Two distinct approaches are used for transport simulation. The first forces the photon to stop at every voxel frontier; the second is the Woodcock method, in which the photon ignores the existence of borders and travels in a homogeneous fictitious medium. The CUBMC code aims to be an alternative Monte Carlo simulation code that, by using the parallel processing capability of graphics processing units (GPUs), provides high-performance simulations on low-cost compact machines, and thus can be applied in clinical cases and incorporated in treatment planning systems for radiotherapy. (author)
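A compact device-side sketch of the Woodcock (delta) tracking idea described above follows; it is our illustration, not CUBMC code. sigmaOf() is a placeholder for a voxelized cross-section lookup, and the majorant sigmaMax is assumed precomputed.

    #include <curand_kernel.h>

    __device__ float sigmaOf(float3 q);  // voxel lookup, defined elsewhere

    // Sample free paths against the majorant cross section; accept a real
    // collision with probability sigma(voxel)/sigmaMax, otherwise the
    // collision is virtual and the flight continues across voxel borders.
    __device__ float woodcockStep(float3& p, const float3 d,
                                  float sigmaMax, curandState* s)
    {
        float t = 0.0f;
        while (true) {
            t += -logf(curand_uniform(s)) / sigmaMax;   // tentative flight
            float3 q = make_float3(p.x + t*d.x, p.y + t*d.y, p.z + t*d.z);
            if (curand_uniform(s) * sigmaMax < sigmaOf(q)) {
                p = q;                                   // real collision
                return t;
            }
        }
    }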
International Nuclear Information System (INIS)
Dudnik, V.A.; Kudryavtsev, V.I.; Sereda, T.M.; Us, S.A.; Shestakov, M.V.
2009-01-01
The new opportunities offered by modern graphics processors (GPUs) for accelerating scientific and technical calculations, by dividing a computational task between the central processor and the GPU, are described. The use of NVIDIA CUDA technology to exploit the parallel computing capabilities of the GPU for several computationally intensive mathematical tasks is described. Examples are presented comparing the performance achieved on these tasks without the GPU and using NVIDIA CUDA on a GeForce 8800 graphics processor.
Graphics processor efficiency for realization of rapid tabular computations
International Nuclear Information System (INIS)
Dudnik, V.A.; Kudryavtsev, V.I.; Us, S.A.; Shestakov, M.V.
2016-01-01
The capabilities of graphics processing units (GPUs) and central processing units (CPUs) have been investigated for the realization of fast-calculation algorithms employing tabulated functions. The realization of tabulated functions is exemplified on GPU/CPU architecture-based processors. A comparison is made between the operating efficiencies of the GPU and CPU when employed for tabular calculations under different conditions of use. Recommendations are formulated for using graphics and central processors to speed up scientific and engineering computations through the use of tabulated functions.
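A minimal CUDA sketch of a tabulated-function evaluation with linear interpolation follows (our assumption of the kernel form, not the authors' code); binding the table to a texture would be the natural hardware-assisted alternative.

    // Evaluate y = f(x) from a uniformly spaced table: table[k] holds
    // f(x0 + k*dx); values outside the table are clamped to its ends.
    __global__ void evalTable(const float* __restrict__ table, int tableLen,
                              float x0, float dx,
                              const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float u = (x[i] - x0) / dx;                     // fractional index
        int k = min(max(__float2int_rd(u), 0), tableLen - 2);
        float f = u - k;
        y[i] = (1.0f - f) * table[k] + f * table[k + 1];
    }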
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
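The standard GPU decomposition for Metropolis updates of the Ising model is a checkerboard scheme, in which all sites of one sublattice are independent and can be updated concurrently. A hedged CUDA sketch (a textbook decomposition, not necessarily the kernel used in this contribution; RNG states are assumed initialized elsewhere):

    #include <curand_kernel.h>

    // One Metropolis half-sweep over the sublattice of given parity on an
    // L x L periodic lattice of +/-1 spins, at inverse temperature beta.
    __global__ void isingSweep(int* spin, int L, float beta, int parity,
                               curandState* states)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= L || j >= L || ((i + j) & 1) != parity) return;
        int id = j * L + i;
        int up = spin[((j + 1) % L) * L + i];
        int dn = spin[((j - 1 + L) % L) * L + i];
        int rt = spin[j * L + (i + 1) % L];
        int lf = spin[j * L + (i - 1 + L) % L];
        float dE = 2.0f * spin[id] * (up + dn + rt + lf);  // flip cost
        curandState s = states[id];
        if (dE <= 0.0f || curand_uniform(&s) < expf(-beta * dE))
            spin[id] = -spin[id];
        states[id] = s;
    }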
International Nuclear Information System (INIS)
Levine, Benjamin G.; Stone, John E.; Kohlmeyer, Axel
2011-01-01
The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
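The shared-memory atomic strategy described above can be illustrated with a short CUDA kernel; this is our simplified sketch, not the VMD implementation. Each block accumulates a block-private histogram in fast on-chip memory and merges it into the global histogram once at the end.

    #define NBINS 256

    // Histogram of pair distances between selections a (na atoms) and
    // b (nb atoms); hist must hold NBINS zero-initialized counters.
    __global__ void distHist(const float3* a, const float3* b, int na, int nb,
                             float binWidth, unsigned int* hist)
    {
        __shared__ unsigned int local[NBINS];
        for (int k = threadIdx.x; k < NBINS; k += blockDim.x) local[k] = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < na) {
            float3 p = a[i];
            for (int j = 0; j < nb; ++j) {
                float dx = p.x - b[j].x, dy = p.y - b[j].y, dz = p.z - b[j].z;
                int bin = (int)(sqrtf(dx*dx + dy*dy + dz*dz) / binWidth);
                if (bin < NBINS) atomicAdd(&local[bin], 1u);  // on-chip atomic
            }
        }
        __syncthreads();
        for (int k = threadIdx.x; k < NBINS; k += blockDim.x)
            atomicAdd(&hist[k], local[k]);                    // one global merge
    }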
GPU-accelerated computation of electron transfer.
Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco
2012-11-05
Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. Copyright © 2012 Wiley Periodicals, Inc.
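The abstract does not name the library; as a hedged illustration of the approach, offloading the rate-limiting matrix product to a GPU BLAS takes only a few lines with, for example, cuBLAS (one plausible choice; handle creation and device allocation are omitted).

    #include <cublas_v2.h>

    // C = A * B for n x n column-major matrices already resident on the GPU.
    void gemmOnGpu(cublasHandle_t h, const double* dA, const double* dB,
                   double* dC, int n)
    {
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
    }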
Overview of implementation of DARPA GPU program in SAIC
Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis
2008-04-01
This paper reviews the implementation of the DARPA MTO STAP-BOY program, for both Phase I and II, conducted at Science Applications International Corporation (SAIC). The STAP-BOY program develops fast covariance factorization and tuning techniques for space-time adaptive processing (STAP) algorithm implementation on graphics processing unit (GPU) architectures for embedded systems. The first part of our presentation on the DARPA STAP-BOY program focuses on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm is implemented on the GPU using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. The algorithm includes fast range-adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.
Anandakrishnan, Ramu; Scogland, Tom R W; Fenley, Andrew T; Gordon, John C; Feng, Wu-chun; Onufriev, Alexey V
2010-06-01
Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. Two commonly used techniques to speed-up these types of electrostatic computations are approximations based on multi-scale coarse-graining and parallelization across multiple processors. This paper demonstrates that for the computation of electrostatic surface potential, these two techniques can be combined to deliver significantly greater speed-up than either one separately, something that is in general not always possible. Specifically, the electrostatic potential computation, using an analytical linearized Poisson-Boltzmann (ALPB) method, is approximated using the hierarchical charge partitioning (HCP) multi-scale method, and parallelized on an ATI Radeon 4870 graphical processing unit (GPU). The implementation delivers a combined 934-fold speed-up for a 476,040 atom viral capsid, compared to an equivalent non-parallel implementation on an Intel E6550 CPU without the approximation. This speed-up is significantly greater than the 42-fold speed-up for the HCP approximation alone or the 182-fold speed-up for the GPU alone. Copyright (c) 2010 Elsevier Inc. All rights reserved.
Wang, Kun; Huang, Chao; Kao, Yu-Jiun; Chou, Cheng-Ying; Oraevsky, Alexander A; Anastasio, Mark A
2013-02-01
Optoacoustic tomography (OAT) is inherently a three-dimensional (3D) inverse problem. However, most studies of OAT image reconstruction still employ two-dimensional imaging models. One important reason is because 3D image reconstruction is computationally burdensome. The aim of this work is to accelerate existing image reconstruction algorithms for 3D OAT by use of parallel programming techniques. Parallelization strategies are proposed to accelerate a filtered backprojection (FBP) algorithm and two different pairs of projection/backprojection operations that correspond to two different numerical imaging models. The algorithms are designed to fully exploit the parallel computing power of graphics processing units (GPUs). In order to evaluate the parallelization strategies for the projection/backprojection pairs, an iterative image reconstruction algorithm is implemented. Computer simulation and experimental studies are conducted to investigate the computational efficiency and numerical accuracy of the developed algorithms. The GPU implementations improve the computational efficiency by factors of 1000, 125, and 250 for the FBP algorithm and the two pairs of projection/backprojection operators, respectively. Accurate images are reconstructed by use of the FBP and iterative image reconstruction algorithms from both computer-simulated and experimental data. Parallelization strategies for 3D OAT image reconstruction are proposed for the first time. These GPU-based implementations significantly reduce the computational time for 3D image reconstruction, complementing our earlier work on 3D OAT iterative image reconstruction.
Space Object Collision Probability via Monte Carlo on the Graphics Processing Unit
Vittaldev, Vivek; Russell, Ryan P.
2017-09-01
Fast and accurate collision probability computations are essential for protecting space assets. Monte Carlo (MC) simulation is the most accurate but computationally intensive method. A graphics processing unit (GPU) is used to parallelize the computation and reduce the overall runtime. Using MC techniques to compute the collision probability is common in the literature as the benchmark. An optimized implementation on the GPU, however, is a challenging problem and is the main focus of the current work. The MC simulation takes samples from the uncertainty distributions of the Resident Space Objects (RSOs) at any time during a time window of interest and outputs the separations at closest approach. Therefore, any uncertainty propagation method may be used, and the collision probability is automatically computed as a function of RSO collision radii. Integration using a fixed time step and a quartic interpolation after every Runge–Kutta step ensures that no close approaches are missed. Speedups of two orders of magnitude over a serial CPU implementation are shown, and the speedups improve moderately with higher-fidelity dynamics. The tool makes the MC approach tractable on a single workstation, and can be used as a final product or for verifying surrogate and analytical collision probability methods.
Area-delay trade-offs of texture decompressors for a graphics processing unit
Novoa Súñer, Emilio; Ituero, Pablo; López-Vallejo, Marisa
2011-05-01
Graphics Processing Units have become a booster for the microelectronics industry. However, due to intellectual property issues, there is a serious lack of information on the implementation details of the hardware architecture behind GPUs. For instance, the way texture is handled and decompressed in a GPU to reduce bandwidth usage has never been dealt with in depth from a hardware point of view. This work presents a comparative study of the hardware implementation of different texture decompression algorithms for both conventional (PCs and video game consoles) and mobile platforms. Circuit synthesis is performed targeting both a reconfigurable hardware platform and a 90 nm standard cell library. Area-delay trade-offs have been extensively analyzed, which allows us to compare the complexity of the decompressors and thus determine the suitability of the algorithms for systems with limited hardware resources.
How General-Purpose can a GPU be?
Directory of Open Access Journals (Sweden)
Philip Machanick
2015-12-01
Full Text Available The use of graphics processing units (GPUs) in general-purpose computation (GPGPU) is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw from a range of single-instruction multiple-datastream (SIMD) architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD instruction sets has been applicable to a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. This paper proposes a reconceptualization of the GPU as a multicore design with minimal exotic modes of parallelism so as to make GPGPU truly general.
Qin, Cheng-Zhi; Zhan, Lijun
2012-06-01
As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU
GPU computing and applications
See, Simon
2015-01-01
This book presents a collection of state-of-the-art research on GPU computing and applications. The major part of this book is selected from the work presented at the 2013 Symposium on GPU Computing and Applications held at Nanyang Technological University, Singapore (Oct 9, 2013). Three major domains of GPU application are covered in the book: (1) Engineering Design and Simulation; (2) Biomedical Sciences; and (3) Interactive & Digital Media. The book also addresses fundamental issues in GPU computing, with a focus on big data processing. Researchers and developers in GPU computing and applications will benefit from this book, as will training professionals and educators seeking possible applications of GPU technology in various areas.
Tokuda, Junichi; Plishker, William; Torabi, Meysam; Olubiyi, Olutayo I; Zaki, George; Tatli, Servet; Silverman, Stuart G; Shekher, Raj; Hata, Nobuhiko
2015-06-01
Accuracy and speed are essential for intraprocedural nonrigid magnetic resonance (MR) to computed tomography (CT) image registration in the assessment of tumor margins during CT-guided liver tumor ablations. Although both accuracy and speed can be improved by limiting the registration to a region of interest (ROI), manual contouring of the ROI prolongs the registration process substantially. To achieve accurate and fast registration without the use of an ROI, we combined a nonrigid registration technique based on volume subdivision with hardware acceleration using a graphics processing unit (GPU). We compared the registration accuracy and processing time of the GPU-accelerated volume subdivision-based nonrigid registration technique to the conventional nonrigid B-spline registration technique. Fourteen image data sets of preprocedural MR and intraprocedural CT images for percutaneous CT-guided liver tumor ablations were obtained. Each set of images was registered using the GPU-accelerated volume subdivision technique and the B-spline technique. Manual contouring of the ROI was used only for the B-spline technique. Registration accuracies (Dice similarity coefficient [DSC] and 95% Hausdorff distance [HD]) and total processing time, including contouring of ROIs and computation, were compared using a paired Student t test. Accuracies of the GPU-accelerated registrations and B-spline registrations, respectively, were 88.3 ± 3.7% versus 89.3 ± 4.9% (P = .41) for DSC and 13.1 ± 5.2 versus 11.4 ± 6.3 mm (P = .15) for HD. Total processing time of the GPU-accelerated registration and B-spline registration techniques was 88 ± 14 versus 557 ± 116 seconds, respectively; the GPU-accelerated technique thus achieved comparable accuracy in a significantly shorter processing time. The GPU-accelerated volume subdivision technique may enable the implementation of nonrigid registration into routine clinical practice. Copyright © 2015 AUR. Published by Elsevier Inc. All rights reserved.
Energy Technology Data Exchange (ETDEWEB)
Fierro, Andrew, E-mail: andrew.fierro@ttu.edu; Dickens, James; Neuber, Andreas [Center for Pulsed Power and Power Electronics, Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, Texas 79409 (United States)
2014-12-15
A 3-dimensional particle-in-cell/Monte Carlo collision simulation that is fully implemented on a graphics processing unit (GPU) is described and used to determine low-temperature plasma characteristics at high reduced electric field, E/n, in nitrogen gas. Details of the implementation on the GPU using the NVIDIA Compute Unified Device Architecture framework are discussed with respect to efficient code execution. The software is capable of tracking around 10 × 10⁶ particles with dynamic weighting and a total mesh size larger than 10⁸ cells. Verification of the simulation is performed by comparing the electron energy distribution function and plasma transport parameters to known Boltzmann equation (BE) solvers. Under the assumption of a uniform electric field and neglecting the build-up of positive ion space charge, the simulation agrees well with the BE solvers. The model is utilized to calculate plasma characteristics of a pulsed, parallel-plate discharge. A photoionization model provides the simulation with additional electrons after the initial seeded electron density has drifted towards the anode. Comparison of the performance benefits of the GPU implementation versus a CPU implementation is considered, and a speed-up factor of 13 for a 3D relaxation Poisson solver is obtained. Furthermore, a factor-60 speed-up is realized for parallelization of the electron processes.
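As an illustration of the kind of 3D relaxation Poisson solver for which the 13× speed-up is quoted (our generic Jacobi sketch in normalized units, not the authors' code), one relaxation step is performed per thread and grid node:

    // One Jacobi step for the discretized Poisson equation lap(phi) = -rho:
    // each interior node becomes the average of its six neighbours plus the
    // charge term; h is the mesh spacing.
    __global__ void jacobiStep(const float* phi, float* phiNew,
                               const float* rho,
                               int nx, int ny, int nz, float h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i < 1 || j < 1 || k < 1 ||
            i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;
        int id = (k * ny + j) * nx + i;
        phiNew[id] = (phi[id - 1]       + phi[id + 1] +
                      phi[id - nx]      + phi[id + nx] +
                      phi[id - nx * ny] + phi[id + nx * ny] +
                      h * h * rho[id]) / 6.0f;
    }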
A Relational Reasoning Approach to Text-Graphic Processing
Danielson, Robert W.; Sinatra, Gale M.
2017-01-01
We propose that research on text-graphic processing could be strengthened by the inclusion of relational reasoning perspectives. We briefly outline four aspects of relational reasoning: "analogies," "anomalies," "antinomies", and "antitheses". Next, we illustrate how text-graphic researchers have been…
GPU-accelerated micromagnetic simulations using cloud computing
International Nuclear Information System (INIS)
Jermain, C.L.; Rowlands, G.E.; Buhrman, R.A.; Ralph, D.C.
2016-01-01
Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU.
Arefan, D; Talebpour, A; Ahmadinejhad, N; Kamali Asl, A
2015-06-01
Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is on the order of seconds, tomosynthesis systems can be used to perform tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for tomosynthesis mammography systems using the graphics processing unit (GPU). First, tomosynthesis mammography projections were simulated: a 3D breast phantom was designed from empirical data, based on MRI data in its natural form, and projections were created from this phantom. The FBP-based image reconstruction algorithm was then programmed in C++ in two versions, one using the central processing unit (CPU) and one using the graphics processing unit (GPU), and the image reconstruction time was measured for both implementations.
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms
Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel
2016-04-01
Advances in graphics processing unit technology towards encompassing parallel architectures [1], comprising thousands of cores and multiples of parallel threads, provide the hardware foundation for the rapid processing of various parallel applications in seismic big-data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are established and decades of data are compiled together [2]. Yet many processes in seismic data analysis are performed on each seismic event independently, or on distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel for comparison, are: fuzzy k-means clustering with expert knowledge [7] in assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
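As a minimal illustration of the Cuda C style of parallel clustering described above (our sketch, simplified from the fuzzy k-means the authors describe to a plain k-means assignment step; all names are ours), each thread assigns one seismic event to its nearest centroid:

    // events: n feature vectors of length dim; centroids: k vectors of
    // length dim; label[i] receives the index of the nearest centroid.
    __global__ void assignClusters(const float* events, const float* centroids,
                                   int* label, int n, int k, int dim)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int best = 0;
        float bestDist = 1e30f;
        for (int c = 0; c < k; ++c) {
            float d = 0.0f;
            for (int f = 0; f < dim; ++f) {
                float diff = events[i * dim + f] - centroids[c * dim + f];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        label[i] = best;
    }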
Energy Technology Data Exchange (ETDEWEB)
Maurer, S. A.; Kussmann, J.; Ochsenfeld, C., E-mail: Christian.Ochsenfeld@cup.uni-muenchen.de [Chair of Theoretical Chemistry, Department of Chemistry, University of Munich (LMU), Butenandtstr. 7, D-81377 München (Germany); Center for Integrated Protein Science (CIPSM) at the Department of Chemistry, University of Munich (LMU), Butenandtstr. 5–13, D-81377 München (Germany)
2014-08-07
We present a low-prefactor, cubically scaling scaled-opposite-spin second-order Møller-Plesset perturbation theory (SOS-MP2) method which is highly suitable for massively parallel architectures like graphics processing units (GPU). The scaling is reduced from O(N⁵) to O(N³) by a reformulation of the MP2-expression in the atomic orbital basis via Laplace transformation and the resolution-of-the-identity (RI) approximation of the integrals in combination with efficient sparse algebra for the 3-center integral transformation. In contrast to previous works that employ GPUs for post Hartree-Fock calculations, we do not simply employ GPU-based linear algebra libraries to accelerate the conventional algorithm. Instead, our reformulation allows to replace the rate-determining contraction step with a modified J-engine algorithm, that has been proven to be highly efficient on GPUs. Thus, our SOS-MP2 scheme enables us to treat large molecular systems in an accurate and efficient manner on a single GPU-server.
Advantages of GPU technology in DFT calculations of intercalated graphene
Pešić, J.; Gajić, R.
2014-09-01
Over the past few years, the expansion of general-purpose graphics-processing-unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). The use of GPGPUs to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain (FDTD) algorithm and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged over the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nano-scale, based on DFT, a plane-wave basis and a pseudopotential approach. Since QE version 5.0, GPU support has been implemented as a plug-in component for the standard QE packages, allowing the capabilities of NVIDIA GPU graphics cards to be exploited (www.qe-forge.org/gf/proj). In this study, we have examined the impact of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene using the QE package PHonon, which employs the GPU. The term 'intercalation' refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research
Energy Technology Data Exchange (ETDEWEB)
Jia, X; Folkerts, M [The University of Texas Southwestern Medical Ctr, Dallas, TX (United States); Shi, F; Yan, H; Yan, Y; Jiang, S [UT Southwestern Medical Center, Dallas, TX (United States); Sivagnanam, S; Majumdar, A [University of California San Diego, La Jolla, CA (United States)
2014-06-01
Purpose: The Graphics Processing Unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, repeated effort is often spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross-comparing them, and facilitating code distribution in the radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code-benchmarking purposes. After the code run is complete, the system automatically summarizes and displays the computing results. We also released an SDK to allow developers to build their own algorithm implementations and submit their binary codes to the system. A submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. Developers can also compare their codes with others and generate benchmarking reports. Results: The developed system is fully functional. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing clinical researchers and developers to access GPUs and GPU codes. This development will facilitate the utilization of GPUs in the radiation therapy field.
An efficient spectral crystal plasticity solver for GPU architectures
Malahe, Michael
2018-03-01
We present a spectral crystal plasticity (CP) solver for graphics processing unit (GPU) architectures that achieves a tenfold increase in efficiency over prior GPU solvers. The approach makes use of a database containing a spectral decomposition of CP simulations performed using a conventional iterative solver over a parameter space of crystal orientations and applied velocity gradients. The key improvements in efficiency come from reducing global memory transactions, exposing more instruction-level parallelism, reducing integer instructions and performing fast range reductions on trigonometric arguments. The scheme also makes more efficient use of memory than prior work, allowing for larger problems to be solved on a single GPU. We illustrate these improvements with a simulation of 390 million crystal grains on a consumer-grade GPU, which executes at a rate of 2.72 s per strain step.
Iterative Methods for MPC on Graphical Processing Units
DEFF Research Database (Denmark)
Gade-Nielsen, Nicolai Fog; Jørgensen, John Bagterp; Dammann, Bernd
2012-01-01
The high floating point performance and memory bandwidth of Graphical Processing Units (GPUs) makes them ideal for a large number of computations which often arise in scientific computing, such as matrix operations. GPUs achieve this performance by utilizing massive parallelism, which requires ree… as to avoid the use of dense matrices, which may be too large for the limited memory capacity of current graphics cards.
Computation of large covariance matrices by SAMMY on graphical processing units and multicore CPUs
Energy Technology Data Exchange (ETDEWEB)
Arbanas, G.; Dunn, M.E.; Wiarda, D., E-mail: arbanasg@ornl.gov, E-mail: dunnme@ornl.gov, E-mail: wiardada@ornl.gov [Oak Ridge National Laboratory, Oak Ridge, TN (United States)
2011-07-01
Computational power of Graphical Processing Units and multicore CPUs was harnessed by the nuclear data evaluation code SAMMY to speed up computations of large Resonance Parameter Covariance Matrices (RPCMs). This was accomplished by linking SAMMY to vendor-optimized implementations of the matrix-matrix multiplication subroutine of the Basic Linear Algebra Library to compute the most time-consuming step. The ²³⁵U RPCM computed previously using a triple-nested loop was re-computed using the NVIDIA implementation of the subroutine on a single Tesla Fermi Graphical Processing Unit, and also using the Intel's Math Kernel Library implementation on two different multicore CPU systems. A multiplication of two matrices of dimensions 16,000×20,000 that had previously taken days, took approximately one minute on the GPU. Comparable performance was achieved on a dual six-core CPU system. The magnitude of the speed-up suggests that these, or similar, combinations of hardware and libraries may be useful for large matrix operations in SAMMY. Uniform interfaces of standard linear algebra libraries make them a promising candidate for a programming framework of a new generation of SAMMY for the emerging heterogeneous computing platforms. (author)
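As an illustration of the approach (not the SAMMY source itself), delegating such a product to a vendor-optimized GPU BLAS reduces to a single GEMM call; the cuBLAS sketch below uses illustrative names and assumes the three matrices already reside in device memory, stored column-major.

    #include <cublas_v2.h>

    // C(m x n) = A(m x k) * B(k x n), e.g. m = 16000, k = n = 20000.
    void multiply_on_gpu(const double* dA, const double* dB, double* dC,
                         int m, int k, int n)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
        cublasDestroy(handle);
    }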
GPU-Boosted Camera-Only Indoor Localization
DEFF Research Database (Denmark)
Özkil, Ali Gürcan; Fan, Zhun; Kristensen, Jens Klæstrup
… relies on local image feature detection, description, and matching; by parallelizing these computationally intensive tasks on the graphical processing unit (GPU), it is possible to do online localization using a Topometric Appearance Map. The method is developed as an integral part of a mobile service …
Parallel generation of architecture on the GPU
Steinberger, Markus
2014-05-01
In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity.
GPU Accelerated Vector Median Filter
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which the color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n² vectors has to be compared with the other n² − 1 vectors in terms of distance. General-purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which to the best of our knowledge has never been done before. The performance of the GPU-accelerated vector median filter is compared to that of CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x performance improvement of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimization of the GPU algorithm.
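To make the distance-sum formulation concrete, the CUDA kernel below is a minimal 3 x 3 vector median filter of our own construction, not the authors' optimized kernel: each thread gathers its window and keeps the vector with the smallest summed distance to the other eight.

    __global__ void vectorMedian3x3(const uchar3* in, uchar3* out,
                                    int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

        uchar3 v[9];                              // gather the 3x3 window
        int k = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                v[k++] = in[(y + dy) * width + (x + dx)];

        float best = 1e30f;                       // n^2 x (n^2 - 1) distances
        int bestIdx = 0;
        for (int i = 0; i < 9; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < 9; ++j) {
                float dr = (float)v[i].x - v[j].x;
                float dg = (float)v[i].y - v[j].y;
                float db = (float)v[i].z - v[j].z;
                sum += sqrtf(dr * dr + dg * dg + db * db);
            }
            if (sum < best) { best = sum; bestIdx = i; }
        }
        out[y * width + x] = v[bestIdx];
    }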
A Parallel Algebraic Multigrid Solver on Graphics Processing Units
Haase, Gundolf; Liebmann, Manfred; Douglas, Craig C.; Plank, Gernot
2010-01-01
The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen-node Infiniband cluster and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core.
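The workhorse of such a solver is the sparse matrix-vector product; the kernel below is a simplified one-thread-per-row CSR sketch of ours, not the paper's tuned many-core scheme.

    // y = A * x for a matrix stored in compressed sparse row (CSR) format.
    __global__ void spmv_csr(int nRows, const int* rowPtr, const int* colIdx,
                             const float* vals, const float* x, float* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += vals[j] * x[colIdx[j]];
        y[row] = sum;
    }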
GPUmotif: an ultra-fast and energy-efficient motif analysis program using graphics processing units.
Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui
2012-01-01
Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a "fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/
MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL
Directory of Open Access Journals (Sweden)
Guan-Jie Hua
2017-10-01
A phylogenetic tree is a visual diagram of the relationships among a set of biological species, which scientists use to analyze many characteristics of the species. Distance-matrix methods, such as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa, but they suffer from poor computational performance. Although several new methods built on high-performance hardware and frameworks have been proposed, the issue remains. In this work, a novel parallel UPGMA approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from an extremely large set of sequences. The experimental results show that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphics cards achieves approximately 3-fold to 7-fold speedups over UPGMA implementations on a modern CPU and on a single GPU, respectively.
Parallel implementation of DNA sequences matching algorithms using PWM on GPU architecture.
Sharma, Rahul; Gupta, Nitin; Narang, Vipin; Mittal, Ankush
2011-01-01
Positional Weight Matrices (PWMs) are widely used in the representation and detection of Transcription Factor Binding Sites (TFBSs) on DNA. We implement an online PWM search algorithm on a parallel architecture. Large PWM data sets can be processed in parallel on Graphics Processing Unit (GPU) systems, which helps match sequences at a faster rate. Our method makes extensive use of the highly multithreaded architecture and shared memory of multi-core GPUs. Efficient use of shared memory is required to optimize parallel reduction in CUDA. Our optimized method achieves a speedup of 230-280x over a serial implementation, on an NVIDIA GeForce GTX 280 GPU.
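As a concrete picture of the shared-memory reduction the authors refer to, the kernel below is the generic textbook pattern (not the paper's code): each block sums its slice of the input in on-chip shared memory.

    __global__ void reduce_sum(const float* in, float* blockSums, int n)
    {
        extern __shared__ float sdata[];           // one float per thread
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction with sequential addressing, which avoids
        // shared-memory bank conflicts (blockDim.x must be a power of two).
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = sdata[0];
    }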
Application of GPU to Multi-interfaces Advection and Reconstruction Solver (MARS)
International Nuclear Information System (INIS)
Nagatake, Taku; Takase, Kazuyuki; Kunugi, Tomoaki
2010-01-01
In the nuclear engineering field, high-performance computer systems are necessary to perform large-scale computations. Recently, the Graphics Processing Unit (GPU) has been developed as a rendering computational system in order to reduce the Central Processing Unit (CPU) load. In graphics processing, high-performance computing is needed to render high-quality 3D objects in video games, so the GPU consists of many processing units and has a wide memory bandwidth. In this study, the Multi-interfaces Advection and Reconstruction Solver (MARS), one of the interface volume tracking methods for multi-phase flows, has been implemented on a GPU. Multi-phase flow computation is very important for nuclear reactors and other engineering fields. MARS consists of two computing parts: the interface tracking part and the fluid motion computing part. For the interface tracking part, the performance of the GPU (GTX280) was 6 times faster than that of the CPU (Dual-Xeon 5040), and in the fluid motion computing part the Poisson solver on the GPU (GTX285) was 22 times faster than that on the CPU (Core i7). For the dam breaking problem, the result of GPU-MARS differed slightly from the experimental result. Because GPU-MARS was developed in single precision, round-off error might have accumulated. (author)
GRAPHICAL MODELS OF THE AIRCRAFT MAINTENANCE PROCESS
Stanislav Vladimirovich Daletskiy; Stanislav Stanislavovich Daletskiy
2017-01-01
Aircraft maintenance is realized as a rapid sequence of organizational and technical maintenance states, whose research and analysis are carried out by statistical methods. The maintenance process comprises aircraft technical states, connected with the objective patterns of change in the technical qualities of the aircraft as a maintenance object, and organizational states, which determine the subjective organization and planning of aircraft use. The objective maintenance process is …
CPU and GPU (CUDA) Template Matching Comparison
Directory of Open Access Journals (Sweden)
Evaldas Borcovas
2014-05-01
Image processing, computer vision and other complicated optical information processing algorithms require large resources, and it is often desired to execute such algorithms in real time. It is hard to fulfill these requirements with a single CPU. NVidia's CUDA technology enables the programmer to use the GPU resources in the computer. The current research was made with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB RAM DDR3 (CPU I), an NVidia GeForce GT320M CUDA-compatible graphics card (GPU I), an Intel Core i5-2500K 3.3 GHz processor with 4 GB RAM DDR3 (CPU II), and an NVidia GeForce GTX 560 CUDA-compatible graphics card (GPU II). Additional libraries, OpenCV 2.1 and the CUDA-compatible OpenCV 2.4.0, were used for the testing. The main tests were made with the standard MatchTemplate function from the OpenCV libraries. The algorithm uses a main image and a template; the influence of these factors was tested. The main image and template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the information obtained from the research, GPU computing using the hardware mentioned earlier is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and GPU is not significantly different. The choice of template size influences the CPU computation. The difference in computing time between the GPUs can be explained by the number of cores they have.
Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems
Directory of Open Access Journals (Sweden)
Yong Chen
2017-01-01
Static state security analysis (SSSA) is one of the most important computations for checking whether a power system is in a normal and secure operating state. Due to the intensive computations involved, it is a challenge to satisfy real-time requirements with CPU-based concurrent methods. A sensitivity analysis-based method using the Graphics Processing Unit (GPU) is proposed for power systems, which can reduce calculation time by 40% compared to execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, a multifrontal method for sparse LU factorization is explored on the GPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on the GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations can achieve a significant reduction in computation time.
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
Nagaoka, Tomoaki; Watanabe, Soichi
2010-01-01
Numerical simulations with numerical human models using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs slowly. We therefore focus on general-purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using the Compute Unified Device Architecture (CUDA). In this study, we used an NVIDIA Tesla C1060 as the GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with a conventional CPU, even for a straightforward GPU implementation of the three-dimensional FDTD method, although the GPU/CPU speed ratio varies with the calculation domain and thread block size.
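To indicate what such an implementation looks like, the kernel below sketches a single Yee-grid update step for one field component, with generic names and a lumped coefficient of our choosing; it is not the authors' code.

    // Update Ez from the curl of H using backward differences.
    __global__ void update_ez(float* ez, const float* hx, const float* hy,
                              float cb, int nx, int ny, int nz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i < 1 || j < 1 || i >= nx || j >= ny || k >= nz) return;

        int id = (k * ny + j) * nx + i;
        float curl = (hy[id] - hy[id - 1])      // dHy/dx
                   - (hx[id] - hx[id - nx]);    // dHx/dy
        ez[id] += cb * curl;
    }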
Simulation of isothermal multi-phase fuel-coolant interaction using MPS method with GPU acceleration
Energy Technology Data Exchange (ETDEWEB)
Gou, W.; Zhang, S.; Zheng, Y. [Zhejiang Univ., Hangzhou (China). Center for Engineering and Scientific Computation
2016-07-15
Energetic fuel-coolant interaction (FCI) has been one of the primary safety concerns in nuclear power plants. A Graphics Processing Unit (GPU) implementation of the moving particle semi-implicit (MPS) method is presented and used to simulate the fuel-coolant interaction problem. The governing equations are discretized with the particle interaction model of MPS, and a detailed single-GPU implementation is introduced. A three-dimensional broken-dam problem is simulated to verify the developed GPU-accelerated MPS method, which is then used to simulate the FCI problem. In summary, the developed GPU-MPS method showed good agreement with experimental observation and theoretical prediction.
Towards a Unified Sentiment Lexicon Based on Graphics Processing Units
Directory of Open Access Journals (Sweden)
Liliana Ibeth Barbosa-Santillán
2014-01-01
This paper presents an approach to create what we have called a Unified Sentiment Lexicon (USL). This approach aims at aligning, unifying, and expanding the set of sentiment lexicons available on the web in order to increase their robustness of coverage. One problem related to the automatic unification of different sentiment lexicon scores is that there are multiple lexical entries for which the classification as positive, negative, or neutral {P, N, Z} depends on the unit of measurement used in the annotation methodology of the source lexicon. Our USL approach computes the unified strength of polarity of each lexical entry based on the Pearson correlation coefficient, which measures how correlated lexical entries are with a value between 1 and −1, where 1 indicates that the entries are perfectly correlated, 0 indicates no correlation, and −1 means they are perfectly inversely correlated; the UnifiedMetrics procedure is defined likewise for CPU and GPU. Another problem is the high processing time required for computing all the lexical entries in the unification task. The USL approach therefore computes a subset of lexical entries in each of the 1,344 GPU cores and uses parallel processing to unify 155,802 lexical entries. The analysis conducted with the USL approach shows that the USL has 95,430 lexical entries, of which 35,201 are considered positive, 22,029 negative, and 38,200 neutral. Finally, the runtime was 10 minutes for the 95,430 lexical entries, a threefold reduction in computing time for the UnifiedMetrics.
Real time 3D structural and Doppler OCT imaging on graphics processing units
Sylwestrzak, Marcin; Szlag, Daniel; Szkulmowski, Maciej; Gorczyńska, Iwona; Bukowska, Danuta; Wojtkowski, Maciej; Targowski, Piotr
2013-03-01
In this report, the application of graphics processing unit (GPU) programming for real-time 3D Fourier domain Optical Coherence Tomography (FdOCT) imaging, with implementation of Doppler algorithms for visualization of flows in capillary vessels, is presented. Generally, the processing time of FdOCT data on the main processor of the computer (CPU) constitutes the main limitation for real-time imaging, and employing additional algorithms, such as Doppler OCT analysis, makes this processing even more time consuming. Recently developed GPUs, which offer very high computational power, provide a solution to this problem: taking advantage of them for massively parallel data processing allows real-time imaging in FdOCT. The presented software for structural and Doppler OCT performs the whole processing, with visualization, of 2D data consisting of 2000 A-scans generated from 2048-pixel spectra at a frame rate of about 120 fps. The 3D imaging in the same mode of volume data built of 220 × 100 A-scans is performed at a rate of about 8 frames per second. In this paper the software architecture, the organization of threads, and the optimizations applied are shown. For illustration, screen shots recorded during real-time imaging of a phantom (homogeneous water solution of Intralipid in a glass capillary) and the human eye in vivo are presented.
García-Calvo, Raúl; Guisado, J L; Diaz-Del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco
2018-01-01
Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge amount of rule combinations and the nonlinear inherent nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes (master-slave, island, cellular, and hybrid models) and various individual selection methods (roulette, elitist) is carried out for this problem. Several procedures that optimize the use of the GPU's resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousands of individuals grouped in a few islands using elitist selection. This model comprises two powerful factors for discovering the best solutions: finding good individuals in a short number of generations, and introducing genetic diversity via a relatively frequent and numerous migration. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on a medium-class GPU over an equivalent …
GPU PRO 3 Advanced rendering techniques
Engel, Wolfgang
2012-01-01
GPU Pro3, the third volume in the GPU Pro book series, offers practical tips and techniques for creating real-time graphics that are useful to beginners and seasoned game and graphics programmers alike. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Wessam Bahnassi, and Sebastien St-Laurent have once again brought together a high-quality collection of cutting-edge techniques for advanced GPU programming. With contributions by more than 50 experts, GPU Pro3: Advanced Rendering Techniques covers battle-tested tips and tricks for creating interesting geometry, realistic sha…
Reflector antenna analysis using physical optics on Graphics Processing Units
DEFF Research Database (Denmark)
Borries, Oscar Peter; Sørensen, Hans Henrik Brandenborg; Dammann, Bernd
2014-01-01
The Physical Optics approximation is a widely used asymptotic method for calculating the scattering from electrically large bodies. It requires significant computational work and little memory, and is thus well suited for application on a Graphics Processing Unit. Here, we investigate the performance …
Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing
Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.
2014-12-01
After the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information has become important for understanding earthquake phenomena. At the same time, the quantity of seismic data has become enormous with the progress of high-accuracy observation networks, and many parameters (e.g., positional information, origin time, magnitude, etc.) must be treated to display seismic information efficiently. High-speed processing of data and image information is therefore necessary to handle such enormous amounts of seismic data. Recently, the GPU (Graphic Processing Unit) has been used as an acceleration tool for data processing and calculation in various study fields, a movement called GPGPU (General-Purpose computing on GPUs). In the last few years the performance of GPUs has kept improving rapidly, and GPU computing provides a high-performance computing environment at a lower cost than before. Moreover, the GPU has an advantage for visualization of processed data, because it was originally designed for graphics processing: in GPU computing, the processed data are always stored in video memory, so drawing information can be written directly to the VRAM on the video card by combining CUDA with a graphics API. In this study, we employ CUDA together with OpenGL and/or DirectX to realize a full-GPU implementation. This makes it possible to write drawing information to the VRAM without PCIe bus data transfer, enabling high-speed processing of seismic data. The present study examines GPU computing-based high-speed visualization and its feasibility for a high-speed visualization system for hypocenter data.
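The CUDA/graphics interoperability path described can be sketched as follows, using the standard CUDA runtime interop API with illustrative names (this is not the authors' code): a GL vertex buffer is registered with CUDA, mapped, filled by a kernel, and unmapped for rendering, with no PCIe round trip of the data.

    #include <GL/gl.h>
    #include <cuda_gl_interop.h>

    cudaGraphicsResource_t resource;

    void register_vbo(GLuint vbo)
    {
        cudaGraphicsGLRegisterBuffer(&resource, vbo,
                                     cudaGraphicsRegisterFlagsWriteDiscard);
    }

    void fill_vbo_from_kernel(void)
    {
        float4* dVertices = NULL;
        size_t  nBytes    = 0;
        cudaGraphicsMapResources(1, &resource, 0);
        cudaGraphicsResourceGetMappedPointer((void**)&dVertices, &nBytes,
                                             resource);
        // A kernel (not shown) writes x/y/depth/magnitude per hypocenter.
        cudaGraphicsUnmapResources(1, &resource, 0);
        // OpenGL can now draw the buffer directly, e.g. with glDrawArrays.
    }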
2015-06-01
… implementation of the direct interaction, called the particle-to-particle (P2P) kernel, for a shared-memory single-GPU device using the Compute Unified Device Architecture … the GPU-defined P2P kernel we developed using the Compute Unified Device Architecture (CUDA). A brief outline of the rest of this work follows. … The computing environment used for this work is a 64-node heterogeneous cluster consisting of 48 IBM dx360M4 nodes, each with one Intel Phi …
Xu, Jing; Wong, Kevin; Jian, Yifan; Sarunic, Marinko V
2014-02-01
In this report, we describe a graphics processing unit (GPU)-accelerated processing platform for real-time acquisition and display of flow contrast images with Fourier domain optical coherence tomography (FDOCT) in mouse and human eyes in vivo. Motion contrast from blood flow is processed using the speckle variance OCT (svOCT) technique, which relies on the acquisition of multiple B-scan frames at the same location and tracking the change of the speckle pattern. Real-time mouse and human retinal imaging using two different custom-built OCT systems with processing and display performed on GPU are presented with an in-depth analysis of performance metrics. The display output included structural OCT data, en face projections of the intensity data, and the svOCT en face projections of retinal microvasculature; these results compare projections with and without speckle variance in the different retinal layers to reveal significant contrast improvements. As a demonstration, videos of real-time svOCT for in vivo human and mouse retinal imaging are included in our results. The capability of performing real-time svOCT imaging of the retinal vasculature may be a useful tool in a clinical environment for monitoring disease-related pathological changes in the microcirculation such as diabetic retinopathy.
Significantly reducing registration time in IGRT using graphics processing units
DEFF Research Database (Denmark)
Noe, Karsten Østergaard; Denis de Senneville, Baudouin; Tanderup, Kari
2008-01-01
… respiration phases in a free-breathing volunteer and 41 anatomical landmark points in each image series. The registration method used is a multi-resolution GPU implementation of the 3D Horn and Schunck algorithm, based on the CUDA framework from Nvidia. Results: On an Intel Core 2 CPU at 2.4 GHz each registration took 30 minutes. On an Nvidia Geforce 8800GTX GPU in the same machine the registration took 37 seconds, making the GPU version 48.7 times faster. The nine image series of different respiration phases were registered to the same reference image (full inhale). Accuracy was evaluated on landmark …
GPU-BSM: a GPU-based tool to map bisulfite-treated reads.
Directory of Open Access Journals (Sweden)
Andrea Manconi
Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.
Parallel Computer System for 3D Visualization Stereo on GPU
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on Graphics Processing Units (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of ray intersections with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. A generalized procedure for 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions in the GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration of the multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of the size and configuration of the Compute Unified Device Architecture (CUDA) computing grid on the computational speed shows the importance of their correct selection. The obtained experimental estimates can be significantly improved by new GPUs with a larger number of processing cores and multiprocessors, as well as an optimized configuration of the CUDA computing grid.
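How a stereo pair of primary rays can be generated per pixel is illustrated below with an assumed pinhole camera model and an invented eye-separation parameter; this is our sketch, not the authors' modified ray tracer.

    struct Ray { float3 origin, dir; };

    __global__ void primary_rays(Ray* left, Ray* right, float3 eye,
                                 float separation, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        // Pixel center on a unit-distance image plane in front of the eye.
        float u = (x + 0.5f) / width  * 2.0f - 1.0f;
        float v = (y + 0.5f) / height * 2.0f - 1.0f;
        float3 target = make_float3(eye.x + u, eye.y + v, eye.z + 1.0f);

        int id = y * width + x;
        float3 oL = make_float3(eye.x - 0.5f * separation, eye.y, eye.z);
        float3 oR = make_float3(eye.x + 0.5f * separation, eye.y, eye.z);
        left[id].origin  = oL;                 // directions left
        right[id].origin = oR;                 // unnormalized for brevity
        left[id].dir  = make_float3(target.x - oL.x, target.y - oL.y,
                                    target.z - oL.z);
        right[id].dir = make_float3(target.x - oR.x, target.y - oR.y,
                                    target.z - oR.z);
    }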
Yi, Faliu; Lee, Jieun; Moon, Inkyu
2014-05-01
The reconstruction of multiple depth images with a ray back-propagation algorithm in three-dimensional (3D) computational integral imaging is computationally burdensome. Further, a reconstructed depth image consists of a focus and an off-focus area. Focus areas are 3D points on the surface of an object that are located at the reconstructed depth, while off-focus areas include 3D points in free-space that do not belong to any object surface in 3D space. Generally, without being removed, the presence of an off-focus area would adversely affect the high-level analysis of a 3D object, including its classification, recognition, and tracking. Here, we use a graphics processing unit (GPU) that supports parallel processing with multiple processors to simultaneously reconstruct multiple depth images using a lookup table containing the shifted values along the x and y directions for each elemental image in a given depth range. Moreover, each 3D point on a depth image can be measured by analyzing its statistical variance with its corresponding samples, which are captured by the two-dimensional (2D) elemental images. These statistical variances can be used to classify depth image pixels as either focus or off-focus points. At this stage, the measurement of focus and off-focus points in multiple depth images is also implemented in parallel on a GPU. Our proposed method is conducted based on the assumption that there is no occlusion of the 3D object during the capture stage of the integral imaging process. Experimental results have demonstrated that this method is capable of removing off-focus points in the reconstructed depth image. The results also showed that using a GPU to remove the off-focus points could greatly improve the overall computational speed compared with using a CPU.
Massive Parallelism of Monte-Carlo Simulation on Low-End Hardware using Graphic Processing Units
Energy Technology Data Exchange (ETDEWEB)
Mburu, Joe Mwangi; Hah, Chang Joo Hah [KEPCO International Nuclear Graduate School, Ulsan (Korea, Republic of)
2014-05-15
Within the past decade, research has been done on utilizing GPU massive parallelization in core simulation with impressive results, but unfortunately not much commercial application has been made in the nuclear field, especially in reactor core simulation. The purpose of this paper is to give an introductory concept of the topic and illustrate the potential of exploiting the massively parallel nature of GPU computing on a simple Monte-Carlo simulation with very minimal hardware specifications. For a comparative analysis, a simple two-dimensional Monte-Carlo simulation is implemented for both the CPU and GPU in order to evaluate the performance gain of each computing device. The heterogeneous platform utilized in this analysis is a low-end notebook with only a 1 GHz processor. The end results are quite surprising, with high speedups of almost a factor of 10. In this work, we have utilized heterogeneous computing in a GPU-based approach to a highly arithmetic-intensive calculation. By running a complex Monte-Carlo simulation on the GPU platform, we have sped up the computational process by almost a factor of 10, based on one million neutrons. This shows how easy, cheap and efficient it is to use a GPU to accelerate scientific computing, and the results should encourage further exploration of this avenue, especially in nuclear reactor physics simulation, where deterministic and stochastic calculations are quite favourable for parallelization.
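The flavor of such a minimal GPU Monte-Carlo experiment can be conveyed with the sketch below, which estimates pi with cuRAND rather than transporting neutrons; it is an illustration of ours, not the paper's code.

    #include <curand_kernel.h>

    __global__ void mc_pi(unsigned long long seed, int samplesPerThread,
                          unsigned int* hits)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, tid, 0, &state);        // one stream per thread

        unsigned int inside = 0;
        for (int s = 0; s < samplesPerThread; ++s) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) ++inside;
        }
        atomicAdd(hits, inside);  // pi ~ 4 * hits / (threads * samples)
    }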
Heterogeneous Multicore Parallel Programming for Graphics Processing Units
Directory of Open Access Journals (Sweden)
Francois Bodin
2009-01-01
Hybrid parallel multicore architectures based on graphics processing units (GPUs) can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware displays a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in an unintrusive manner while preserving the legacy code.
Graphic Arts: Book Three. The Press and Related Processes.
Farajollahi, Karim; And Others
The third of a three-volume set of instructional materials for a graphic arts course, this manual consists of nine instructional units dealing with presses and related processes. Covered in the units are basic press fundamentals, offset press systems, offset press operating procedures, offset inks and dampening chemistry, preventive maintenance…
The Use of Computer Graphics in the Design Process.
Palazzi, Maria
This master's thesis examines applications of computer technology to the field of industrial design and ways in which technology can transform the traditional process. Following a statement of the problem, the history and applications of the fields of computer graphics and industrial design are reviewed. The traditional industrial design process…
Visualisation for Stochastic Process Algebras: The Graphic Truth
DEFF Research Database (Denmark)
Smith, Michael James Andrew; Gilmore, Stephen
2011-01-01
… and stochastic activity networks provide an automaton-based view of the model, which may be easier to visualise, at the expense of portability. In this paper, we argue that we can achieve the benefits of both approaches by generating a graphical view of a stochastic process algebra model, which is synchronised …
Nikolskiy, V. P.; Stegailov, V. V.
2018-01-01
Metal nanoparticles (NPs) serve as important tools for many modern technologies. However, the proper microscopic models of the interaction between ultrashort laser pulses and metal NPs are currently not very well developed in many cases. One part of the problem is the description of the warm dense matter that is formed in NPs after intense irradiation. Another part of the problem is the description of the electromagnetic waves around NPs. Description of wave propagation requires the solution of Maxwell’s equations and the finite-difference time-domain (FDTD) method is the classic approach for solving them. There are many commercial and free implementations of FDTD, including the open source software that supports graphics processing unit (GPU) acceleration. In this report we present the results on the FDTD calculations for different cases of the interaction between ultrashort laser pulses and metal nanoparticles. Following our previous results, we analyze the efficiency of the GPU acceleration of the FDTD algorithm.
Implementing the lattice Boltzmann model on commodity graphics hardware
International Nuclear Information System (INIS)
Kaufman, Arie; Fan, Zhe; Petkov, Kaloian
2009-01-01
Modern graphics processing units (GPUs) can perform general-purpose computations in addition to the native specialized graphics operations. Due to the highly parallel nature of graphics processing, the GPU has evolved into a many-core coprocessor that supports high data parallelism. Its performance has been growing at a rate of squared Moore's law, and its peak floating point performance exceeds that of the CPU by an order of magnitude. Therefore, it is a viable platform for time-sensitive and computationally intensive applications. The lattice Boltzmann model (LBM) computations are carried out via linear operations at discrete lattice sites, which can be implemented efficiently using a GPU-based architecture. Our simulations produce results comparable to the CPU version while improving performance by an order of magnitude. We have demonstrated that the GPU is well suited for interactive simulations in many applications, including simulating fire, smoke, lightweight objects in wind, jellyfish swimming in water, and heat shimmering and mirage (using the hybrid thermal LBM). We further advocate the use of a GPU cluster for large scale LBM simulations and for high performance computing. The Stony Brook Visual Computing Cluster has been the platform for several applications, including simulations of real-time plume dispersion in complex urban environments and thermal fluid dynamics in a pressurized water reactor. Major GPU vendors have been targeting the high performance computing market with GPU hardware implementations. Software toolkits such as NVIDIA CUDA provide a convenient development platform that abstracts the GPU and allows access to its underlying stream computing architecture. However, software programming for a GPU cluster remains a challenging task. We have therefore developed the Zippy framework to simplify GPU cluster programming. Zippy is based on global arrays combined with the stream programming model and it hides the low-level details of the …
GPU-based Parallel Application Design for Emerging Mobile Devices
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices (limited compute, memory bandwidth, and stringent power consumption requirements) while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, which facilitates an efficient application implementation. Finally, as …
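The Persistent Threads style mentioned above can be sketched as follows, as a generic illustration rather than the dissertation's code: a fixed set of resident blocks drains a global work queue instead of mapping one thread per work item.

    __global__ void persistent_worker(int* queueHead, int numItems,
                                      const float* work, float* result)
    {
        __shared__ int item;
        while (true) {
            if (threadIdx.x == 0)
                item = atomicAdd(queueHead, 1);   // one fetch per block
            __syncthreads();
            if (item >= numItems) return;         // queue drained
            if (threadIdx.x == 0)
                result[item] = work[item] * 2.0f; // placeholder processing
            __syncthreads();                      // before the next fetch
        }
    }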
Fully 3D GPU PET reconstruction
Energy Technology Data Exchange (ETDEWEB)
Herraiz, J.L., E-mail: joaquin@nuclear.fis.ucm.es [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Espana, S. [Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA (United States); Cal-Gonzalez, J. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Vaquero, J.J. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Desco, M. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Unidad de Medicina y Cirugia Experimental, Hospital General Universitario Gregorio Maranon, Madrid (Spain); Udias, J.M. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain)
2011-08-21
Fully 3D iterative tomographic image reconstruction is computationally very demanding. The Graphics Processing Unit (GPU) has been proposed for many years as a potential accelerator for complex scientific problems, but only with recent advances in GPU programmability have the best available reconstruction codes begun to be implemented to run on GPUs. This work presents GPU-based fully 3D PET iterative reconstruction software. The new code can reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET, and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining the same images in both cases. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.
ALICE HLT high speed tracking on GPU
Gorbunov, Sergey; Aamodt, Kenneth; Alt, Torsten; Appelshauser, Harald; Arend, Andreas; Bach, Matthias; Becker, Bruce; Bottger, Stefan; Breitner, Timo; Busching, Henner; Chattopadhyay, Sukalyan; Cleymans, Jean; Cicalo, Corrado; Das, Indranil; Djuvsland, Oystein; Engel, Heiko; Erdal, Hege Austrheim; Fearick, Roger; Haaland, Oystein Senneset; Hille, Per Thomas; Kalcher, Sebastian; Kanaki, Kalliopi; Kebschull, Udo Wolfgang; Kisel, Ivan; Kretz, Matthias; Lara, Camillo; Lindal, Sven; Lindenstruth, Volker; Masoodi, Arshad Ahmad; Ovrebekk, Gaute; Panse, Ralf; Peschek, Jorg; Ploskon, Mateusz; Pocheptsov, Timur; Ram, Dinesh; Rascanu, Theodor; Richter, Matthias; Rohrich, Dieter; Ronchetti, Federico; Skaali, Bernhard; Smorholm, Olav; Stokkevag, Camilla; Steinbeck, Timm Morten; Szostak, Artur; Thader, Jochen; Tveter, Trine; Ullaland, Kjetil; Vilakazi, Zeblon; Weis, Robert; Yin, Zhong-Bao; Zelnicek, Pierre
2011-01-01
The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding to an input data stream of 30 GB/s. In order to fulfill the time requirements, a fast on-line tracker has been developed. The algorithm combines a Cellular Automaton method, used for fast pattern recognition, with the Kalman Filter method for fitting the found trajectories and for the final track selection. The tracker was adapted to run on Graphics Processing Units (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The implementation of the algorithm had to be adjusted at many points to allow for efficient usage of the graphics cards. In particular, achieving a good overall workload for many processor cores, efficient transfer to and from the GPU, as well as optimized utilization of the different memories the GPU offers turned out to be critical …
Numerical simulation of lava flow using a GPU SPH model
Directory of Open Access Journals (Sweden)
Eugenio Rustico
2011-12-01
A smoothed particle hydrodynamics (SPH) method for lava-flow modeling was implemented on a graphical processing unit (GPU) using the compute unified device architecture (CUDA) developed by NVIDIA, resulting in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with a free surface, non-Newtonian fluids, and phase change. The SPH code has three main components: neighbor list construction, force computation, and integration of the equation of motion. All three are computed on the GPU, fully exploiting its computational power. The simulation speed achieved is one to two orders of magnitude faster than the equivalent central processing unit (CPU) code. This GPU implementation of SPH allows high-resolution SPH modeling in hours and days, rather than weeks and months, on inexpensive and readily available hardware.
GPU-Based Techniques for Global Illumination Effects
Szirmay-Kalos, László; Sbert, Mateu
2008-01-01
This book presents techniques to render photo-realistic images by programming the Graphics Processing Unit (GPU). We discuss effects such as mirror reflections, refractions, caustics, diffuse or glossy indirect illumination, radiosity, single or multiple scattering in participating media, tone reproduction, glow, and depth of field. This book targets game developers, graphics programmers, and also students with some basic understanding of computer graphics algorithms, rendering APIs like Direct3D or OpenGL, and shader programming. In order to make this book self-contained, the most important c…
GPU Based Software Correlators - Perspectives for VLBI2010
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Caused by their historical separation from CPUs and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved into massively parallel processing systems which have entered the area of non-graphics applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms classical processors when the application can be parallelized. Thus, in recent years various radio astronomy projects have started to make use of this technology, either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Here, the feasibility of GPUs as a choice for a VLBI correlator is investigated, including the pros and cons of this technology. Additionally, a GPU-based software correlator is reviewed with respect to energy consumption per GFlop/s and cost per GFlop/s.
Thermal/Heat Transfer Analysis Using a Graphic Processing Unit (GPU) Enabled Computing Environment
National Aeronautics and Space Administration — Simulation technology plays an important role in propulsion test facility design and development by assessing risks, identifying failure modes, and predicting...
A Block-Asynchronous Relaxation Method for Graphics Processing Units
Anzt, H.; Dongarra, J.; Heuveline, Vincent; Tomov, S.
2011-01-01
In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). For this purpose, we developed a set of asynchronous iteration algorithms in CUDA and compared them with a parallel implementation of synchronous relaxation methods on CPU-based systems. For a set of test matrices taken from the University of Florida Matrix Collection we monitor the convergence behavior, the average iteration time and the total time-to-solution time. Analyzing the r...
Parallel GPU implementation of iterative PCA algorithms.
Andrecut, M
2009-11-01
Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets, the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).
Directory of Open Access Journals (Sweden)
Maskell Douglas L
2009-05-01
Background: The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Findings: Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K, and for query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS (lowest 9.039, highest 9.660), while the dual-GPU version achieves an average performance of 14.484 GCUPS (lowest 10.660, highest 16.087). Conclusion: CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.
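For reference, the per-cell recurrence that such Smith-Waterman implementations evaluate in parallel is the standard affine-gap update; the device function below is a didactic sketch of that recurrence, not CUDASW++ source.

    // H = max(0, Hdiag + sub, E, F), with E/F tracking affine gap states.
    __device__ int sw_cell(int diag, int up, int left,
                           int eUp, int fLeft, int sub,
                           int gapOpen, int gapExtend,
                           int* eOut, int* fOut)
    {
        int e = max(eUp   - gapExtend, up   - gapOpen);  // gap in query
        int f = max(fLeft - gapExtend, left - gapOpen);  // gap in subject
        int h = max(0, diag + sub);   // local alignment floors at zero
        h = max(h, max(e, f));
        *eOut = e; *fOut = f;
        return h;
    }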
Masset, Frédéric
2015-09-01
GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
gPGA: GPU Accelerated Population Genetics Analyses.
Directory of Open Access Journals (Sweden)
Chunbao Zhou
Full Text Available The isolation with migration (IM) model is important for studies in population genetics and phylogeography. The IM program applies the IM model to genetic data drawn from a pair of closely related populations or species, based on Markov chain Monte Carlo (MCMC) simulations of gene genealogies. However, the computational burden of the IM program has placed limits on its application. With its strong computational power, the Graphics Processing Unit (GPU) has been widely used in many fields. In this article, we present an effective implementation of the IM program on one GPU based on the Compute Unified Device Architecture (CUDA), which we call gPGA. Compared with the IM program, gPGA can achieve up to a 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA.
International Nuclear Information System (INIS)
Ding Yu; Qi Yujin; Zhang Xuezhu; Zhao Cuilan
2011-01-01
In this paper, we report the development of a high-performance image processing platform based on a CPU-GPU heterogeneous cluster. Currently, it consists of Dell Precision T7500 and HP XW8600 workstations with a parallel programming and runtime environment, using the message-passing interface (MPI) and CUDA (Compute Unified Device Architecture). We succeeded in developing parallel image processing techniques for 3D image reconstruction in X-ray micro-CT imaging. The results show that a GPU computes about 194 times faster than a single CPU, and the CPU-GPU cluster computes about 46 times faster than the CPU cluster. These meet the requirements of rapid 3D image reconstruction and real-time image display. In conclusion, the use of a CPU-GPU heterogeneous cluster is an effective way to build a high-performance image processing platform. (authors)
GPU accelerated manifold correction method for spinning compact binaries
Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying
2018-04-01
The graphics processing unit (GPU) acceleration of the manifold correction algorithm based on the compute unified device architecture (CUDA) technology is designed to simulate the dynamic evolution of the post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and the efficiency of parallel computation on the GPU have been confirmed by various numerical experiments. The numerical comparisons show that the accuracy of the manifold correction method executed on the GPU agrees well with that of the same codes executed purely on the central processing unit (CPU). The acceleration achieved on the GPU can be increased enormously through the use of shared memory and register optimization techniques without additional hardware costs: for a phase-space scan comprising 314 × 314 orbits, the speedup is nearly 13 times compared with the codes executed on the CPU. In addition, the GPU-accelerated manifold correction method is used to numerically study how the dynamics are affected by the spin-induced quadrupole-monopole interaction for a black hole binary system.
Incompressible SPH (ISPH) with fast Poisson solver on a GPU
Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.
2018-05-01
This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU), capable of simulating several millions of particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed with this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as four separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dambreaks (2-D and 3-D) which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dambreak simulation, the GPU demonstrates speedups of 10-18 times over single-threaded and 1.1-4.5 times over 16-threaded CPU run times, respectively.
García-Calvo, Raúl; Guisado, JL; Diaz-del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco
2018-01-01
Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge number of rule combinations and the inherently nonlinear nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes—master-slave, island, cellular, and hybrid models, and various individual selection methods (roulette, elitist)—is carried out for this problem. Several procedures that optimize the use of the GPU’s resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousand individuals grouped in a few islands using elitist selection. This model combines two decisive factors for discovering the best solutions: finding good individuals in a small number of generations, and introducing genetic diversity via relatively frequent and numerous migration. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on medium class GPU over an equivalent
MHD code using multi graphical processing units: SMAUG+
Gyenge, N.; Griffiths, M. K.; Erdélyi, R.
2018-01-01
This paper introduces the Sheffield Magnetohydrodynamics Algorithm Using GPUs (SMAUG+), an advanced numerical code for solving magnetohydrodynamic (MHD) problems using multi-GPU systems. Multi-GPU systems facilitate the development of accelerated codes and enable us to investigate larger model sizes and/or more detailed computational domain resolutions. This is a significant advancement over the parent single-GPU MHD code, SMAUG (Griffiths et al., 2015). Here, we demonstrate the validity of the SMAUG+ code, describe the parallelisation techniques and investigate performance benchmarks. The initial configuration of the Orszag-Tang vortex simulations is distributed among 4, 16, 64 and 100 GPUs. Furthermore, different simulation box resolutions are applied: 1000 × 1000, 2044 × 2044, 4000 × 4000 and 8000 × 8000. We also tested the code with the Brio-Wu shock tube simulations, with a model size of 800, employing up to 10 GPUs. Based on the test results, we observed speedups and slowdowns, depending on the granularity and the communication overhead of certain parallel tasks. The main aim of the code development is to provide a massively parallel code without the memory limitation of a single GPU. By using our code, the applied model size can be significantly increased. We demonstrate that we are able to successfully compute numerically valid and large 2D MHD problems.
Directory of Open Access Journals (Sweden)
G Boroni
2017-03-01
Full Text Available The Lattice Boltzmann Method (LBM) has shown great potential in fluid simulations, but performance issues and difficulties in managing complex boundary conditions have hindered wider application. The advent of Graphics Processing Unit (GPU) computing offered a possible solution for the performance issue, and methods like the Immersed Boundary (IB) algorithm proved to be a flexible solution to boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation on the GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data, showing good agreement while improving the computational time. Substantial reductions in calculation time were achieved, lowering the time required to execute the same model on a CPU by about two orders of magnitude.
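A hedged sketch of the data-parallel core of such a solver follows, assuming a D2Q9 lattice with BGK collision and one thread per lattice node; streaming and the immersed-boundary force coupling would be separate kernels, and all names are illustrative rather than taken from the paper's code.

```cuda
#include <cuda_runtime.h>

// D2Q9 weights and lattice velocities.
__constant__ float w[9]  = { 4.f/9,  1.f/9,  1.f/9,  1.f/9,  1.f/9,
                             1.f/36, 1.f/36, 1.f/36, 1.f/36 };
__constant__ int   cx[9] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
__constant__ int   cy[9] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

// BGK collision, one thread per node; f holds nine planes of nx*ny values.
__global__ void lbm_bgk_collide(float *f, int nx, int ny, float omega)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int idx = y * nx + x, sz = nx * ny;

    // Macroscopic density and velocity from the distributions.
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int k = 0; k < 9; ++k) {
        float fk = f[k * sz + idx];
        rho += fk; ux += fk * cx[k]; uy += fk * cy[k];
    }
    ux /= rho; uy /= rho;

    // Relax each distribution toward its local equilibrium.
    float usq = ux * ux + uy * uy;
    for (int k = 0; k < 9; ++k) {
        float cu  = cx[k] * ux + cy[k] * uy;
        float feq = w[k] * rho * (1.f + 3.f*cu + 4.5f*cu*cu - 1.5f*usq);
        f[k * sz + idx] -= omega * (f[k * sz + idx] - feq);
    }
}
```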
gWEGA: GPU-accelerated WEGA for molecular superposition and shape comparison.
Yan, Xin; Li, Jiabo; Gu, Qiong; Xu, Jun
2014-06-05
Virtual screening of a large chemical library for drug lead identification requires searching/superimposing a large number of three-dimensional (3D) chemical structures. This article reports a graphic processing unit (GPU)-accelerated weighted Gaussian algorithm (gWEGA) that expedites shape or shape-feature similarity score-based virtual screening. With 86 GPU nodes (each node has one GPU card), gWEGA can screen 110 million conformations derived from an entire ZINC drug-like database with diverse antidiabetic agents as query structures within 2 s (i.e., screening more than 55 million conformations per second). The rapid screening speed was accomplished through the massive parallelization on multiple GPU nodes and rapid prescreening of 3D structures (based on their shape descriptors and pharmacophore feature compositions). Copyright © 2014 Wiley Periodicals, Inc.
Wu, Xin; Koslowski, Axel; Thiel, Walter
2012-07-10
In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
Fast analytical scatter estimation using graphics processing units.
Ingleby, Harry; Lippuner, Jonas; Rickey, Daniel W; Li, Yue; Elbakri, Idris
2015-01-01
To develop a fast patient-specific analytical estimator of first-order Compton and Rayleigh scatter in cone-beam computed tomography, implemented using graphics processing units. The authors developed an analytical estimator for first-order Compton and Rayleigh scatter in a cone-beam computed tomography geometry. The estimator was coded using NVIDIA's CUDA environment for execution on an NVIDIA graphics processing unit. Performance of the analytical estimator was validated by comparison with high-count Monte Carlo simulations for two different numerical phantoms. Monoenergetic analytical simulations were compared with monoenergetic and polyenergetic Monte Carlo simulations. Analytical and Monte Carlo scatter estimates were compared both qualitatively, from visual inspection of images and profiles, and quantitatively, using a scaled root-mean-square difference metric. Reconstruction of simulated cone-beam projection data of an anthropomorphic breast phantom illustrated the potential of this method as a component of a scatter correction algorithm. The monoenergetic analytical and Monte Carlo scatter estimates showed very good agreement. The monoenergetic analytical estimates showed good agreement for Compton single scatter and reasonable agreement for Rayleigh single scatter when compared with polyenergetic Monte Carlo estimates. For a voxelized phantom with dimensions 128 × 128 × 128 voxels and a detector with 256 × 256 pixels, the analytical estimator required 669 seconds for a single projection, using a single NVIDIA 9800 GX2 video card. Accounting for first-order scatter in cone-beam image reconstruction improves the contrast-to-noise ratio of the reconstructed images. The analytical scatter estimator, implemented using graphics processing units, provides rapid and accurate estimates of single scatter; with further acceleration and a method to account for multiple scatter, it may be useful for practical scatter correction schemes.
Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card
Jiang, Jinpeng; Zhu, Peimin
2018-05-01
Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for high-performance computing (HPC). To reduce the long computation time, we design and implement a GPU-based 2D elastic FWI (EFWI) in the time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of the relatively small global memory on the GPU, a boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion accelerates the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling and about 12 times speedup in gradient calculation, compared with eight-core CPU implementations optimized by OpenMP. The results from the GPU implementations are verified to have sufficient accuracy by comparison with the results obtained from the CPU implementations.
GPU-accelerated Gibbs ensemble Monte Carlo simulations of Lennard-Jonesium
Mick, Jason; Hailat, Eyad; Russo, Vincent; Rushaidat, Kamel; Schwiebert, Loren; Potoff, Jeffrey
2013-12-01
This work describes an implementation of canonical and Gibbs ensemble Monte Carlo simulations on graphics processing units (GPUs). The pair-wise energy calculations, which consume the majority of the computational effort, are parallelized using the energetic decomposition algorithm. While energetic decomposition is relatively inefficient for traditional CPU-bound codes, the algorithm is ideally suited to the architecture of the GPU. The performance of the CPU and GPU codes are assessed for a variety of CPU and GPU combinations for systems containing between 512 and 131,072 particles. For a system of 131,072 particles, the GPU-enabled canonical and Gibbs ensemble codes were 10.3 and 29.1 times faster (GTX 480 GPU vs. i5-2500K CPU), respectively, than an optimized serial CPU-bound code. Due to overhead from memory transfers from system RAM to the GPU, the CPU code was slightly faster than the GPU code for simulations containing less than 600 particles. The critical temperature Tc∗=1.312(2) and density ρc∗=0.316(3) were determined for the tail corrected Lennard-Jones potential from simulations of 10,000 particle systems, and found to be in exact agreement with prior mixed field finite-size scaling calculations [J.J. Potoff, A.Z. Panagiotopoulos, J. Chem. Phys. 109 (1998) 10914].
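The energetic decomposition assigns each particle's interactions to one thread and sums the per-particle energies afterwards, which wastes half the pair computations but maps naturally onto the GPU; a minimal CUDA sketch in reduced units, with no cutoff and illustrative names, not the authors' code:

```cuda
#include <cuda_runtime.h>

// Each thread accumulates the Lennard-Jones energy of one particle with all
// others; per-particle energies e[i] are then reduced (on host or GPU).
// The factor 0.5 removes the double counting of each pair.
__global__ void lj_energy(const float3 *r, float *e, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 ri = r[i];
    float acc = 0.f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = ri.x - r[j].x, dy = ri.y - r[j].y, dz = ri.z - r[j].z;
        float r2 = dx*dx + dy*dy + dz*dz;
        float s6 = 1.f / (r2 * r2 * r2);   // (sigma/r)^6 with sigma = 1
        acc += 4.f * (s6 * s6 - s6);       // 4*eps*(r^-12 - r^-6), eps = 1
    }
    e[i] = 0.5f * acc;
}
```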
Ultrafast convolution/superposition using tabulated and exponential kernels on GPU
Energy Technology Data Exchange (ETDEWEB)
Chen Quan; Chen Mingli; Lu Weiguo [TomoTherapy Inc., 1240 Deming Way, Madison, Wisconsin 53717 (United States)
2011-03-15
Purpose: Collapsed-cone convolution/superposition (CCCS) dose calculation is the workhorse for IMRT dose calculation. The authors present a novel algorithm for computing CCCS dose on the modern graphics processing unit (GPU). Methods: The GPU algorithm includes a novel TERMA calculation that has no write conflicts and has linear computational complexity. The CCCS algorithm uses either tabulated or exponential cumulative-cumulative kernels (CCKs) as reported in the literature. The authors have demonstrated that the use of exponential kernels can reduce the computational complexity by an order of the problem dimension and achieve excellent accuracy. Special attention is paid to the unique architecture of the GPU, especially the memory access pattern, which increases performance by more than tenfold. Results: As a result, the tabulated kernel implementation on the GPU is two to three times faster than other GPU implementations reported in the literature. The implementation of CCCS showed significant speedup on the GPU over a single-core CPU. On tabulated CCKs, speedups as high as 70 are observed; on exponential CCKs, speedups as high as 90 are observed. Conclusions: Overall, the GPU algorithm using exponential CCKs is 1000-3000 times faster than a highly optimized single-threaded CPU implementation using tabulated CCKs, while the dose differences are within 0.5% and 0.5 mm. This ultrafast CCCS algorithm will allow many time-sensitive applications to use accurate dose calculation.
Image processing and computer graphics in radiology. Pt. A
International Nuclear Information System (INIS)
Toennies, K.D.
1993-01-01
The reports give a full review of all aspects of digital imaging in radiology which are of significance to image processing and the subsequent picture archiving and communication techniques. The review is strongly oriented toward practice and illustrates the various contributions from specialized areas of computer science, such as computer vision, computer graphics, database systems, information and communication systems, man-machine interaction and software engineering. The methods and models available are explained and assessed for their respective performance and value, and basic principles are briefly explained. (DG) [de
Image processing and computer graphics in radiology. Pt. B
International Nuclear Information System (INIS)
Toennies, K.D.
1993-01-01
The reports give a full review of all aspects of digital imaging in radiology which are of significance to image processing and the subsequent picture archiving and communication techniques. The review is strongly oriented toward practice and illustrates the various contributions from specialized areas of computer science, such as computer vision, computer graphics, database systems, information and communication systems, man-machine interaction and software engineering. The methods and models available are explained and assessed for their respective performance and value, and basic principles are briefly explained. (DG) [de
Touch-sensitive graphics terminal applied to process control
International Nuclear Information System (INIS)
Bennion, S.I.; Creager, J.D.; VanHouten, R.D.
1981-01-01
Limited initial demonstrations of the system described took place during September 1980. A single CRT was used as an input device in the control center while operating a furnace and a pellet inspection gage. These two process-line devices were completely controlled, despite the longer-than-desired response times noted, using a single control station located in the control center. The operator could conveniently execute from this remote location any function which could be performed locally at the hard-wired control panels. With the installation of the enhancements, the integrated touchscreen/graphics terminal will provide a preferable alternative to normal keyboard command input devices
Gamma camera image processing and graphical analysis mutual software system
International Nuclear Information System (INIS)
Wang Zhiqian; Chen Yongming; Ding Ailian; Ling Zhiye; Jin Yongjie
1992-01-01
The GCCS gamma camera image processing and graphical analysis system is a dedicated interactive software system. It is mainly used to analyse various patient data acquired from a gamma camera. The system runs on an IBM PC, PC/XT or PC/AT. It consists of several parts: system management, data management, device management, a program package and user programs. The system provides two kinds of user interface: command menus and command characters. The system is easy to modify and extend because it is highly modularized. The user programs include almost all the clinical protocols in current use
Ice-sheet modelling accelerated by graphics cards
Brædstrup, Christian Fredborg; Damsgaard, Anders; Egholm, David Lundbek
2014-11-01
Studies of glaciers and ice sheets have increased the demand for high performance numerical ice flow models over the past decades. When exploring the highly non-linear dynamics of fast flowing glaciers and ice streams, or when coupling multiple flow processes for ice, water, and sediment, researchers are often forced to use super-computing clusters. As an alternative to conventional high-performance computing hardware, the Graphical Processing Unit (GPU) is capable of massively parallel computing while retaining a compact design and low cost. In this study, we present a strategy for accelerating a higher-order ice flow model using a GPU. By applying the newest GPU hardware, we achieve up to 180× speedup compared to a similar but serial CPU implementation. Our results suggest that GPU acceleration is a competitive option for ice-flow modelling when compared to CPU-optimised algorithms parallelised by the OpenMP or Message Passing Interface (MPI) protocols.
GPU based Monte Carlo for PET image reconstruction: detector modeling
International Nuclear Information System (INIS)
Légrády; Cserkaszky, Á; Lantos, J.; Patay, G.; Bükki, T.
2011-01-01
Given the similarities between visible light transport and neutral particle trajectories, Monte Carlo (MC) calculations map onto Graphical Processing Units (GPUs) almost as though the hardware had been designed for this specific task. A GPU-based MC gamma transport code has been developed for Positron Emission Tomography iterative image reconstruction, calculating the projection from unknowns to data at each iteration step while taking into account the full physics of the system. This paper describes the simplified scintillation detector modeling and its effect on convergence. (author)
Basket Option Pricing Using GP-GPU Hardware Acceleration
Douglas, Craig C.
2010-08-01
We introduce a basket option pricing problem arising in financial mathematics. We discretize the problem based on the alternating direction implicit (ADI) method, and parallel cyclic reduction is applied to solve the set of tridiagonal systems generated by the ADI method. To reduce the computational time of the problem, a general-purpose graphics processing unit (GP-GPU) environment is considered. Numerical results confirm the convergence and efficiency of the proposed method. © 2010 IEEE.
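As an indication of how cyclic reduction parallelizes, the following CUDA kernel solves one tridiagonal system per thread block by eliminating, at doubling strides, the coupling of each equation to its neighbours until every equation stands alone. It assumes a power-of-two system size n, a launch with blockDim.x == n and 4*n floats of shared memory, and is a textbook parallel-cyclic-reduction sketch rather than the paper's ADI solver.

```cuda
#include <cuda_runtime.h>

// Parallel cyclic reduction: a, b, c are the sub-, main- and super-diagonals,
// d the right-hand side, x the solution. Launch: pcr_solve<<<1, n, 4*n*4>>>.
__global__ void pcr_solve(const float *a, const float *b, const float *c,
                          const float *d, float *x, int n)
{
    extern __shared__ float s[];
    float *sa = s, *sb = s + n, *sc = s + 2*n, *sd = s + 3*n;

    int i = threadIdx.x;
    sa[i] = a[i]; sb[i] = b[i]; sc[i] = c[i]; sd[i] = d[i];
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        float na = 0.f, nb = sb[i], nc = 0.f, nd = sd[i];
        int lo = i - stride, hi = i + stride;
        if (lo >= 0) {                     // eliminate coupling to row i-stride
            float k = sa[i] / sb[lo];
            nb -= sc[lo] * k; nd -= sd[lo] * k; na = -sa[lo] * k;
        }
        if (hi < n) {                      // eliminate coupling to row i+stride
            float k = sc[i] / sb[hi];
            nb -= sa[hi] * k; nd -= sd[hi] * k; nc = -sc[hi] * k;
        }
        __syncthreads();
        sa[i] = na; sb[i] = nb; sc[i] = nc; sd[i] = nd;
        __syncthreads();
    }
    x[i] = sd[i] / sb[i];                  // every equation is now decoupled
}
```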
GPU implementation of Bayesian neural network construction for data-intensive applications
International Nuclear Information System (INIS)
Perry, Michelle; Meyer-Baese, Anke; Prosper, Harrison B
2014-01-01
We describe a graphical processing unit (GPU) implementation of the Hybrid Markov Chain Monte Carlo (HMC) method for training Bayesian Neural Networks (BNN). Our implementation uses NVIDIA's parallel computing architecture, CUDA. We briefly review BNNs and the HMC method and we describe our implementations and give preliminary results.
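For context, HMC alternates leapfrog integration of the weights and auxiliary momenta with a Metropolis acceptance test; the elementwise updates are trivially parallel, while the dominant cost is the gradient of the negative log posterior (a network forward/backward pass) recomputed at every step. A hedged CUDA sketch of the leapfrog updates, with all names illustrative:

```cuda
#include <cuda_runtime.h>

// Half kick: p <- p - (eps/2) * dU/dw, where U is the negative log posterior.
__global__ void leapfrog_half_kick(float *p, const float *g, float eps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] -= 0.5f * eps * g[i];
}

// Drift: w <- w + eps * p.
__global__ void leapfrog_drift(float *w, const float *p, float eps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) w[i] += eps * p[i];
}
// One full leapfrog step: half_kick, drift, recompute g at the new w, half_kick.
```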
Streaming Parallel GPU Acceleration of Large-Scale filter-based Spiking Neural Networks
L.P. Slazynski (Leszek); S.M. Bohte (Sander)
2012-01-01
The arrival of graphics processing unit (GPU) cards suitable for massively parallel computing promises affordable large-scale neural network simulation previously only available at supercomputing facilities. While the raw numbers suggest that GPUs may outperform CPUs by at least an order of
Efficient GPU-based texture interpolation using uniform B-splines
Ruijters, D.; Haar Romenij, ter B.M.; Suetens, P.
2008-01-01
This article presents uniform B-spline interpolation, completely contained on the graphics processing unit (GPU). This implies that the CPU does not need to compute any lookup tables or B-spline basis functions. The cubic interpolation can be decomposed into several linear interpolations [Sigg and
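The decomposition referred to folds the four cubic B-spline weights into two linear interpolations; with hardware linear filtering each of those becomes a single texture fetch, so no basis functions or lookup tables are needed on the CPU. A self-contained 1D CUDA sketch, with the linear fetch emulated in software and all names illustrative assumptions:

```cuda
#include <cuda_runtime.h>

// Software stand-in for a linearly filtered texture fetch from c[0..n-1].
__device__ float lerp_fetch(const float *c, int n, float x)
{
    int i = (int)floorf(x);
    float a = x - i;
    i = max(0, min(i, n - 2));                 // clamp to the valid range
    return (1.f - a) * c[i] + a * c[i + 1];
}

// Cubic B-spline evaluation at positions xs, from B-spline coefficients coef,
// using only two linear interpolations per sample instead of four reads.
__global__ void bspline1d(const float *coef, int n,
                          const float *xs, float *out, int m)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= m) return;

    float x = xs[t];
    float i = floorf(x), a = x - i;

    // Cubic B-spline basis weights at fractional position a.
    float w0 = (1.f/6.f) * (1.f-a)*(1.f-a)*(1.f-a);
    float w1 = (1.f/6.f) * (3.f*a*a*a - 6.f*a*a + 4.f);
    float w2 = (1.f/6.f) * (-3.f*a*a*a + 3.f*a*a + 3.f*a + 1.f);
    float w3 = (1.f/6.f) * (a*a*a);

    // Fold the four weights into two weighted linear fetches.
    float g0 = w0 + w1, g1 = w2 + w3;
    float h0 = i - 1.f + w1 / g0;
    float h1 = i + 1.f + w3 / g1;
    out[t] = g0 * lerp_fetch(coef, n, h0) + g1 * lerp_fetch(coef, n, h1);
}
```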
Representation stigma: Perceptions of tools and processes for design graphics
Directory of Open Access Journals (Sweden)
David Barbarash
2016-12-01
Full Text Available Practicing designers and design students across multiple fields were surveyed to measure preference and perception of traditional hand and digital tools to determine if common biases for an individual toolset are realized in practice. Significant results were found, primarily with age being a determinant in preference of graphic tools and processes; this finding demonstrates a hard line between generations of designers. Results show that while there are strong opinions in tools and processes, the realities of modern business practice and production gravitate towards digital methods despite a traditional tool preference in more experienced designers. While negative stigmas regarding computers remain, younger generations are more accepting of digital tools and images, which should eventually lead to a paradigm shift in design professions.
Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie
2010-09-13
In molecular imaging (MI), and especially optical molecular imaging, bioluminescence tomography (BLT) has emerged as an effective imaging modality for small animal imaging. Finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although multi-threaded CPU technology and multi-CPU technology have already been applied. In this paper, we introduce for the first time a new kind of acceleration technology for the AFE framework for BLT, using the graphics processing unit (GPU). Besides processing speed, GPU technology offers a good balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming NVIDIA GPUs; with their help, it is easy to code for an NVIDIA GPU without worrying about the details of the hardware environment of a specific card. Numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS- and CULA-based GPU acceleration. From the results of the experiments, we conclude that the proposed method can greatly improve the processing speed of the AFE framework while maintaining a balance between cost and performance.
Bayer image parallel decoding based on GPU
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In photoelectric tracking systems, Bayer images are traditionally decoded on the CPU. However, this is too slow when the images become large, for example, 2K×2K×16 bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's Graphics Processing Unit (GPU), which supports the CUDA architecture. The decoding procedure can be divided into three parts: a serial part, a task-parallel part, and a data-parallel part, the last including inverse quantization, the inverse discrete wavelet transform (IDWT), and image post-processing. To reduce the execution time, the task-parallel part is optimized with OpenMP techniques. The data-parallel part gains its efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced memory access optimization, and texture memory optimization. In particular, the IDWT can be significantly sped up by rewriting the two-dimensional (2D) serial IDWT as a one-dimensional (1D) parallel IDWT. In experiments with a 1K×1K×16 bit Bayer image, the data-parallel part runs more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.
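To indicate why a 1D IDWT parallelizes so well, here is a hedged CUDA sketch of one synthesis level for an orthonormal Haar wavelet, with one thread per output pair; the codec described presumably uses a different wavelet, and all names are illustrative.

```cuda
#include <cuda_runtime.h>

// One level of a 1D inverse Haar transform: approx and detail each hold
// `half` coefficients, out receives 2*half samples. Every output pair is
// independent, which is the data-parallel pattern exploited on the GPU.
__global__ void idwt_haar_1d(const float *approx, const float *detail,
                             float *out, int half)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= half) return;

    float a = approx[i], d = detail[i];
    out[2 * i]     = (a + d) * 0.70710678f;   // 1/sqrt(2): even sample
    out[2 * i + 1] = (a - d) * 0.70710678f;   // odd sample
}
```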
Integrating post-Newtonian equations on graphics processing units
Energy Technology Data Exchange (ETDEWEB)
Herrmann, Frank; Tiglio, Manuel [Department of Physics, Center for Fundamental Physics, and Center for Scientific Computation and Mathematical Modeling, University of Maryland, College Park, MD 20742 (United States); Silberholz, John [Center for Scientific Computation and Mathematical Modeling, University of Maryland, College Park, MD 20742 (United States); Bellone, Matias [Facultad de Matematica, Astronomia y Fisica, Universidad Nacional de Cordoba, Cordoba 5000 (Argentina); Guerberoff, Gustavo, E-mail: tiglio@umd.ed [Facultad de Ingenieria, Instituto de Matematica y Estadistica 'Prof. Ing. Rafael Laguardia', Universidad de la Republica, Montevideo (Uruguay)
2010-02-07
We report on early results of a numerical and statistical study of binary black hole inspirals. The two black holes are evolved using post-Newtonian approximations starting with initially randomly distributed spin vectors. We characterize certain aspects of the distribution shortly before merger. In particular we note the uniform distribution of black hole spin vector dot products shortly before merger and a high correlation between the initial and final black hole spin vector dot products in the equal-mass, maximally spinning case. More than 300 million simulations were performed on graphics processing units, and we demonstrate a speed-up of a factor 50 over a more conventional CPU implementation. (fast track communication)
A real-time spike sorting method based on the embedded GPU.
Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng
2017-07-01
Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. The graphics processing unit (GPU), with its high parallel computing capability, might provide an alternative solution for the increasing real-time computational demands of spike sorting. This study reports a method for real-time spike sorting through the compute unified device architecture (CUDA), implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on principal component analysis (PCA) and K-means. By analyzing the parallelism of each step, the method was further optimized within the thread and memory model of the GPU. Our results showed that the GPU-based classifier on the TK1 is 37.92 times faster than the MATLAB-based classifier on a PC, while their accuracies are identical. The high-performance computing features of the embedded GPU demonstrated in our studies suggest that embedded GPUs provide a promising platform for real-time neural signal processing.
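The assignment step of K-means is embarrassingly parallel over spikes, which is what makes the PCA-plus-K-means pipeline a good fit for an embedded GPU; a hedged CUDA sketch follows, with the feature count and all names as illustrative assumptions (the centroid update would be a separate reduction kernel).

```cuda
#include <cuda_runtime.h>
#include <float.h>

#define NPC 3   // number of PCA features per spike (illustrative)

// Each thread classifies one spike by the squared Euclidean distance of its
// principal-component scores to each of the K cluster centroids.
__global__ void assign_spikes(const float *scores,    // nspikes x NPC
                              const float *centroids, // K x NPC
                              int *label, int nspikes, int K)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nspikes) return;

    float best = FLT_MAX;
    int bestk = 0;
    for (int k = 0; k < K; ++k) {
        float d2 = 0.f;
        for (int c = 0; c < NPC; ++c) {
            float diff = scores[i * NPC + c] - centroids[k * NPC + c];
            d2 += diff * diff;
        }
        if (d2 < best) { best = d2; bestk = k; }
    }
    label[i] = bestk;
}
```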
Directory of Open Access Journals (Sweden)
Ludmila P. Bobrik; Leonid V. Markin
2013-01-01
Full Text Available The current state of engineering fundamentals training for technical university students and the place of the “Engineering graphics” course at MAI are analyzed in this paper. Bachelor degree problems and the experience of creating a graduating specialty based on the «Engineering graphics» department are also considered.
GPU-computing in econophysics and statistical physics
Preis, T.
2011-03-01
A recent trend in computer science and related fields is general-purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction to the field of GPU computing and includes examples. In particular, computationally expensive analyses employed in the financial market context are coded for a graphics card architecture, which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.
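The standard route for porting the Ising model to a GPU is a checkerboard decomposition: all spins of one parity can be updated simultaneously because their four neighbours lie on the opposite parity. A hedged CUDA sketch with pre-generated uniform random numbers (e.g. from cuRAND) and illustrative names:

```cuda
#include <cuda_runtime.h>

// Metropolis sweep over one checkerboard sublattice of an L x L Ising model
// with periodic boundaries; beta is the inverse temperature, J = 1.
__global__ void ising_sweep(int *spin, const float *rnd,
                            int L, float beta, int parity)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L || ((x + y) & 1) != parity) return;

    int idx = y * L + x;
    int xl = (x + L - 1) % L, xr = (x + 1) % L;
    int yu = (y + L - 1) % L, yd = (y + 1) % L;
    int nb = spin[y * L + xl] + spin[y * L + xr]
           + spin[yu * L + x] + spin[yd * L + x];

    float dE = 2.f * spin[idx] * nb;              // energy change if flipped
    if (dE <= 0.f || rnd[idx] < expf(-beta * dE))
        spin[idx] = -spin[idx];                   // Metropolis acceptance
}
```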
Real-Time GPU Implementation of Transverse Oscillation Vector Velocity Flow Imaging
DEFF Research Database (Denmark)
Bradway, David; Pihl, Michael Johannes; Krebs, Andreas
2014-01-01
Rapid estimation of blood velocity and visualization of complex flow patterns are important for clinical use of diagnostic ultrasound. This paper presents real-time processing for two-dimensional (2-D) vector flow imaging which utilizes an off-the-shelf graphics processing unit (GPU). In this work...... vector flow acquisition takes 2.3 milliseconds on an Advanced Micro Devices Radeon HD 7850 GPU card. The detected velocities are accurate to within the precision limit of the output format of the display routine. Because this tool was developed as a module external to the scanner’s built
SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU
Energy Technology Data Exchange (ETDEWEB)
Moriya, S; Sato, M [Komazawa University, Setagaya, Tokyo (Japan); Tachibana, H [National Cancer Center Hospital East, Kashiwa, Chiba (Japan)
2015-06-15
Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated accelerating the convolution calculation using an ideal kernel on Graphics Processing Units (GPUs). Methods: The calculation was performed on AMD Dual FirePro D700 graphics hardware, and our algorithm was implemented using Aparapi, which converts Java bytecode to OpenCL. The dose calculation process was separated into TERMA and KERMA steps, in which the dose deposited at the coordinate (x, y, z) was determined. In the dose calculation running on the central processing unit (CPU), an Intel Xeon E5, the calculation loops were performed over all calculation points. In the GPU computation, all of the calculation processes for the points were sent to the GPU and computed in multiple threads. In this study, the dose calculation was performed in a water-equivalent homogeneous phantom with 150³ voxels (2 mm calculation grid), and the calculation speed on the GPU was compared to that on the CPU, along with the accuracy of the PDD. Results: The calculation times for the GPU and the CPU were 3.3 sec and 4.4 hours, respectively; the GPU was thus 4800 times faster than the CPU. The PDD curve for the GPU matched that for the CPU perfectly. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable in time and may be more accurate in inhomogeneous regions. Intensity-modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, a coarser kernel spacing would be more practical if the calculation is faster while keeping accuracy similar to that of a current treatment planning system.
GPU-based prompt gamma ray imaging from boron neutron capture therapy
International Nuclear Information System (INIS)
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum
2015-01-01
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations
Critique and Process: Signature Pedagogies in the Graphic Design Classroom
Motley, Phillip
2017-01-01
Like many disciplines in design and the visual fine arts, critique is a signature pedagogy in the graphic design classroom. It serves as both a formative and summative assessment while also giving students the opportunity to practice the habits of graphic design. Critiques help students become keen observers of relevant disciplinary criteria;…
Parallel Block Structured Adaptive Mesh Refinement on Graphics Processing Units
Energy Technology Data Exchange (ETDEWEB)
Beckingsale, D. A. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom); Gaudin, W. P. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom); Hornung, R. D. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Gunney, B. T. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Gamblin, T. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Herdman, J. A. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom); Jarvis, S. A. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom)
2014-11-17
Block-structured adaptive mesh refinement is a technique that can be used when solving partial differential equations to reduce the number of zones necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a native GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an eight-node cluster, and over four thousand nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and has been scaled to over four thousand GPUs using a combination of MPI and CUDA.
Energy Technology Data Exchange (ETDEWEB)
Ha, Woo Seok; Kim, Soo Mee; Park, Min Jae; Lee, Dong Soo; Lee, Jae Sung [Seoul National University, Seoul (Korea, Republic of)
2009-10-15
The maximum likelihood-expectation maximization (ML-EM) algorithm is the statistical reconstruction algorithm derived from a probabilistic model of the emission and detection processes. Although ML-EM has many advantages in accuracy and utility, its use is limited by the computational burden of iterative processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on the GPU (graphics processing unit) for the ML-EM algorithm. Using a GeForce 9800 GTX+ graphics card and CUDA (compute unified device architecture), the projection and backprojection in the ML-EM algorithm were parallelized with NVIDIA's technology. The time delays of the computations for projection, for the errors between measured and estimated data, and for backprojection in an iteration were measured. The total time included the latency of data transmission between RAM and GPU memory. The total computation times of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 sec, respectively; in this case, the computing speed was improved about 15 times on the GPU. When the number of iterations increased to 1024, the CPU- and GPU-based computing took 18 min and 8 sec in total, respectively. The improvement was about 135 times, caused by delays in the CPU-based computing after a certain number of iterations. On the other hand, the GPU-based computation showed very small variation in the time delay per iteration due to the use of shared memory. The GPU-based parallel computation for ML-EM significantly improved the computing speed and stability. The developed GPU-based ML-EM algorithm could easily be modified for some other imaging geometries
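For reference, once separate projection and backprojection kernels have produced the ratio backprojection and the sensitivity image, the ML-EM image update itself is trivially parallel over voxels, as in this hedged sketch with illustrative names:

```cuda
#include <cuda_runtime.h>

// Multiplicative ML-EM update. backproj[j] = sum_i a_ij * (y_i / ybar_i)
// and sens[j] = sum_i a_ij are computed by the (back)projection kernels.
__global__ void mlem_update(float *image, const float *backproj,
                            const float *sens, int nvox)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < nvox && sens[j] > 0.f)
        image[j] *= backproj[j] / sens[j];
}
```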
Directory of Open Access Journals (Sweden)
Christley Scott
2010-08-01
Full Text Available Abstract Background Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPUs) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU, including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive in avoiding common pitfalls as well as in extracting performance from the GPU architecture. Conclusions We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a
Yarovyi, Andrii A.; Timchenko, Leonid I.; Kozhemiako, Volodymyr P.; Kokriatskaia, Nataliya I.; Hamdi, Rami R.; Savchuk, Tamara O.; Kulyk, Oleksandr O.; Surtel, Wojciech; Amirgaliyev, Yedilkhan; Kashaganova, Gulzhan
2017-08-01
The paper addresses the insufficient performance of existing computing hardware for large-image processing, which does not meet the modern requirements posed by the resource-intensive computing tasks of laser beam profiling. The research concentrated on one of the profiling problems, namely, real-time processing of spot images of the laser beam profile. The development of a theory of parallel-hierarchical transformation made it possible to produce models of high-performance parallel-hierarchical processes, as well as algorithms and software for their implementation based on a GPU-oriented architecture using GPGPU technologies. The analysis of the performance of the suggested computerized tools for processing and classification of laser beam profile images shows that real-time processing of dynamic images of various sizes is achievable.
Development of the spent fuel disassembling process by utilizing the 3D graphic design technology
International Nuclear Information System (INIS)
Song, T. K.; Lee, J. Y.; Kim, S. H.; Yun, J. S.
2001-01-01
For developing the spent fuel disassembling process, a 3D graphic simulation has been established by utilizing the 3D graphic design technology that is widely used in industry. The spent fuel disassembling process consists of a downender, a rod extraction device, a rod cutting device, a pellet extracting device and a skeleton compaction device. In this study, a 3D graphical design model of these devices is implemented from the conceptual design, and a virtual workcell with kinematics is established to simulate the motion of each device. By implementing this graphic simulation, all the unit processes involved in the spent fuel disassembling process are analyzed and optimized. The 3D graphical model and the 3D graphic simulation can be effectively used for designing the process equipment, as well as the optimized process and maintenance procedures
Optimizing a mobile robot control system using GPU acceleration
Tuck, Nat; McGuinness, Michael; Martin, Fred
2012-01-01
This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
International Nuclear Information System (INIS)
Gong Chunye; Liu Jie; Chi Lihua; Huang Haowei; Fang Jingyue; Gong Zhenghu
2011-01-01
Graphics Processing Units (GPUs), originally developed for real-time, high-definition 3D graphics in computer games, now provide great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution of the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU-accelerated simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU
Directory of Open Access Journals (Sweden)
Arefan D
2015-06-01
Full Text Available Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If the image reconstruction time is on the order of seconds, Tomosynthesis systems can be used to perform Tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for Tomosynthesis mammography systems using the Graphics Processing Unit (GPU). First, projections of Tomosynthesis mammography were simulated: a 3D breast phantom was designed from empirical data (based on MRI data in its natural form), and projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the Graphics Processing Unit (GPU), and the image reconstruction times of the two implementations were compared.
International Nuclear Information System (INIS)
Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T
2011-01-01
We implemented the simplified Monte Carlo (SMC) method on a graphics processing unit (GPU) architecture under the compute unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied to four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30–16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning. (note)
Interior Point Methods on GPU with application to Model Predictive Control
DEFF Research Database (Denmark)
Gade-Nielsen, Nicolai Fog
The goal of this thesis is to investigate the application of interior point methods to solve dynamical optimization problems, using a graphics processing unit (GPU), with a focus on problems arising in Model Predictive Control (MPC). Multi-core processors have been available for over ten years now...... software package called GPUOPT, available under the non-restrictive MIT license. GPUOPT includes a primal-dual interior-point method, which supports both the CPU and the GPU. It is implemented as multiple components, where the matrix operations and the solver for the Newton directions are separated
Energy Technology Data Exchange (ETDEWEB)
Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL
2015-01-01
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
Use of general purpose graphics processing units with MODFLOW
Hughes, Joseph D.; White, Jeremy T.
2013-01-01
To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the CPU and GPGPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
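For reference, the core of such a GPGPU solver is a sparse matrix-vector product over the compressed sparse row (CSR) storage plus, in the simplest case, a diagonal (Jacobi) preconditioner application. The kernels below are a generic sketch of those two building blocks, not the UPCG code itself.

    // y = A*x for a matrix in CSR format (rowPtr, colIdx, val):
    // one thread accumulates one row of the product.
    __global__ void spmv_csr(int nRows, const int *rowPtr, const int *colIdx,
                             const double *val, const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;
        double sum = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k] * x[colIdx[k]];
        y[row] = sum;
    }

    // z = M^{-1} r for the Jacobi preconditioner, where M = diag(A).
    __global__ void jacobi_apply(int nRows, const double *diag,
                                 const double *r, double *z)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nRows) z[i] = r[i] / diag[i];
    }

Inside a conjugate gradient iteration these two kernels, together with dot products and axpy updates, account for nearly all of the floating-point work, which is why keeping the vectors resident on the GPGPU between iterations matters so much.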
DEFF Research Database (Denmark)
Bailey, Nicholas; Ingebrigtsen, Trond; Hansen, Jesper Schmidt
2017-01-01
RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPUs). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up...
A low-cost system for graphical process monitoring with colour video symbol display units
International Nuclear Information System (INIS)
Grauer, H.; Jarsch, V.; Mueller, W.
1977-01-01
A system for computer-controlled graphical process supervision, using colour symbol video displays, is described. It has the following characteristics: a compact unit with no external memory for image storage; a simple, problem-oriented descriptive interface to the process program; no restrictions on the graphical representation of process variables; and independence from the particular computer and display, achieved by the implementation of colours and parameterized code generation for the display. (WB) [de]
Fast DRR splat rendering using common consumer graphics hardware
International Nuclear Information System (INIS)
Spoerk, Jakob; Bergmann, Helmar; Wanschitz, Felix; Dong, Shuo; Birkfellner, Wolfgang
2007-01-01
Digitally rendered radiographs (DRR) are a vital part of various medical image processing applications such as 2D/3D registration for patient pose determination in image-guided radiotherapy procedures. This paper presents a technique to accelerate DRR creation by using conventional graphics hardware for the rendering process. DRR computation itself is done by an efficient volume rendering method named wobbled splatting. For programming the graphics hardware, NVIDIA's C for Graphics (Cg) is used. The description of an algorithm used for rendering DRRs on the graphics hardware is presented, together with a benchmark comparing this technique to a CPU-based wobbled splatting program. Results show a reduction of rendering time by about 70%-90% depending on the amount of data. For instance, rendering a volume of 2×10⁶ voxels is feasible at an update rate of 38 Hz compared to 6 Hz on a common Intel-based PC using the graphics processing unit (GPU) of a conventional graphics adapter. In addition, wobbled splatting using graphics hardware for DRR computation provides higher resolution DRRs with comparable image quality due to special processing characteristics of the GPU. We conclude that DRR generation on common graphics hardware using the freely available Cg environment is a major step toward 2D/3D registration in clinical routine.
Zhao, Zi-Fang; Li, Xue-Zhu; Wan, You
2017-12-01
The local field potential (LFP) is a signal reflecting the electrical activity of neurons surrounding the electrode tip. Synchronization between LFP signals provides important details about how neural networks are organized. Synchronization between two distant brain regions is hard to detect using linear synchronization algorithms like correlation and coherence. Synchronization likelihood (SL) is a non-linear synchronization-detecting algorithm widely used in studies of neural signals from two distant brain areas. One drawback of non-linear algorithms is the heavy computational burden. In the present study, we proposed a graphic processing unit (GPU)-accelerated implementation of an SL algorithm with optional 2-dimensional time-shifting. We tested the algorithm with both artificial data and raw LFP data. The results showed that this method revealed detailed information from original data with the synchronization values of two temporal axes, delay time and onset time, and thus can be used to reconstruct the temporal structure of a neural network. Our results suggest that this GPU-accelerated method can be extended to other algorithms for processing time-series signals (like EEG and fMRI) using similar recording techniques.
Using GPU to calculate electron dose for hybrid pencil beam model
International Nuclear Information System (INIS)
Guo Chengjun; Li Xia; Hou Qing; Wu Zhangwen
2011-01-01
The hybrid pencil beam model (HPBM) offers an efficient approach to calculate the three-dimensional dose distribution from a clinical electron beam. Still, clinical radiation treatment calls for an even faster treatment planning process. This work presents a fast implementation of HPBM-based electron dose calculation using a graphics processing unit (GPU). The HPBM algorithm was implemented in the compute unified device architecture (CUDA) running on the GPU and in C running on the CPU, respectively. Several tests with various field, beamlet, and voxel sizes were used to evaluate our implementation. On an NVIDIA GeForce GTX470 GPU card, we achieved speedup factors of 2.18-98.23 with acceptable accuracy, compared with the results from a Pentium E5500 2.80 GHz dual-core CPU. (authors)
Fully 3-D list-mode positron emission tomography image reconstruction on a multi-GPU cluster
Energy Technology Data Exchange (ETDEWEB)
Cui, Jingyu [Stanford Univ., CA (United States). Dept. of Electrical Engineering; Prevrhal, Sven; Shao, Lingxiong [Philips Healthcare, San Jose, CA (United States); Pratx, Guillem [Stanford Univ., CA (United States). Dept. of Radiation Oncology; Levin, Craig S. [Stanford Univ., CA (United States). Dept. of Radiology, Electrical Engineering, and Physics; Stanford Univ., CA (United States). Molecular Imaging Program at Stanford (MIPS); Stanford Univ., CA (United States). School of Medicine
2011-07-01
List-mode processing is an efficient way of dealing with the sparse nature of PET data sets, and is the processing method of choice for time-of-flight (ToF) PET. We present a novel method of computing line projection operations required for list-mode ordered subsets expectation maximization (OSEM) for fully 3-D PET image reconstruction on a graphics processing unit (GPU) using the compute unified device architecture (CUDA) framework. Our method overcomes challenges such as compute thread divergence, and exploits GPU capabilities such as shared memory and atomic operations. When applied to line projection operations for list-mode time-of-flight PET, this new GPU-CUDA reformulation is 188X faster than a single-threaded reference CPU implementation. When embedded in a multi-process environment on a GPU-equipped small cluster, a speedup of 4X was observed over the same configuration but without GPU support. Image quality is preserved with root mean squared (RMS) deviation of 0.05% between CPU and GPU-generated images, which has negligible effect in typical clinical applications. (orig.)
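The atomic operations highlighted in the abstract are what make scatter-style back-projection of list-mode events safe on a GPU. The sketch below, with one thread per coincidence event and a naive uniform ray-marching of the line of response, is a generic illustration under assumed geometry, not the paper's CUDA reformulation.

    // One thread back-projects one list-mode event along its line of
    // response (LOR) from endpoint a to endpoint b, scoring into the
    // image with atomic adds to resolve concurrent writes.
    __global__ void backproject(const float4 *p0, const float4 *p1,
                                const float *eventVal, int nEvents,
                                float *img, int nx, int ny, int nz,
                                float voxSize, int nSteps)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nEvents) return;
        float4 a = p0[e], b = p1[e];
        for (int s = 0; s < nSteps; ++s) {
            float t = (s + 0.5f) / nSteps;          // sample along the LOR
            int ix = (int)((a.x + t * (b.x - a.x)) / voxSize);
            int iy = (int)((a.y + t * (b.y - a.y)) / voxSize);
            int iz = (int)((a.z + t * (b.z - a.z)) / voxSize);
            if (ix < 0 || iy < 0 || iz < 0 || ix >= nx || iy >= ny || iz >= nz)
                continue;
            atomicAdd(&img[(iz * ny + iy) * nx + ix], eventVal[e]);
        }
    }

Because neighbouring events map to arbitrary voxels, threads in a warp take divergent paths and collide on popular voxels; mitigating exactly these effects (thread divergence, shared-memory staging) is what the paper's reformulation addresses.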
The gputools package enables GPU computing in R.
Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan
2010-01-01
By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from NVIDIA, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an NVIDIA GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools. More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu.
GPU accelerated FDTD solver and its application in MRI.
Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S
2010-01-01
The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B1 profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of the electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
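FDTD maps naturally onto the GPU because every grid cell is updated independently within each half-step. A minimal one-dimensional Yee-grid sketch is shown below, in normalized units with an assumed Courant number of 0.5; a solver like the one described would apply the same leapfrog pattern on a 3D grid with realistic material coefficients.

    // One FDTD time step on a 1D Yee grid: each thread updates one cell.
    __global__ void updateH(float *hy, const float *ez, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n - 1) hy[i] += 0.5f * (ez[i + 1] - ez[i]);
    }

    __global__ void updateE(float *ez, const float *hy, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n) ez[i] += 0.5f * (hy[i] - hy[i - 1]);
    }

    // Host loop: alternate the two kernels, one pair per time step.
    // for (int t = 0; t < nSteps; ++t) {
    //     updateH<<<blocks, threads>>>(dHy, dEz, n);
    //     updateE<<<blocks, threads>>>(dEz, dHy, n);
    // }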
Watanabe, Yuuki; Maeno, Seiya; Aoshima, Kenji; Hasegawa, Haruyuki; Koseki, Hitoshi
2010-09-01
The real-time display of full-range, 2048 axial pixel × 1024 lateral pixel, Fourier-domain optical-coherence tomography (FD-OCT) images is demonstrated. The required speed was achieved by using dual graphics processing units (GPUs) with many stream processors to realize highly parallel processing. We used a zero-filling technique, including a forward Fourier transform, zero padding to increase the axial data-array size to 8192, an inverse Fourier transform back to the spectral domain, a linear interpolation from wavelength to wavenumber, a lateral Hilbert transform to obtain the complex spectrum, a Fourier transform to obtain the axial profiles, and a log scaling. The data-transfer time of the frame grabber was 15.73 ms, and the processing time, which includes the data transfer between the GPU memory and the host computer, was 14.75 ms, for a total time shorter than the 36.70 ms frame-interval time using a line-scan CCD camera operated at 27.9 kHz. That is, our OCT system achieved a processed-image display rate of 27.23 frames/s.
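The dominant operations in this pipeline are batched 1D FFTs across all lateral lines, which map directly onto cuFFT. The fragment below sketches only the zero-padded forward-transform step, with the 8192-point axial size and 1024 lines taken from the abstract and everything else assumed.

    #include <cufft.h>

    // Batched in-place 1D complex-to-complex FFT over all lateral A-lines.
    // nAxial = 8192 after zero padding; nLines = 1024 (per the abstract).
    void fftAllLines(cufftComplex *dData, int nAxial, int nLines)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, nAxial, CUFFT_C2C, nLines);    // one plan, many lines
        cufftExecC2C(plan, dData, dData, CUFFT_FORWARD);  // in-place transform
        cufftDestroy(plan);
    }

In a real-time system the plan would be created once and reused every frame, and the interpolation, Hilbert-transform, and log-scaling stages would be fused into a few custom kernels between the FFT calls.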
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Directory of Open Access Journals (Sweden)
Chun-Liang Lee
Full Text Available The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphics processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Energy Technology Data Exchange (ETDEWEB)
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Compact multimode fiber beam-shaping system based on GPU accelerated digital holography.
Plöschner, Martin; Čižmár, Tomáš
2015-01-15
Real-time, on-demand, beam shaping at the end of the multimode fiber has recently been made possible by exploiting the computational power of rapidly evolving graphics processing unit (GPU) technology [Opt. Express 22, 2933 (2014)]. However, the current state-of-the-art system requires the presence of an acousto-optic deflector (AOD) to produce images at the end of the fiber without interference effects between neighboring output points. Here, we present a system free from the AOD complexity where we achieve the removal of the undesired interference effects computationally using GPU implemented Gerchberg-Saxton and Yang-Gu algorithms. The GPU implementation is two orders of magnitude faster than the CPU implementation which allows video-rate image control at the distal end of the fiber virtually free of interference effects.
Use of GPU cards to accelerate signal processing
Amat Sanz, David
2017-01-01
This Bachelor's Degree Final Project aims to analyze and implement another way to process digital signals, improving their performance and speed of execution. DSP and FPGA are the most commonly used elements for any kind of signal processing. This project focuses on the use of graphics cards (GPU) to exploit to the maximum the parallelism that is available today. Current processors (CPUs) have a few cores and work sequentially which can be very time consuming if large amounts of data are bein...
2012-02-17
to be solved. Disclaimer: Reference herein to any specific commercial company, product, process, or service by trade name, trademark... data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel... GPU implementations were run on an Intel Nehalem Xeon E5520 2.26 GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of
GPU-accelerated adjoint algorithmic differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
Heterogeneous GPU&CPU Cluster for High Performance Computing in Cryptography
Directory of Open Access Journals (Sweden)
Michał Marks
2012-01-01
Full Text Available This paper addresses issues associated with distributed computing systems and the application of mixed GPU&CPU technology to data encryption and decryption algorithms. We describe a heterogeneous cluster HGCC formed by two types of nodes: an Intel processor with an NVIDIA graphics processing unit and an AMD processor with an AMD graphics processing unit (formerly ATI), and a novel software framework that hides the heterogeneity of our cluster and provides tools for solving complex scientific and engineering problems. Finally, we present the results of numerical experiments. The considered case study is concerned with parallel implementations of selected cryptanalysis algorithms. The main goal of the paper is to show the wide applicability of the GPU&CPU technology to large-scale computation and data processing.
TU-FG-BRB-07: GPU-Based Prompt Gamma Ray Imaging From Boron Neutron Capture Therapy
Energy Technology Data Exchange (ETDEWEB)
Kim, S; Suh, T; Yoon, D; Jung, J; Shin, H; Kim, M [The Catholic University of Korea, Seoul (Korea, Republic of)]
2016-06-15
Purpose: The purpose of this research is to perform fast reconstruction of a prompt gamma ray image using graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: Image reconstruction using the GPU was 196 times faster than the conventional reconstruction using the CPU. For the four BURs, the area under the curve values from the ROC analysis were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusion: The tomographic image using the prompt gamma ray events from the BNCT simulation was acquired using GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of prompt gamma ray reconstruction using GPU computation for BNCT simulations.
Accelerating the XGBoost algorithm using GPU computing
Directory of Open Access Journals (Sweden)
Rory Mitchell
2017-07-01
Full Text Available We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU) and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix-sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4-core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.
3D data processing with advanced computer graphics tools
Zhang, Song; Ekstrand, Laura; Grieve, Taylor; Eisenmann, David J.; Chumbley, L. Scott
2012-09-01
Often, the 3-D raw data from an optical profilometer contain spiky noise and an irregular grid, which makes the data difficult to analyze and difficult to store because of its enormously large size. This paper addresses these two issues by substantially reducing the spiky noise of the 3-D raw data from an optical profilometer, and by rapidly re-sampling the raw data onto regular grids at any pixel size and any orientation with advanced computer graphics tools. Experimental results are presented to demonstrate the effectiveness of the proposed approach.
GPU-based cone beam computed tomography.
Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian
2010-06-01
The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256³ takes up to 25 min on a standard system. Recent developments in the area of graphics processing units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on an NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256³ is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512³ volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
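A voxel-driven backprojection kernel is the operation that dominates such a CUDA reconstruction. The sketch below processes one filtered projection at gantry angle beta under an idealized circular cone-beam geometry (flat detector at the rotation axis, detector pitch equal to voxel pitch) and omits the filtering and exact FDK weighting; it is a generic illustration, not the paper's implementation.

    // Backproject one filtered projection into the volume: one thread
    // per voxel. sad is the source-axis distance; sinB/cosB encode the
    // gantry angle. Geometry simplifications are assumptions for brevity.
    __global__ void backproj(float *vol, const float *proj,
                             int n, int nu, int nv,
                             float sad, float sinB, float cosB)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        int iz = blockIdx.z;
        if (ix >= n || iy >= n || iz >= n) return;
        float x = ix - n / 2.0f, y = iy - n / 2.0f, z = iz - n / 2.0f;
        float s = x * cosB + y * sinB;          // along source-detector axis
        float t = -x * sinB + y * cosB;         // transverse coordinate
        float mag = sad / (sad - s);            // cone-beam magnification
        int u = (int)(t * mag + nu / 2.0f);     // detector column
        int v = (int)(z * mag + nv / 2.0f);     // detector row
        if (u < 0 || v < 0 || u >= nu || v >= nv) return;
        vol[(iz * n + iy) * n + ix] += mag * mag * proj[v * nu + u];
    }

Looping this kernel over all projection angles yields the reconstructed volume; binding the projection to a texture for hardware bilinear interpolation is the usual next optimization.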
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU
Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha
2018-03-01
Since the graphics processing unit (GPU) has strong floating-point computation capability and memory bandwidth for data parallelism, it has been widely used in areas of general computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of the compute unified device architecture (CUDA), which reduces the complexity of programming, brings great opportunities to CFD. There are three different modes for parallel solution of the NS equations: a parallel solver based on the CPU, a parallel solver based on the GPU, and a heterogeneous parallel solver based on collaborating CPU and GPU. GPUs are relatively rich in compute capacity but poor in memory capacity, and CPUs are the opposite. To make full use of both the GPUs and CPUs, a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver's computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experimental results, which demonstrates that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow; it decreases for turbulent flow, but still reaches more than 20. What's more, the speedup increases as the grid size becomes larger.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.
Mei, Gang; Xu, Nengxiong; Xu, Liangliang
2016-01-01
This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on the modern graphics processing unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm, obtained by adopting a fast k-nearest neighbors (kNN) search. In AIDW, several nearest neighboring data points must be found for each interpolated point in order to adaptively determine the power parameter; the desired prediction value of the interpolated point is then obtained by weighted interpolation using that power parameter. In this work, we develop a fast kNN search approach based on a space-partitioning data structure, the even grid, to improve our previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of a kNN search stage and a weighted interpolation stage. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
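For context, the weighted-interpolation stage amounts to a kernel like the one below: one thread per interpolated point, inverse-distance weights raised to an adaptive power. The precomputed neighbor list and per-point power array stand in for the paper's grid-based kNN search and are assumptions of this sketch.

    // One thread computes the IDW prediction for one interpolated point,
    // using a fixed neighbor list produced by an earlier kNN stage. The
    // adaptive power p[i] is assumed precomputed per point.
    __global__ void idw(const float2 *pts, const float *vals,
                        const int *nbr, int k, const float2 *query,
                        const float *p, float *out, int nQuery)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nQuery) return;
        float num = 0.0f, den = 0.0f;
        for (int j = 0; j < k; ++j) {
            int n = nbr[i * k + j];
            float dx = pts[n].x - query[i].x;
            float dy = pts[n].y - query[i].y;
            float w = 1.0f / powf(sqrtf(dx * dx + dy * dy) + 1e-12f, p[i]);
            num += w * vals[n];                 // weighted sum of samples
            den += w;                           // normalization
        }
        out[i] = num / den;
    }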
Directory of Open Access Journals (Sweden)
Paul Richmond
2011-05-01
Full Text Available High performance computing on the graphics processing unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task, and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In the absence of recurrent connections, where all neurons "vote" independently ("democratic") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
International Nuclear Information System (INIS)
Ammazzalorso, F; Jelen, U; Bednarz, T
2014-01-01
We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVIDIA GeForce 8800 GTX and in ~2 ms using an NVIDIA GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low-cost PC hardware.
Commercial Off-The-Shelf (COTS) Graphics Processing Board (GPB) Radiation Test Evaluation Report
Salazar, George A.; Steele, Glen F.
2013-01-01
Large round-trip communications latency for deep space missions will require more onboard computational capabilities to enable the space vehicle to undertake many tasks that have traditionally been ground-based, mission control responsibilities. As a result, visual display graphics will be required to provide simpler vehicle situational awareness through graphical representations, as well as provide capabilities never before done in a space mission, such as augmented reality for in-flight maintenance or telepresence activities. These capabilities will require graphics processors and associated support electronic components for high computational graphics processing. In an effort to understand the performance of commercial graphics card electronics operating in the expected radiation environment, a preliminary test was performed on five commercial off-the-shelf (COTS) graphics cards. This paper discusses the preliminary evaluation test results of five COTS graphics processing cards tested to the International Space Station (ISS) low earth orbit radiation environment. Three of the five graphics cards were tested to a total dose of 6000 rads (Si). The test articles, test configuration, preliminary results, and recommendations are discussed.
Energy Technology Data Exchange (ETDEWEB)
Wu, Meng; Fessler, Jeffrey A. [Michigan Univ., Ann Arbor, MI (United States). Dept. of Electrical Engineering and Computer Science
2011-07-01
Iterative 3D image reconstruction methods can improve image quality over conventional filtered back projection (FBP) in X-ray computed tomography. However, high computational costs deter the routine use of iterative reconstruction clinically. The separable footprint method for forward and back-projection simplifies the integrals over a detector cell in a way that is quite accurate and also has a relatively efficient CPU implementation. In this project, we implemented the separable footprint method for both forward and backward projection on a graphics processing unit (GPU) with NVIDIA's parallel computing architecture (CUDA). This paper describes our GPU kernels for the separable footprint method and simulation results. (orig.)
MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure
Kim, Jihoon; Levy, Eric; Ferbrache, Alex; Stepanowsky, Petra; Farcas, Claudiu; Wang, Shuang; Brunner, Stefan; Bath, Tyler; Wu, Yuan; Ohno-Machado, Lucila
2014-01-01
Summary: MAGI is a web service for fast MicroRNA-Seq data analysis in a graphics processing unit (GPU) infrastructure. Using just a browser, users have access to results as web reports in just a few hours, a >600% end-to-end performance improvement over the state of the art. MAGI's salient features are (i) transfer of large input files in native FASTA with Qualities (FASTQ) format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU ...
LDPC Decoding on GPU for Mobile Device
Directory of Open Access Journals (Sweden)
Yiqin Lu
2016-01-01
Full Text Available A flexible software LDPC decoder that exploits data parallelism for the simultaneous decoding of multiple codewords on a mobile device is proposed in this paper, supported by multithreading on OpenCL-based graphics processing units. By dividing the check matrix into several parts to make full use of both the local memory and private memory on the GPU, and by properly adjusting the code capacity each time, our implementation on a mobile phone achieves throughputs above 100 Mbps with a decoding delay of less than 1.6 ms, which makes high-speed communication such as video calling possible. With efficient software LDPC decoding on the mobile device, the LDPC decoding feature on the communication baseband chip can be replaced, saving cost and making it easier to upgrade the decoder to be compatible with a variety of channel access schemes.
Scheinost, Dustin; Hampson, Michelle; Qiu, Maolin; Bhawnani, Jitendra; Constable, R Todd; Papademetris, Xenophon
2013-07-01
Real-time functional magnetic resonance imaging (rt-fMRI) has recently gained interest as a possible means to facilitate the learning of certain behaviors. However, rt-fMRI is limited by processing speed and available software, and continued development is needed for rt-fMRI to progress further and become feasible for clinical use. In this work, we present an open-source rt-fMRI system for biofeedback powered by a novel Graphics Processing Unit (GPU) accelerated motion correction strategy as part of the BioImage Suite project (www.bioimagesuite.org). Our system contributes to the development of rt-fMRI by presenting a motion correction algorithm that provides an estimate of motion with essentially no processing delay as well as a modular rt-fMRI system design. Using empirical data from rt-fMRI scans, we assessed the quality of motion correction in this new system. The present algorithm performed comparably to standard (non real-time) offline methods and outperformed other real-time methods based on zero order interpolation of motion parameters. The modular approach to the rt-fMRI system allows the system to be flexible to the experiment and feedback design, a valuable feature for many applications. We illustrate the flexibility of the system by describing several of our ongoing studies. Our hope is that continuing development of open-source rt-fMRI algorithms and software will make this new technology more accessible and adaptable, and will thereby accelerate its application in the clinical and cognitive neurosciences.
A GPU code for analytic continuation through a sampling method
Directory of Open Access Journals (Sweden)
Johan Nordström
2016-01-01
Full Text Available We here present a code for performing analytic continuation of fermionic Green's functions and self-energies as well as bosonic susceptibilities on a graphics processing unit (GPU). The code is based on the sampling method introduced by Mishchenko et al. (2000), and is written for the widely used CUDA platform from NVIDIA. Detailed scaling tests are presented, for two different GPUs, in order to highlight the advantages of this code with respect to standard CPU computations. Finally, as an example of possible applications, we provide the analytic continuation of model Gaussian functions, as well as more realistic test cases from many-body physics.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
International Nuclear Information System (INIS)
Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B.; Peng, Fei
2015-01-01
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is
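The fast inter-GPU transfer scheme described above rests on CUDA peer-to-peer access. Below is a generic sketch of enabling it and copying a buffer, such as beamlet prices, directly between two devices; the device numbering and fallback path are illustrative assumptions, not the paper's code.

    #include <cuda_runtime.h>

    // Enable peer-to-peer (P2P) access between two GPUs and copy a buffer
    // directly from device 1 to device 0, bypassing host memory.
    void p2pCopy(float *dst0, const float *src1, size_t bytes)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 reach GPU 1?
        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);       // flags must be 0
            cudaMemcpyPeer(dst0, 0, src1, 1, bytes);
        } else {
            // Fall back to a staged copy through pinned host memory (omitted).
        }
    }

On systems where the GPUs share a PCIe root complex (or are linked by NVLink), such peer copies avoid the double transit through host memory that otherwise dominates multi-GPU communication time.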
Qin, Nan; Shen, Chenyang; Tsai, Min-Yu; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B; Parodi, Katia; Jia, Xun
2018-01-01
One of the major benefits of carbon ion therapy is enhanced biological effectiveness at the Bragg peak region. For intensity modulated carbon ion therapy (IMCT), it is desirable to use Monte Carlo (MC) methods to compute the properties of each pencil beam spot for treatment planning, because of their accuracy in modeling physics processes and estimating biological effects. We previously developed goCMC, a graphics processing unit (GPU)-oriented MC engine for carbon ion therapy. The purpose of the present study was to build a biological treatment plan optimization system using goCMC. The repair-misrepair-fixation model was implemented to compute the spatial distribution of linear-quadratic model parameters for each spot. A treatment plan optimization module was developed to minimize the difference between the prescribed and actual biological effect. We used a gradient-based algorithm to solve the optimization problem. The system was embedded in the Varian Eclipse treatment planning system under a client-server architecture to achieve a user-friendly planning environment. We tested the system with a 1-dimensional homogeneous water case and three 3-dimensional patient cases. Our system generated treatment plans with biological spread-out Bragg peaks covering the targeted regions and sparing critical structures. Using 4 NVIDIA GTX 1080 GPUs, the total computation time, including spot simulation, optimization, and final dose calculation, was 0.6 hour for the prostate case (8282 spots), 0.2 hour for the pancreas case (3795 spots), and 0.3 hour for the brain case (6724 spots). The computation time was dominated by MC spot simulation. We built a biological treatment plan optimization system for IMCT that performs simulations using a fast MC engine, goCMC. To the best of our knowledge, this is the first time that full MC-based IMCT inverse planning has been achieved in a clinically viable time frame. Copyright © 2017 Elsevier Inc. All rights reserved.
Chun, Poo-Reum; Lee, Se-Ah; Yook, Yeong-Geun; Choi, Kwang-Sung; Cho, Deog-Geun; Yu, Dong-Hun; Chang, Won-Seok; Kwon, Deuk-Chul; Im, Yeon-Ho
2013-09-01
Although plasma etch profile simulation has attracted much interest for developing reliable plasma etching, big gaps still exist between the current research status and predictive modeling due to the inherent complexity of the plasma process. As an effort to address this issue, we present a 3D feature profile simulation coupled with a well-defined plasma-surface kinetic model for the silicon dioxide etching process under fluorocarbon plasmas. To capture realistic plasma surface reaction behaviors, a polymer-layer-based surface kinetic model was proposed to account for simultaneous polymer deposition and oxide etching. The realistic plasma surface model was then used to calculate the speed function for 3D topology simulation, which consists of a multiple-level-set-based moving algorithm and a ballistic transport module. In addition, the time-consuming computations in the ballistic transport calculation were improved drastically by GPU-based numerical computation, enabling real-time computation. Finally, we demonstrated that the surface kinetic model could be coupled successfully for 3D etch profile simulations in high-aspect-ratio contact hole plasma etching.
The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration
Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.
2017-03-01
In order to satisfy the research needs of the Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and study the key technology of the planned QiTai radio Telescope (QTT), the receiver group of XAO developed a GPU (Graphics Processing Unit) based real-time FRB search algorithm from the original CPU (Central Processing Unit) based FRB search algorithm, and built an FRB real-time search system. A comparison of the GPU system and the CPU system shows that, while preserving the accuracy of the search, the GPU-accelerated algorithm is 35-45 times faster than the CPU algorithm.
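At the core of such an FRB search is incoherent dedispersion: summing frequency channels with a dispersion-measure-dependent time shift. The brute-force kernel below (one thread per output sample per trial DM) is a generic sketch; the per-channel delay table is assumed to be precomputed on the host from the standard cold-plasma dispersion relation, and the data layout is an assumption of this example.

    // Brute-force incoherent dedispersion: out[dm][t] is the sum of all
    // frequency channels, each shifted by its precomputed delay (in
    // samples) for that trial dispersion measure.
    __global__ void dedisperse(const float *data,   // [nChan][nSamp]
                               const int *delay,    // [nDM][nChan]
                               float *out,          // [nDM][nSamp]
                               int nChan, int nSamp, int nDM)
    {
        int t  = blockIdx.x * blockDim.x + threadIdx.x;  // output sample
        int dm = blockIdx.y;                             // trial DM index
        if (t >= nSamp || dm >= nDM) return;
        float sum = 0.0f;
        for (int c = 0; c < nChan; ++c) {
            int ts = t + delay[dm * nChan + c];          // shifted sample
            if (ts < nSamp) sum += data[c * nSamp + ts];
        }
        out[dm * nSamp + t] = sum;
    }

Because every (sample, trial-DM) pair is independent, the search parallelizes trivially across thousands of GPU threads, which is the main source of the reported speedup; production codes additionally exploit tree dedispersion or shared-memory tiling to reduce redundant channel reads.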
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted graphics cards with graphics processing units (GPUs) and the associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on protein database search by using the intertask parallelization technique, using the GPU only to perform the SW computations one by one. Hence, in this paper, we propose an efficient SW alignment method, called CUDA-SWfr, for protein database search using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on the GPU, a procedure is applied on the CPU using the frequency distance filtration scheme (FDFS) to eliminate unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
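Intratask parallelism in Smith-Waterman exploits the fact that all cells on one anti-diagonal of the dynamic-programming matrix are mutually independent. A minimal sketch follows, assuming a linear gap penalty, the full score matrix H in global memory, and one kernel launch per diagonal; it illustrates the wavefront idea rather than CUDA-SWfr itself.

    // Score all cells on anti-diagonal d (i + j == d) in parallel; each
    // cell depends only on diagonals d-1 and d-2, so within d all cells
    // are independent. q has length m, s has length n; H is (m+1)x(n+1).
    __global__ void swDiagonal(const char *q, const char *s, int m, int n,
                               int d, int *H, int match, int mism, int gap)
    {
        int iMin = max(1, d - n);
        int i = iMin + blockIdx.x * blockDim.x + threadIdx.x;  // row
        if (i > min(m, d - 1)) return;
        int j = d - i;                                         // column
        int w = n + 1;                                         // row stride
        int sub = (q[i - 1] == s[j - 1]) ? match : mism;
        int h = H[(i - 1) * w + (j - 1)] + sub;                // diagonal move
        h = max(h, H[(i - 1) * w + j] - gap);                  // gap in s
        h = max(h, H[i * w + (j - 1)] - gap);                  // gap in q
        H[i * w + j] = max(h, 0);                              // local alignment
    }

    // Host side: for (int d = 2; d <= m + n; ++d) launch swDiagonal with
    // enough threads to cover min(m, d - 1) - max(1, d - n) + 1 cells.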
Accelerating Matrix-Vector Multiplication on Hierarchical Matrices Using Graphical Processing Units
Boukaram, W.
2015-03-25
Large dense matrices arise from the discretization of many physical phenomena in computational sciences. In statistics very large dense covariance matrices are used for describing random fields and processes. One can, for instance, describe distribution of dust particles in the atmosphere, concentration of mineral resources in the earth's crust or uncertain permeability coefficient in reservoir modeling. When the problem size grows, storing and computing with the full dense matrix becomes prohibitively expensive both in terms of computational complexity and physical memory requirements. Fortunately, these matrices can often be approximated by a class of data sparse matrices called hierarchical matrices (H-matrices) where various sub-blocks of the matrix are approximated by low rank matrices. These matrices can be stored in memory that grows linearly with the problem size. In addition, arithmetic operations on these H-matrices, such as matrix-vector multiplication, can be completed in almost linear time. Originally the H-matrix technique was developed for the approximation of stiffness matrices coming from partial differential and integral equations. Parallelizing these arithmetic operations on the GPU has been the focus of this work and we will present work done on the matrix vector operation on the GPU using the KSPARSE library.
Directory of Open Access Journals (Sweden)
Ciznicki Milosz
2014-12-01
Full Text Available Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realized that heterogeneous systems, e.g. a combination of a shared memory multi-core CPU machine with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities to a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned to different types of processor units in time, taking into account their specific resource requirements. Additionally, one should note that available heterogeneous resources have been designed as general purpose units, however, with many built-in features accelerating specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for CPU or GPU. Nevertheless, from the perspective of various evaluation criteria, e.g. the total execution time or energy consumption, we may observe completely different results. Therefore, as tasks can be scheduled and managed in many alternative ways on both many-core CPUs or GPUs and consequently have a huge impact on the overall computing resources performance, there is a need for new and improved resource management techniques. In this paper we discuss results achieved during experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level for improving scheduling policies, taking into account the diversity of tasks and heterogeneous computing resource characteristics.
Directory of Open Access Journals (Sweden)
Nicholas P. Bailey, Trond S. Ingebrigtsen, Jesper Schmidt Hansen, Arno A. Veldhorst, Lasse Bøhling, Claire A. Lemarchand, Andreas E. Olsen, Andreas K. Bacher, Lorenzo Costigliola, Ulf R. Pedersen, Heine Larsen, Jeppe C. Dyre, Thomas B. Schrøder
2017-12-01
Full Text Available RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPUs). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up to a hundred thousand particles). It has a performance that is comparable to other GPU-MD codes at large system sizes and substantially better at smaller sizes. RUMD is open-source and consists of a library written in C++ and the CUDA extension to C, an easy-to-use Python interface, and a set of tools for set-up and post-simulation data analysis. The paper describes RUMD's main features, optimizations and performance benchmarks.
Multi-GPU based acceleration of a list-mode DRAMA toward real-time OpenPET imaging
Energy Technology Data Exchange (ETDEWEB)
Kinouchi, Shoko [Chiba Univ. (Japan); National Institute of Radiological Sciences, Chiba (Japan); Yamaya, Taiga; Yoshida, Eiji; Tashima, Hideaki [National Institute of Radiological Sciences, Chiba (Japan); Kudo, Hiroyuki [Tsukuba Univ., Ibaraki (Japan); Suga, Mikio [Chiba Univ. (Japan)
2011-07-01
OpenPET, which has a physical gap between two detector rings, is our new PET geometry. In order to realize future radiation therapy guided by OpenPET, real-time imaging is required. Therefore we developed a list-mode image reconstruction method using general purpose graphics processing units (GPUs). For GPU implementation, the efficiency of acceleration depends on the implementation method, which is required to avoid conditional statements. Therefore, in our previous study, we developed a new system model which was suited to GPU implementation. In this paper, we implemented our image reconstruction method using 4 GPUs to obtain further acceleration. We applied the developed reconstruction method to a small OpenPET prototype. The calculation time for the total iteration using 4 GPUs was 3.4 times faster than that using a single GPU. Compared to using a single CPU, we achieved a reconstruction time speed-up of 142 times using 4 GPUs. (orig.)
GPU-accelerated ray-tracing for real-time treatment planning
International Nuclear Information System (INIS)
Heinrich, H; Ziegenhein, P; Kamerling, C P; Oelfke, U; Froening, H
2014-01-01
Dose calculation methods in radiotherapy treatment planning require the radiological depth information of the voxels that represent the patient volume to correct for tissue inhomogeneities. This information is acquired by time-consuming ray-tracing-based calculations. For treatment planning scenarios with changing geometries and real-time constraints this is a severe bottleneck. We implemented an algorithm for the graphics processing unit (GPU) which uses a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different strategies of accessing memory in kernel implementations as well as strategies for rapid data transfers between main memory and the memory of the graphics device. Our study included the overlapping of computations and memory transfers to reduce the overall runtime using Hyper-Q. We tested our approach on a prostate case (9 beams, coplanar). The measured execution times for a complete ray-tracing range from 28 msec for the computations on the GPU to 99 msec when considering data transfers to and from the graphics device. Our GPU-based algorithm performed the ray-tracing in real-time. The strategies efficiently reduce the time consumption of memory accesses and data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine.
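The overlap of computation and transfers mentioned above is typically expressed with CUDA streams, which Hyper-Q can schedule concurrently on the device. The following self-contained sketch (hypothetical, not the authors' code; `traceChunk` is a stand-in for the real ray-tracing kernel) shows the pattern: each chunk's copy-in, kernel, and copy-out are issued to a separate stream so transfers of one chunk overlap with computation on another.

```cuda
#include <cuda_runtime.h>

__global__ void traceChunk(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 1.05f;   // stand-in for real ray-tracing work
}

int main() {
    const int nChunks = 4, chunkElems = 1 << 20;
    float *hBuf, *dBuf;
    cudaMallocHost((void**)&hBuf, nChunks * chunkElems * sizeof(float)); // pinned memory enables async copies
    cudaMalloc((void**)&dBuf, nChunks * chunkElems * sizeof(float));
    cudaStream_t s[nChunks];
    for (int c = 0; c < nChunks; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < nChunks; ++c) {
        size_t off = (size_t)c * chunkElems;
        cudaMemcpyAsync(dBuf + off, hBuf + off, chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        traceChunk<<<(chunkElems + 255) / 256, 256, 0, s[c]>>>(dBuf + off, chunkElems);
        cudaMemcpyAsync(hBuf + off, dBuf + off, chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    for (int c = 0; c < nChunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(hBuf); cudaFree(dBuf);
    return 0;
}
```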
Li, Pengcheng; Liu, Celong; Li, Xianpeng; He, Honghui; Ma, Hui
2016-09-20
In earlier studies, we developed scattering models and the corresponding CPU-based Monte Carlo simulation programs to study the behavior of polarized photons as they propagate through complex biological tissues. The high degrees of freedom involved in studying the simulation results created a demand for massive simulation tasks. In this paper, we report a parallel implementation of the simulation program based on the compute unified device architecture running on a graphics processing unit (GPU). Different schemes for sphere-only simulations and sphere-cylinder mixture simulations were developed. Diverse optimization methods were employed to achieve the best acceleration. The final-version GPU program is hundreds of times faster than the CPU version. The dependence of the performance on input parameters and precision was also studied. It is shown that using single precision in the GPU simulations results in very limited losses in accuracy. Consumer-level graphics cards, even those in laptop computers, are more cost-effective than scientific graphics cards for single-precision computation.
Directory of Open Access Journals (Sweden)
M. A. Farkov
2014-01-01
Full Text Available An analysis of numerical optimization methods for solving the molecular docking problem has been performed. Additional requirements imposed on optimization methods by GPU architectural features were specified. A method promising for GPU implementation was selected. Its implementation was described, and performance and accuracy tests were performed.
iMOSFLM: a new graphical interface for diffraction-image processing with MOSFLM
International Nuclear Information System (INIS)
Battye, T. Geoff G.; Kontogiannis, Luke; Johnson, Owen; Powell, Harold R.; Leslie, Andrew G. W.
2011-01-01
A new graphical user interface to the MOSFLM program has been developed to simplify the processing of macromolecular diffraction data. The interface, iMOSFLM, allows data processing via a series of clearly defined tasks and provides visual feedback on the progress of each stage. iMOSFLM is a graphical user interface to the diffraction data-integration program MOSFLM. It is designed to simplify data processing by dividing the process into a series of steps, which are normally carried out sequentially. Each step has its own display pane, allowing control over parameters that influence that step and providing graphical feedback to the user. Suitable values for integration parameters are set automatically, but additional menus provide a detailed level of control for experienced users. The image display and the interfaces to the different tasks (indexing, strategy calculation, cell refinement, integration and history) are described. The most important parameters for each step and the best way of assessing success or failure are discussed.
GASPRNG: GPU accelerated scalable parallel random number generator library
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to use GASPRNG the same way as SPRNG on traditional serial or parallel computers, as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or
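GASPRNG's own interface is not reproduced here. As a hedged illustration of the usage model such libraries expose (per-thread, independently seeded pseudorandom streams generated directly on the device), the following minimal sketch uses NVIDIA's cuRAND device API:

```cuda
#include <curand_kernel.h>
#include <cuda_runtime.h>

// Each thread owns an independent PRNG stream, created from a common seed
// and a distinct subsequence id, so streams do not overlap.
__global__ void sample(float* out, int nPerThread, unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, tid, 0, &st);        // seed, subsequence, offset
    for (int k = 0; k < nPerThread; ++k)
        out[tid * nPerThread + k] = curand_uniform(&st);  // uniform in (0, 1]
}

int main() {
    const int threads = 256, blocks = 64, nPerThread = 16;
    float* d;
    cudaMalloc((void**)&d, (size_t)threads * blocks * nPerThread * sizeof(float));
    sample<<<blocks, threads>>>(d, nPerThread, 1234ULL);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```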
Parallel computing in cluster of GPU applied to a problem of nuclear engineering
International Nuclear Information System (INIS)
Moraes, Sergio Ricardo S.; Heimlich, Adino; Resende, Pedro
2013-01-01
Cluster computing has been widely used as a low-cost alternative for parallel processing in scientific applications. With the use of the Message-Passing Interface (MPI) protocol, development became even more accessible and widespread in the scientific community. A more recent trend is the use of the Graphics Processing Unit (GPU), a powerful co-processor able to perform hundreds of instructions in parallel, reaching a processing capacity hundreds of times that of a CPU. However, a standard PC does not in general accommodate more than two GPUs. Hence, this work proposes the development and evaluation of a low-cost hybrid parallel approach to a typical nuclear engineering problem. The idea is to use cluster parallelism (MPI) together with GPU programming techniques (CUDA - Compute Unified Device Architecture) to simulate neutron transport through a slab using the Monte Carlo method. Using a cluster comprising four quad-core computers with 2 GPUs each, programs were developed with MPI and CUDA technologies. Experiments applying different configurations, from 1 to 8 GPUs, have been performed and the results were compared with the sequential (non-parallel) version. A speed-up of about 2000 times has been observed when comparing the 8-GPU with the sequential version. The results presented here are discussed and analyzed with the objective of outlining gains and possible limitations of the proposed approach. (author)
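The MPI-plus-CUDA structure described above can be sketched as follows. This is a hypothetical skeleton under the assumption of one MPI rank per GPU; `tallyKernel` is a placeholder for the actual Monte Carlo histories, and the compile line is one typical way to combine the two toolchains.

```cuda
// Hybrid MPI+CUDA skeleton. Typical build: nvcc -ccbin mpicxx hybrid.cu -o hybrid
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void tallyKernel(double* partial) {
    // Placeholder: a real kernel would run many neutron histories per thread
    if (blockIdx.x == 0 && threadIdx.x == 0) *partial = 1.0;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nDev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&nDev);
    cudaSetDevice(rank % nDev);            // map MPI ranks onto the node's GPUs

    double *dTally, hTally = 0.0, total = 0.0;
    cudaMalloc((void**)&dTally, sizeof(double));
    tallyKernel<<<128, 256>>>(dTally);
    cudaMemcpy(&hTally, dTally, sizeof(double), cudaMemcpyDeviceToHost);

    // Combine per-GPU partial tallies across the cluster
    MPI_Reduce(&hTally, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("combined tally = %f\n", total);
    cudaFree(dTally);
    MPI_Finalize();
    return 0;
}
```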
Cai, Xiaohui; Liu, Yang; Ren, Zhiming
2018-06-01
Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt structures. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphics processing unit (GPU) was introduced to improve the efficiency of RTM. In this paper, we develop three improvement strategies for implementing RTM on a GPU card. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on the central processing unit (CPU), we study the optimal LS-based FD method on the GPU. Second, we extend the CPU-based hybrid absorbing boundary condition (ABC) to a GPU-based one by addressing two issues the former exhibits when moved to a GPU card: its time consumption and chaotic threads. Third, for large-scale data, a combined strategy of optimal checkpointing and efficient boundary storage is introduced to trade off memory against recomputation. To save communication time between host and disk, a portable operating system interface (POSIX) thread is utilized to engage another CPU core at the checkpoints. Applications of the three strategies on a GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.
Graphical user interface for image acquisition and processing
Goldberg, Kenneth A.
2002-01-01
An event-driven, GUI-based image acquisition interface for the IDL programming environment is described. It is designed for CCD camera control and image acquisition directly into the IDL environment, where image manipulation and data analysis can be performed, and it provides a toolbox of real-time analysis applications. Running the image acquisition hardware directly from IDL removes the necessity of first saving images in one program and then importing the data into IDL for analysis in a second step. Bringing the data directly into IDL creates an opportunity for the implementation of IDL image processing and display functions in real-time. The program allows control over the available charge coupled device (CCD) detector parameters, data acquisition, file saving and loading, and image manipulation and processing, all from within IDL. The program is built using IDL's widget libraries to control the on-screen display and user interface.
Accelerating Malware Detection via a Graphics Processing Unit
2010-09-01
[Front-matter acronym list: PE (Portable Executable), COFF (Common Object File Format).] The PE format is an updated version of the common object file format (COFF) [Mic06]. These alerts can be costly in terms of time and resources for individuals and organizations to investigate each misidentified file [YWL07] [Vak10]
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary: Program title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL
Accelerating Solution Proposal of AES Using a Graphic Processor
Directory of Open Access Journals (Sweden)
STRATULAT, M.
2011-11-01
Full Text Available The main goal of this work is to analyze the possibility of using a graphics processing unit in non-graphical calculations. Graphics processing units are being used nowadays not only for game engines and movie encoding/decoding, but also for a vast area of applications, such as cryptography. We used the graphics processing unit as a cryptographic coprocessor in order to accelerate the AES algorithm. Our implementation of AES runs on a GPU using the CUDA architecture. The performance obtained shows that the CUDA implementation can reach a throughput of 11.95 Gbps. The tests were conducted in two directions: running on small data sets located in memory and on large data sets stored in files on hard drives.
Graphical analysis of processes with multiple activation energies
International Nuclear Information System (INIS)
Lachter, J.; Bragg, R.H.; Close, E.
1986-01-01
The activation energies characterizing a kinetic process are derived from the slopes of the Arrhenius diagrams obtained by plotting rate constants versus reciprocal temperature. Those rate constants correspond to the shifts along the time axis needed to superpose the successive isotherms. A general method based on Chebyshev interpolation is proposed for the optimization of the superposition of the experimental data points. This method is applied to determine the activation energies of the graphitization kinetics of the interlayer spacings of pitch coke and pyrocarbon samples.
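For reference, the relation underlying the slope extraction is the Arrhenius law; the activation energy follows directly from the slope of the ln k versus 1/T plot:

```latex
% Arrhenius relation: the slope of \ln k versus 1/T yields the activation energy.
k(T) = A \exp\!\left(-\frac{E_a}{RT}\right)
\quad\Longrightarrow\quad
\ln k = \ln A - \frac{E_a}{R}\,\frac{1}{T},
\qquad
E_a = -R\,\frac{\mathrm{d}(\ln k)}{\mathrm{d}(1/T)} .
```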
Energy Technology Data Exchange (ETDEWEB)
Qi, X. Sharon, E-mail: xqi@mednet.ucla.edu [Department of Radiation Oncology, University of California Los Angeles, Los Angeles, California (United States); Santhanam, Anand; Neylon, John; Min, Yugang; Armstrong, Tess; Sheng, Ke [Department of Radiation Oncology, University of California Los Angeles, Los Angeles, California (United States); Staton, Robert J.; Pukala, Jason [Department of Radiation Oncology, UF Health Cancer Center - Orlando Health, Orlando, Florida (United States); Pham, Andrew; Low, Daniel A.; Lee, Steve P. [Department of Radiation Oncology, University of California Los Angeles, Los Angeles, California (United States); Steinberg, Michael; Manon, Rafael [Department of Radiation Oncology, UF Health Cancer Center - Orlando Health, Orlando, Florida (United States); Chen, Allen M.; Kupelian, Patrick [Department of Radiation Oncology, University of California Los Angeles, Los Angeles, California (United States)
2015-06-01
Purpose: The purpose of this study was to systematically monitor anatomic variations and their dosimetric consequences during intensity modulated radiation therapy (IMRT) for head and neck (H&N) cancer by using a graphics processing unit (GPU)-based deformable image registration (DIR) framework. Methods and Materials: Eleven H&N patients undergoing IMRT with daily megavoltage computed tomography (CT) and weekly kilovoltage CT (kVCT) scans were included in this analysis. Pretreatment kVCTs were automatically registered with their corresponding planning CTs through a GPU-based DIR framework. The deformation of each contoured structure in the H&N region was computed to account for nonrigid change in the patient setup. The Jacobian determinant of the planning target volumes and the surrounding critical structures were used to quantify anatomical volume changes. The actual delivered dose was calculated accounting for the organ deformation. The dose distribution uncertainties due to registration errors were estimated using a landmark-based gamma evaluation. Results: Dramatic interfractional anatomic changes were observed. During the treatment course of 6 to 7 weeks, the parotid gland volumes changed up to 34.7%, and the center-of-mass displacement of the 2 parotid glands varied in the range of 0.9 to 8.8 mm. For the primary treatment volume, the cumulative minimum, mean, and equivalent uniform doses assessed by the weekly kVCTs were lower than the planned doses by up to 14.9% (P=.14), 2% (P=.39), and 7.3% (P=.05), respectively. The cumulative mean doses were significantly higher than the planned dose for the left parotid (P=.03) and right parotid glands (P=.006). The computation including DIR and dose accumulation was ultrafast (∼45 seconds) with registration accuracy at the subvoxel level. Conclusions: A systematic analysis of anatomic variations in the H&N region and their dosimetric consequences is critical in improving treatment efficacy. Nearly real
FastGCN: a GPU accelerated tool for fast gene co-expression networks.
Directory of Open Access Journals (Sweden)
Meimei Liang
Full Text Available Gene co-expression networks constitute one type of valuable biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time-consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphics Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and the GPU implementation outperforms the multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we achieved greater than 63 times the speed using the GPU implementation compared with the single-thread R implementation when 50 percent of genes were filtered out, and about 80 times the speed when no genes were filtered out.
CMFD and GPU acceleration on method of characteristics for hexagonal cores
International Nuclear Information System (INIS)
Han, Yu; Jiang, Xiaofeng; Wang, Dezhong
2014-01-01
Highlights: • A merged hex-mesh CMFD method solved via tri-diagonal matrix inversion. • Alternative hardware acceleration of using inexpensive GPU. • A hex-core benchmark with solution to confirm two acceleration methods. - Abstract: Coarse Mesh Finite Difference (CMFD) has been widely adopted as an effective way to accelerate the source iteration of transport calculation. However, in a core with hexagonal assemblies there are non-hexagonal meshes around the edges of assemblies, causing a problem for CMFD if the CMFD equations are still to be solved via tri-diagonal matrix inversion by simply scanning the whole core meshes in different directions. To solve this problem, we propose an unequal mesh CMFD formulation that combines the non-hexagonal cells on the boundary of neighboring assemblies into non-regular hexagonal cells. We also investigated the alternative hardware acceleration of using graphics processing units (GPU) with a graphics card in a personal computer. The tool CUDA is employed, which is a parallel computing platform and programming model invented by the company NVIDIA for harnessing the power of the GPU. To investigate and implement these two acceleration methods, a 2-D hexagonal core transport code using the method of characteristics (MOC) is developed. A hexagonal mini-core benchmark problem is established to confirm the accuracy of the MOC code and to assess the effectiveness of CMFD and GPU parallel acceleration. For this benchmark problem, the CMFD acceleration increases the speed 16 times while the GPU acceleration speeds it up 25 times. When used simultaneously, they provide a speed gain of 292 times.
Parallel GPU implementation of PWR reactor burnup
International Nuclear Information System (INIS)
Heimlich, A.; Silva, F.C.; Martinez, A.S.
2016-01-01
Highlights: • Three GPU algorithms used to evaluate the burn-up in a PWR reactor. • Exhibit speed improvement exceeding 200 times over the sequential solver. • The C++ container is expansible to accept new nuclide chains. - Abstract: This paper surveys three methods, implemented for multi-core CPU and graphics processing unit (GPU), to evaluate the fuel burn-up in a pressurized light-water nuclear reactor (PWR) using the solutions of a large system of coupled ordinary differential equations. The reactor physics simulation of a PWR spends a long execution time on burn-up calculations, so performance improvement using the GPU can enable better core design and thus an extended fuel life cycle. The results of this study exhibit a speed improvement exceeding 200 times over the sequential solver, within 1% accuracy.
Higher-order ice-sheet modelling accelerated by multigrid on graphics cards
Brædstrup, Christian; Egholm, David
2013-04-01
Higher-order ice flow modelling is a very computer-intensive process owing primarily to the nonlinear influence of the horizontal stress coupling. When applied to simulating long-term glacial landscape evolution, ice-sheet models must consider very long time series, while both high temporal and spatial resolution are needed to resolve small effects. Higher-order and full-Stokes models have therefore seen very limited use in this field. However, recent advances in graphics card (GPU) technology for high-performance computing have proven extremely efficient in accelerating many large-scale scientific computations. The general-purpose GPU (GPGPU) technology is cheap, has a low power consumption and fits into a normal desktop computer. It could therefore provide a powerful tool for many glaciologists working on ice flow models. Our current research focuses on utilising the GPU as a tool in ice-sheet and glacier modelling. To this end we have implemented the Integrated Second-Order Shallow Ice Approximation (iSOSIA) equations on the device using the finite difference method. To accelerate the computations, the GPU solver uses a non-linear Red-Black Gauss-Seidel iterator coupled with a Full Approximation Scheme (FAS) multigrid setup to further aid convergence. The GPU finite difference implementation provides the inherent parallelization that scales from hundreds to several thousands of cores on newer cards. We demonstrate the efficiency of the GPU multigrid solver using benchmark experiments.
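The red-black ordering is what makes Gauss-Seidel parallelizable on a GPU: cells of one color share no neighbors, so all of them may be updated concurrently. The kernel below is a hypothetical sketch for a 5-point Laplacian (not the iSOSIA discretization); the host launches it twice per smoothing sweep, once with `color = 0` and once with `color = 1`.

```cuda
// One smoothing sweep over cells of a single "color" for the Poisson
// problem laplacian(u) = rhs on an nx-by-ny grid with spacing h (h2 = h*h).
__global__ void rbGaussSeidel(float* u, const float* rhs,
                              int nx, int ny, float h2, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;   // skip cells of the other color
    int id = j * nx + i;
    // Gauss-Seidel update using the four nearest neighbors
    u[id] = 0.25f * (u[id - 1] + u[id + 1] + u[id - nx] + u[id + nx]
                     - h2 * rhs[id]);
}
```

In a FAS multigrid cycle, a few such red-black sweeps serve as the smoother on each grid level.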
Length-Bounded Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection
Directory of Open Access Journals (Sweden)
Yi-Shan Lin
2017-01-01
Full Text Available Since frequent communication between applications takes place in high-speed networks, deep packet inspection (DPI) plays an important role in network application awareness. The signature-based network intrusion detection system (NIDS) contains a DPI technique that examines incoming packet payloads by employing a pattern matching algorithm, which dominates the overall inspection performance. Existing studies have focused on implementing efficient pattern matching algorithms by parallel programming on software platforms because of the advantages of lower cost and higher scalability; either the central processing unit (CPU) or the graphics processing unit (GPU) was involved. Our studies focused on designing a pattern matching algorithm based on cooperation between both the CPU and the GPU. In this paper, we present an enhanced design of our previous work, a length-bounded hybrid CPU/GPU pattern matching algorithm (LHPMA). In the preliminary experiment, the performance and a comparison with the previous work are presented, and the experimental results show that the LHPMA can achieve not only effective CPU/GPU cooperation but also higher throughput than the previous method.
Directory of Open Access Journals (Sweden)
M. Huang
2015-09-01
Full Text Available The planetary boundary layer (PBL) is the lowest part of the atmosphere, and its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using the graphics processing unit (GPU) massively parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core when two K40 GPUs are applied.
Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.
Directory of Open Access Journals (Sweden)
Cheng-Ying Chou
Full Text Available Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consists of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus impose dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), the NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approach integrates the distinctive features of the symmetry properties of the imaging system and of GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered-subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
A fast and accurate image reconstruction using GPU for OpenPET prototype
International Nuclear Information System (INIS)
Kinouchi, Shoko; Suga, Mikio; Yamaya, Taiga; Yoshida, Eiji
2010-01-01
OpenPET, a positron emission tomography (PET) geometry with a physically open space between two detector rings, is our new design intended to enable PET imaging during radiation therapy once a real-time imaging system is realized. In this paper, therefore, we developed a list-mode image reconstruction method using general-purpose graphics processing units (GPUs). We used the list-mode dynamic row-action maximum likelihood algorithm (DRAMA). For GPU implementation, the efficiency of acceleration depends on the implementation method, which must avoid conditional statements. We developed a system model in which each element of the system matrix is calculated as the value of a detector response function (DRF) of the distance between the center of a voxel and a line of response (LOR). The system model was suited to GPU implementation, enabling us to calculate each element of the system matrix with a reduced number of conditional statements. We applied the developed method to a small OpenPET prototype, which was developed for a proof-of-concept. We measured the micro-Derenzo phantom placed at the gap. The results showed that reconstructed images of the same quality were achieved on the GPU as on the central processing unit (CPU), and the calculation speed on the GPU was 35.5 times faster than that on the CPU. (author)
GPU-accelerated Kernel Regression Reconstruction for Freehand 3D Ultrasound Imaging.
Wen, Tiexiang; Li, Ling; Zhu, Qingsong; Qin, Wenjian; Gu, Jia; Yang, Feng; Xie, Yaoqin
2017-07-01
The volume reconstruction method plays an important role in improving reconstructed volumetric image quality for freehand three-dimensional (3D) ultrasound imaging. By utilizing the capability of a programmable graphics processing unit (GPU), we can achieve real-time incremental volume reconstruction at a speed of 25-50 frames per second (fps). After incremental reconstruction and visualization, hole-filling is performed on the GPU to fill remaining empty voxels. However, traditional pixel-nearest-neighbor-based hole-filling fails to reconstruct volumes with high image quality. On the contrary, kernel regression provides an accurate volume reconstruction method for 3D ultrasound imaging, but at the cost of heavy computational complexity. In this paper, a GPU-based fast kernel regression method is proposed for high-quality volume reconstruction after the incremental reconstruction of freehand ultrasound. The experimental results show that improved image quality for speckle reduction and detail preservation can be obtained with the parameter setting of a kernel window size of [Formula: see text] and a kernel bandwidth of 1.0. The computational performance of the proposed GPU-based method can be over 200 times faster than that on a central processing unit (CPU), and the volume with a size of 50 million voxels in our experiment can be reconstructed within 10 seconds.
Fast Simulation of Large-Scale Floods Based on GPU Parallel Computing
Directory of Open Access Journals (Sweden)
Qiang Liu
2018-05-01
Full Text Available Computing speed is a significant issue for large-scale flood simulations requiring real-time response for disaster prevention and mitigation. Even today, most large-scale flood simulations are generally run on supercomputers due to the massive amounts of data and computation necessary. In this work, a two-dimensional shallow water model based on an unstructured Godunov-type finite volume scheme was proposed for flood simulation. To realize a fast simulation of large-scale floods on a personal computer, a Graphics Processing Unit (GPU)-based, high-performance computing method using OpenACC was adopted to parallelize the shallow water model. An unstructured data management method was presented to control the data transport between the GPU and the CPU (Central Processing Unit) with minimum overhead, and then both computation and data were offloaded from the CPU to the GPU, exploiting the computational capability of the GPU as much as possible. The parallel model was validated using various benchmarks and real-world case studies. The results demonstrate that speed-ups of up to one order of magnitude can be achieved in comparison with the serial model. The proposed parallel model provides a fast and reliable tool with which to quickly assess flood hazards in large-scale areas and thus has bright application prospects for dynamic inundation risk identification and disaster assessment.
Applying graphics processor units to Monte Carlo dose calculation in radiation therapy
Directory of Open Access Journals (Sweden)
Bakhtiari M
2010-01-01
Full Text Available We investigate the potential of using a graphics processing unit (GPU) for Monte Carlo (MC)-based radiation dose calculations. The percent depth dose (PDD) of photons in a medium with known absorption and scattering coefficients is computed using an MC simulation running on both a standard CPU and a GPU. We demonstrate that the GPU's capability for massively parallel processing provides a significant acceleration of the MC calculation, and offers a significant advantage for distributed stochastic simulations on a single computer. Harnessing this potential of GPUs will help in the early adoption of MC for routine planning in a clinical environment.
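The core of such a simulation maps naturally onto one photon history per thread. The kernel below is a deliberately simplified, hypothetical sketch (not the paper's code): it samples exponential free paths from a single total attenuation coefficient `mu` and tallies where the first interaction occurs, which is enough to show the parallelization pattern.

```cuda
#include <curand_kernel.h>

// Each thread tracks photonsPerThread photons through a homogeneous slab,
// sampling free paths s ~ Exp(mu) and depositing counts in depth bins.
__global__ void pddKernel(unsigned int* depthBins, int nBins, float binWidth,
                          float mu, int photonsPerThread, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, tid, 0, &st);
    for (int p = 0; p < photonsPerThread; ++p) {
        // Inverse-transform sampling of the free path: s = -ln(xi) / mu
        float s = -logf(curand_uniform(&st)) / mu;
        int bin = (int)(s / binWidth);
        if (bin < nBins)
            atomicAdd(&depthBins[bin], 1u);   // concurrent tally across threads
    }
}
```

A full MC dose engine would add scattering, energy deposition and heterogeneous media, but the embarrassingly parallel per-history structure is the same.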
Discrete-Event Execution Alternatives on General Purpose Graphical Processing Units
International Nuclear Information System (INIS)
Perumalla, Kalyan S.
2006-01-01
Graphics cards, traditionally designed as accelerators for computer graphics, have evolved to support more general-purpose computation. General Purpose Graphical Processing Units (GPGPUs) are now being used as highly efficient, cost-effective platforms for executing certain simulation applications. While most of these applications belong to the category of time-stepped simulations, little is known about the applicability of GPGPUs to discrete event simulation (DES). Here, we identify some of the issues and challenges that the GPGPU stream-based interface raises for DES, and present some possible approaches to moving DES to GPGPUs. Initial performance results on simulation of a diffusion process show that DES-style execution on GPGPU runs faster than DES on CPU and also significantly faster than time-stepped simulations on either CPU or GPGPU.
International Nuclear Information System (INIS)
Gu Xuejun; Jia Xun; Jiang, Steve B; Jelen, Urszula; Li Jinsheng
2011-01-01
Targeting the development of an accurate and efficient dose calculation engine for online adaptive radiotherapy, we have implemented a finite-size pencil beam (FSPB) algorithm with a 3D-density correction method on a graphics processing unit (GPU). This new GPU-based dose engine is built on our previously published ultrafast FSPB computational framework (Gu et al 2009 Phys. Med. Biol. 54 6287-97). Dosimetric evaluations against Monte Carlo dose calculations are conducted on ten IMRT treatment plans (five head-and-neck cases and five lung cases). For all cases, there is improvement with the 3D-density correction over the conventional FSPB algorithm, and for most cases the improvement is significant. Regarding efficiency, because of the appropriate arrangement of memory access and the usage of GPU intrinsic functions, the dose calculation for an IMRT plan can be accomplished well within 1 s (except for one case) with this new GPU-based FSPB algorithm. Compared to the previous GPU-based FSPB algorithm without 3D-density correction, this new algorithm, though slightly sacrificing computational efficiency (∼5-15% lower), has significantly improved dose calculation accuracy, making it more suitable for online IMRT replanning.
Eslami, Taban; Saeed, Fahad
2018-04-20
Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique, which has been regularly used for studying the brain's functional activities in the past few years. A widely used measure for capturing functional associations in the brain is Pearson's correlation coefficient. Pearson's correlation is widely used for constructing functional networks and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivities among brain regions. The fMRI scanners produce a huge number of voxels, and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time-consuming, especially when a large number of subjects are being studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson's correlation coefficients. Based on the symmetric property of Pearson's correlation, this approach returns the N(N − 1)/2 correlation coefficients located in the strictly upper-triangular part of the correlation matrix. Storing the correlations in a one-dimensional array, in the order proposed in this paper, is useful for further usage. Our experiments on real and synthetic fMRI data for different numbers of voxels and varying lengths of time series show that the proposed approach outperformed state-of-the-art GPU-based techniques as well as the sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than the CPU-based version and about 2 to 3 times faster than two other state-of-the-art GPU-based methods.
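A hedged sketch of the packed upper-triangle idea follows (not the Fast-GPU-PCC source). It assumes each row of `X` (one voxel's time series of length `T`) has been pre-normalized to zero mean and unit norm, so the correlation of two rows reduces to a dot product; the result for pair (i, j), i < j, is written to its row-major position in a one-dimensional array of length N(N − 1)/2.

```cuda
// Pairwise Pearson correlation for row-normalized data.
// Launch example: dim3 blk(16,16); dim3 grd((N+15)/16, (N+15)/16);
//                 pearsonUpper<<<grd, blk>>>(X, out, N, T);
__global__ void pearsonUpper(const float* X, float* out, int N, int T)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N || i >= j) return;      // keep i < j only
    float dot = 0.f;
    for (int t = 0; t < T; ++t)
        dot += X[i * T + t] * X[j * T + t];      // corr = dot product after normalization
    // Linear index of (i, j) in the packed strict upper triangle (row-major):
    long long k = (long long)i * N - (long long)i * (i + 1) / 2 + (j - i - 1);
    out[k] = dot;
}
```

Storing only the upper triangle halves the memory footprint, which matters at fMRI voxel counts.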
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster
Energy Technology Data Exchange (ETDEWEB)
Allada, Veerendra, Benjegerdes, Troy; Bode, Brett
2009-08-31
Commodity clusters augmented with application accelerators are evolving as competitive high-performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance-per-price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementations of the Basic Linear Algebra Subprograms (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to porting computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
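The measured pattern (host-to-device copies followed by a device GEMM) can be sketched with the modern cuBLAS API as follows; this is a minimal illustrative example, not the paper's benchmark, and error checking is omitted for brevity. cuBLAS assumes column-major storage.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const int n = 2048;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes);
    for (size_t i = 0; i < (size_t)n * n; ++i) { hA[i] = 1.f; hB[i] = 2.f; }

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes); cudaMalloc((void**)&dB, bytes); cudaMalloc((void**)&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // The host-to-device copies whose cost must be amortized against the GEMM
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);

    const float alpha = 1.f, beta = 0.f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);   // C = alpha*A*B + beta*C
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB);
    return 0;
}
```

Timing the two copy calls and the GEMM separately makes the amortization trade-off explicit: the larger n grows, the more the O(n³) GEMM dominates the O(n²) transfers.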
Defense Documentation Center, Alexandria, VA.
This unclassified-unlimited bibliography contains 183 references, with abstracts, dealing specifically with optical or graphic information processing. Citations are grouped under three headings: display devices and theory, character recognition, and pattern recognition. Within each group, they are arranged in accession number (AD-number) sequence.…
Cohen-Or, Daniel; Ju, Tao; Mitra, Niloy J; Shamir, Ariel; Sorkine-Hornung, Olga; Zhang, Hao (Richard)
2015-01-01
A Sampler of Useful Computational Tools for Applied Geometry, Computer Graphics, and Image Processing shows how to use a collection of mathematical techniques to solve important problems in applied mathematics and computer science areas. The book discusses fundamental tools in analytical geometry and linear algebra. It covers a wide range of topics, from matrix decomposition to curvature analysis and principal component analysis to dimensionality reduction. Written by a team of highly respected professors, the book can be used in a one-semester, intermediate-level course in computer science. It
GPU-Based 3D Cone-Beam CT Image Reconstruction for Large Data Volume
Directory of Open Access Journals (Sweden)
Xing Zhao
2009-01-01
Full Text Available Currently, 3D cone-beam CT image reconstruction speed is still a severe limitation for clinical application. The computational power of modern graphics processing units (GPUs) has been harnessed to provide impressive acceleration of 3D volume image reconstruction. For extra-large data volumes exceeding the physical graphics memory of the GPU, a straightforward compromise is to divide the data volume into blocks. Different from the conventional octree partition method, a new partition scheme is proposed in this paper. This method divides both the projection data and the reconstructed image volume into subsets according to geometric symmetries in the circular cone-beam projection layout, and a fast reconstruction for large data volumes can be implemented by packing the subsets of projection data into the RGBA channels of the GPU, performing the reconstruction chunk by chunk and combining the individual results in the end. The method is evaluated by reconstructing 3D images from computer-simulation data and real micro-CT data. Our results indicate that the GPU implementation can maintain the original precision and speed up the reconstruction process by 110-120 times for circular cone-beam scans, as compared to a traditional CPU implementation.
Yu, H.; Wang, Z.; Zhang, C.; Chen, N.; Zhao, Y.; Sawchuk, A. P.; Dalsing, M. C.; Teague, S. D.; Cheng, Y.
2014-11-01
Existing research on patient-specific computational hemodynamics (PSCH) relies heavily on software for anatomical extraction of blood arteries. Data reconstruction and mesh generation have to be done using existing commercial software due to the gap between medical image processing and CFD, which increases the computational burden, introduces inaccuracy during data transformation, and thus limits the medical applications of PSCH. We use the lattice Boltzmann method (LBM) to solve the level-set equation over an Eulerian distance field and implicitly and dynamically segment the artery surfaces from radiological CT/MRI imaging data. The segments feed seamlessly into the LBM-based CFD computation of PSCH; explicit mesh construction and extra data management are thus avoided. The LBM is ideally suited for GPU (graphics processing unit)-based parallel computing. The parallel acceleration over GPU achieves excellent performance in PSCH computation. An application study will be presented which segments an aortic artery from a chest CT dataset and models the PSCH of the segmented artery.
Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.
2017-07-01
Calculation of matrix-vector multiplication in real-world problems often involves large matrices of arbitrary size. Therefore, parallelization is needed to speed up the calculation process, which usually takes a long time. Graph partitioning techniques discussed in previous studies cannot be used for the parallelized calculation of matrix-vector multiplication with matrices of arbitrary size, because graph partitioning techniques assume square, symmetric matrices. Hypergraph partitioning techniques overcome this shortcoming of graph partitioning. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on the GPU (graphics processing unit).
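For context, a common GPU baseline for sparse matrix-vector multiplication is one thread per row over a CSR-format matrix; a hypothetical sketch is shown below. The partitioning technique decides which rows and vector entries each processor owns, while the per-partition kernel itself can stay this simple.

```cuda
// y = A*x for a sparse matrix A in CSR format (rowPtr, colIdx, val),
// one thread per row. rowPtr has nRows+1 entries.
__global__ void spmvCsr(const int* rowPtr, const int* colIdx,
                        const float* val, const float* x, float* y, int nRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float sum = 0.f;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        sum += val[k] * x[colIdx[k]];   // gather from x via column indices
    y[row] = sum;
}
```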
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics
Bard, Christopher; Dorelli, John C.
2013-01-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approximately 126 for a 1024 × 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
Moving-Target Position Estimation Using GPU-Based Particle Filter for IoT Sensing Applications
Directory of Open Access Journals (Sweden)
Seongseop Kim
2017-11-01
Full Text Available A particle filter (PF) has been introduced for effective position estimation of moving targets in non-Gaussian and nonlinear systems. The time difference of arrival (TDOA) method using an acoustic sensor array has normally been used to estimate the location of a concealed moving target, especially underwater. In this paper, we propose a GPU-based acceleration of target position estimation using a PF and propose an efficient system and software architecture. The proposed graphics processing unit (GPU)-based algorithm has advantages in applying PF signal processing to a target system consisting of large-scale Internet of Things (IoT)-driven sensors because of its scalable parallelization. For the TDOA measurement from the acoustic sensor array, we use the generalized cross-correlation phase transform (GCC-PHAT) method to obtain the correlation coefficient of the signal using the Fast Fourier Transform (FFT), and we accelerate the calculation of GCC-PHAT-based TDOA measurements using FFT with the GPU compute unified device architecture (CUDA). The proposed approach utilizes a parallelization method in the target position estimation algorithm using GPU-based PF processing. In addition, it can efficiently estimate sudden changes in target movement using GPU-based parallel computing, which can also be used for multiple target tracking. It also provides scalability in extending the detection algorithm as the number of sensors increases. Therefore, the proposed architecture can be applied in IoT sensing applications with a large number of sensors. The target estimation algorithm was verified using MATLAB and implemented using GPU CUDA. We implemented the proposed signal processing acceleration system using the target GPU and analyzed the execution time. The execution time of the algorithm is reduced by 55% compared to standalone CPU operation on the target embedded board, an NVIDIA Jetson TX1. Also, to apply large
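GCC-PHAT itself maps cleanly onto cuFFT plus one elementwise kernel. The following is a hedged sketch of that pipeline (not the authors' implementation): forward FFTs of the two sensor signals, PHAT weighting of the cross-spectrum R(f) = X(f)·conj(Y(f)) / |X(f)·conj(Y(f))|, then an inverse FFT whose argmax gives the TDOA in samples.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Elementwise PHAT weighting of the cross-spectrum.
__global__ void phatWeight(const cufftComplex* X, const cufftComplex* Y,
                           cufftComplex* R, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex c;                       // c = X[i] * conj(Y[i])
    c.x = X[i].x * Y[i].x + X[i].y * Y[i].y;
    c.y = X[i].y * Y[i].x - X[i].x * Y[i].y;
    float mag = sqrtf(c.x * c.x + c.y * c.y) + 1e-12f;   // avoid divide-by-zero
    R[i].x = c.x / mag;  R[i].y = c.y / mag;
}

// Host-side pipeline; dX and dY are overwritten in place by their spectra,
// and the inverse transform of cuFFT is unscaled (scale by 1/n if needed).
void gccPhat(cufftComplex* dX, cufftComplex* dY, cufftComplex* dR, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, dX, dX, CUFFT_FORWARD);
    cufftExecC2C(plan, dY, dY, CUFFT_FORWARD);
    phatWeight<<<(n + 255) / 256, 256>>>(dX, dY, dR, n);
    cufftExecC2C(plan, dR, dR, CUFFT_INVERSE);   // cross-correlation
    cufftDestroy(plan);
}
```

With many sensor pairs, batched plans let one call transform all pairs at once, which is where the scalability claimed above comes from.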
Energy Technology Data Exchange (ETDEWEB)
Bellezzo, Murillo
2014-09-01
As the most accurate method to estimate absorbed dose in radiotherapy, the Monte Carlo method (MCM) has been widely used in radiotherapy treatment planning. Nevertheless, its efficiency can be improved for clinical routine applications. In this thesis, the CUBMC code is presented, a GPU-based MC photon transport algorithm for dose calculation under the Compute Unified Device Architecture (CUDA) platform. The simulation of physical events is based on the algorithm used in PENELOPE, and the cross-section table used is the one generated by the MATERIAL routine, also present in the PENELOPE code. Photons are transported in voxel-based geometries with different compositions. There are two distinct approaches used for transport simulation. The first of them forces the photon to stop at every voxel boundary, while the second is the Woodcock method, in which the photon ignores the existence of boundaries and travels in a homogeneous fictitious medium. The CUBMC code aims to be an alternative Monte Carlo simulation code that, by using the parallel processing capability of graphics processing units (GPUs), provides high-performance simulations on low-cost compact machines, and thus can be applied in clinical cases and incorporated into treatment planning systems for radiotherapy. (author)
Mathematics of shape description a morphological approach to image processing and computer graphics
Ghosh, Pijush K
2009-01-01
Image processing problems are often not well defined because real images are contaminated with noise and other uncertain factors. In Mathematics of Shape Description, the authors address these problems using the morphological and set-theoretic approach to image processing and computer graphics, presenting a simple shape model built from two basic shape operators called Minkowski addition and decomposition. This book is ideal for professional researchers and engineers in Information Processing, Image Measurement, Shape Description, Shape Representation and Computer Graphics. Post-graduate and advanced undergraduate students in pure and applied mathematics, computer sciences, robotics and engineering will also benefit from this book. Key features: explains the fundamental and advanced relationships between algebraic systems and shape description through the set-theoretic approach; promotes interaction of image processing technology and mathematics in the field of algebraic geometry; P...
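The two operators the shape model builds on can be stated compactly; these are the standard definitions from mathematical morphology (Minkowski decomposition is also known as erosion):

```latex
% Minkowski addition (dilation) and Minkowski decomposition (erosion):
A \oplus B = \{\, a + b \mid a \in A,\; b \in B \,\},
\qquad
A \ominus B = \{\, x \mid x + b \in A \ \text{for all } b \in B \,\}.
```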
Directory of Open Access Journals (Sweden)
Ion LUNGU
2012-01-01
Full Text Available In this paper, we research, analyze and develop optimization solutions for the parallel reduction function using graphics processing units (GPUs) that implement the Compute Unified Device Architecture (CUDA), a modern and novel approach for improving the software performance of data processing applications and algorithms. Many of these applications and algorithms make use of the reduction function in their computational steps. After having designed the function and its algorithmic steps in CUDA, we progressively developed and implemented optimization solutions for the reduction function. In order to confirm, test and evaluate the solutions' efficiency, we developed a custom-tailored benchmark suite. We analyzed the obtained experimental results regarding: the comparison of execution time and bandwidth when using graphics processing units covering the main CUDA architectures (Tesla GT200, Fermi GF100, Kepler GK104) and a central processing unit; the influence of the data type; and the influence of the binary operator.
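One of the standard optimization stages for CUDA reduction is a shared-memory tree reduction with sequential addressing, shown below as a sketch (not the authors' benchmarked code). Each block reduces 2 × blockDim.x elements and writes one partial sum; a second pass (or a host-side sum) combines the block results.

```cuda
// Shared-memory tree reduction; blockDim.x must be a power of two.
// Launch example: reduceSum<<<blocks, threads, threads * sizeof(float)>>>(in, partial, n);
__global__ void reduceSum(const float* in, float* blockSums, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;
    float v = (i < n) ? in[i] : 0.f;
    if (i + blockDim.x < n) v += in[i + blockDim.x];  // first add during load
    sdata[tid] = v;
    __syncthreads();
    // Sequential addressing halves the active threads each step and
    // avoids shared-memory bank conflicts.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];   // one partial sum per block
}
```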
Solving global optimization problems on GPU cluster
Energy Technology Data Exchange (ETDEWEB)
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya [Lobachevsky State University of Nizhni Novgorod, Gagarin Avenue 23, 603950 Nizhni Novgorod (Russian Federation)
2016-06-08
The paper contains the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by reducing them to data-independent subproblems of smaller dimension that are solved in parallel. The new element implemented in this research consists of using several graphics accelerators at different computing nodes. The paper also includes results of solving problems of the well-known multiextremal GKLS test class on the Lobachevsky supercomputer using tens of thousands of GPU cores.
On-the-fly generation and rendering of infinite cities on the GPU
Steinberger, Markus; Kenzel, Michael; Kainz, Bernhard K.; Wonka, Peter; Schmalstieg, Dieter
2014-01-01
In this paper, we present a new approach for shape-grammar-based generation and rendering of huge cities in real-time on the graphics processing unit (GPU). Traditional approaches rely on evaluating a shape grammar and storing the geometry produced as a preprocessing step. During rendering, the pregenerated data is then streamed to the GPU. By interweaving generation and rendering, we overcome the problems and limitations of streaming pregenerated data. Using our methods of visibility pruning and adaptive level of detail, we are able to dynamically generate only the geometry needed to render the current view in real-time directly on the GPU. We also present a robust and efficient way to dynamically update a scene's derivation tree and geometry, enabling us to exploit frame-to-frame coherence. Our combined generation and rendering is significantly faster than all previous work. For detailed scenes, we are capable of generating geometry more rapidly than even just copying pregenerated data from main memory, enabling us to render cities with thousands of buildings at up to 100 frames per second, even with the camera moving at supersonic speed. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems
Energy Technology Data Exchange (ETDEWEB)
Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol
2013-04-09
A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges the recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J. Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing units (GPUs), is presented. We discuss the performance of this algorithm using the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated for the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.
Data assimilation using a GPU accelerated path integral Monte Carlo approach
Quinn, John C.; Abarbanel, Henry D. I.
2011-09-01
The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov chain Monte Carlo method designed to run in parallel on a graphics processing unit (GPU). We demonstrate the method on an example that takes the transmembrane voltage time series of a simulated neuron as input, using a Hodgkin-Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 300, compared to an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.
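The core GPU pattern here, many independent Markov chains advanced in lockstep with one chain per thread, can be sketched as below. This is a hedged illustration, not the authors' code: logPost is a placeholder for the paper's path-integral action, and the Gaussian random-walk proposal and its width are assumptions.

```cuda
#include <curand_kernel.h>

// One Metropolis random-walk chain per GPU thread, all advancing in parallel.
__device__ float logPost(float x) { return -0.5f * x * x; }  // placeholder target

__global__ void metropolisChains(float* samples, int nSteps,
                                 unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(seed, id, 0, &rng);        // independent random stream per chain

    float x = 0.0f, lp = logPost(x);
    for (int s = 0; s < nSteps; ++s) {
        float xp  = x + 0.5f * curand_normal(&rng);    // random-walk proposal
        float lpp = logPost(xp);
        if (logf(curand_uniform(&rng)) < lpp - lp) {   // Metropolis accept/reject
            x = xp; lp = lpp;
        }
    }
    samples[id] = x;                       // final state of this chain
}
```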
International Nuclear Information System (INIS)
Anantharaman, Rahul; Abbas, Own Syed; Gundersen, Truls
2006-01-01
Pinch Analysis, Exergy Analysis and Optimization have all been used independently or in combination for the energy integration of process plants. In order to address the issue of energy integration taking into account composition and pressure effects, the concept of energy level as proposed by [X. Feng, X.X. Zhu, Combining pinch and exergy analysis for process modifications, Appl. Therm. Eng. 17 (1997) 249] has been modified and expanded in this work. We have developed a strategy for energy integration that uses process simulation tools to define the interaction between the various subsystems in the plant, and a graphical technique that helps the engineer interpret the results of the simulation with physical insights pointing towards possible integration schemes to increase energy efficiency. The proposed graphical representation of the energy levels of processes is very similar to the Composite Curves of Pinch Analysis; the interpretation of the Energy Level Composite Curves reduces to the Pinch Analysis case when dealing with heat transfer. Other similarities and differences are detailed in this work. Energy integration of a methanol plant is taken as a case study to test the efficacy of this methodology. Potential integration schemes are identified that would have been difficult to visualize without the help of the new graphical representation.
Discovering epistasis in large scale genetic association studies by exploiting graphics cards.
Chen, Gary K; Guo, Yunfei
2013-12-03
Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyses genome-wide can quickly become intractable, because even the modest-size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, requiring tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPUs). We include tutorials on GPU technology, which convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores available on a single GPU: whereas high-end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from NVIDIA feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems; however, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations.
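The shared parallelization pattern behind these speedups is simple: assign each SNP pair to one GPU thread. The sketch below is illustrative only; pairScore is a stub standing in for whatever interaction statistic (MDR, logistic regression, chi-square, etc.) a given package computes, and genotypes are assumed coded 0/1/2 with a binary phenotype.

```cuda
// Illustrative pair-per-thread epistasis scan; not taken from any reviewed package.
__device__ float pairScore(const unsigned char* g1, const unsigned char* g2,
                           const unsigned char* pheno, int nSubjects) {
    int table[3][3][2] = {};               // genotype x genotype x phenotype counts
    for (int s = 0; s < nSubjects; ++s)
        table[g1[s]][g2[s]][pheno[s]]++;
    // ... compute a test statistic (e.g. chi-square) from the table ...
    return (float)table[0][0][0];          // placeholder return value
}

__global__ void epistasisScan(const unsigned char* genotypes,  // nSNPs x nSubjects
                              const unsigned char* pheno,
                              float* scores, int nSNPs, int nSubjects) {
    long long pair = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long nPairs = (long long)nSNPs * (nSNPs - 1) / 2;
    if (pair >= nPairs) return;

    // Recover the pair (i, j), i < j, from the linear index.
    long long i = 0, rem = pair;
    while (rem >= nSNPs - 1 - i) { rem -= nSNPs - 1 - i; ++i; }
    long long j = i + 1 + rem;

    scores[pair] = pairScore(genotypes + i * nSubjects,
                             genotypes + j * nSubjects, pheno, nSubjects);
}
```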
A graphical method for estimating the tunneling factor for mode conversion processes
International Nuclear Information System (INIS)
Swanson, D.G.
1994-01-01
The fundamental parameter characterizing the strength of any mode conversion process is the tunneling parameter, which is typically determined from a model dispersion relation that is transformed into a differential equation. Here, a graphical method is described that gives the tunneling parameter from quantities measured directly from a simple graph of the dispersion relation. The accuracy of the estimate depends only on the accuracy of the measurements.
Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture
Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek
2015-01-01
This paper evaluates the potential of the embedded graphics processing unit in NVIDIA's Tegra K1 for onboard processing. The performance is compared to a general-purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) algorithm. The Tegra K1 achieved 51% of the performance of a high-end 8-core Intel Xeon server CPU for the ACCA algorithm and 20% for the dimension reduction algorithm, with the CPU consuming 13.5 times more power.
Energy Technology Data Exchange (ETDEWEB)
Jimenez, Edward S. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Orr, Laurel J. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Thompson, Kyle R. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2013-09-01
The goal of this work is to develop a fast computed tomography (CT) reconstruction algorithm based on graphics processing units (GPUs) that achieves significant improvement over traditional central processing unit (CPU) based implementations. The main challenge in developing a CT algorithm that is capable of handling very large datasets is parallelizing the algorithm in such a way that data transfer does not hinder performance of the reconstruction algorithm. General-purpose computing on graphics processing units (GPGPU) is a new technology that the science and technology (S&T) community is starting to adopt in many fields where CPU-based computing is the norm. GPGPU programming requires a new approach to algorithm development that utilizes massively multi-threaded environments. Multi-threaded algorithms are in general difficult to optimize, since performance bottlenecks such as memory latencies occur that are non-existent in single-threaded algorithms. If an efficient GPU-based CT reconstruction algorithm can be developed, computational times could be improved by a factor of 20. Additionally, cost benefits will be realized as commodity graphics hardware could potentially replace expensive supercomputers and high-end workstations. This project will take advantage of the CUDA programming environment and attempt to parallelize the task in such a way that multiple slices of the reconstruction volume are computed simultaneously. This work will also take advantage of the GPU memory by utilizing asynchronous memory transfers, GPU texture memory, and (when possible) pinned host memory so that the memory transfer bottleneck inherent to GPGPU is amortized. Additionally, this work will take advantage of GPU-specific hardware (i.e. fast texture memory, pixel pipelines, hardware interpolators, and the varying memory hierarchy) that will allow for additional performance improvements.
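As a concrete illustration of the transfer-hiding techniques named above, here is a minimal sketch, under assumed names and sizes, of pinned host memory plus asynchronous copies on multiple CUDA streams, so that slice uploads, kernel execution and downloads overlap. reconstructSlice is a hypothetical placeholder kernel, not the reconstruction code described in the abstract.

```cuda
#include <cstring>
#include <cuda_runtime.h>

__global__ void reconstructSlice(const float* proj, float* slice, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) slice[i] = proj[i];         // placeholder for real back-projection
}

void reconstructVolume(const float* projHost, float* volHost,
                       int nSlices, int sliceElems) {
    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);

    size_t bytes = (size_t)sliceElems * sizeof(float);
    float *projPinned, *projDev, *sliceDev;
    cudaMallocHost(&projPinned, bytes * nSlices);   // pinned => truly async copies
    memcpy(projPinned, projHost, bytes * nSlices);
    cudaMalloc(&projDev, bytes * NSTREAMS);
    cudaMalloc(&sliceDev, bytes * NSTREAMS);

    for (int s = 0; s < nSlices; ++s) {
        // Buffer (s % NSTREAMS) is only touched by stream (s % NSTREAMS),
        // whose operations serialize, so reuse across slices is safe.
        cudaStream_t st = streams[s % NSTREAMS];
        float* pd = projDev  + (size_t)(s % NSTREAMS) * sliceElems;
        float* sd = sliceDev + (size_t)(s % NSTREAMS) * sliceElems;
        cudaMemcpyAsync(pd, projPinned + (size_t)s * sliceElems, bytes,
                        cudaMemcpyHostToDevice, st);
        reconstructSlice<<<(sliceElems + 255) / 256, 256, 0, st>>>(pd, sd,
                                                                   sliceElems);
        // For full overlap the destination should also be pinned memory.
        cudaMemcpyAsync(volHost + (size_t)s * sliceElems, sd, bytes,
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaFreeHost(projPinned); cudaFree(projDev); cudaFree(sliceDev);
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(streams[i]);
}
```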
GPU-based fast pencil beam algorithm for proton therapy
International Nuclear Information System (INIS)
Fujimoto, Rintaro; Nagamine, Yoshihiko; Kurihara, Tsuneya
2011-01-01
Performance of a treatment planning system is an essential factor in making sophisticated plans. The dose calculation is a major time-consuming process in planning operations. The standard algorithm for proton dose calculations is the pencil beam algorithm which produces relatively accurate results, but is time consuming. In order to shorten the computational time, we have developed a GPU (graphics processing unit)-based pencil beam algorithm. We have implemented this algorithm and calculated dose distributions in the case of a water phantom. The results were compared to those obtained by a traditional method with respect to the computational time and discrepancy between the two methods. The new algorithm shows 5-20 times faster performance using the NVIDIA GeForce GTX 480 card in comparison with the Intel Core-i7 920 processor. The maximum discrepancy of the dose distribution is within 0.2%. Our results show that GPUs are effective for proton dose calculations.
Hou, Zhenlong; Huang, Danian
2017-09-01
In this paper, we first study the inversion of probability tomography (IPT) with gravity gradiometry data. The spatial resolution of the results is improved by multi-tensor joint inversion, a depth-weighting matrix and other methods. To address the problems posed by big data in exploration, we present a parallel algorithm, and its performance analysis, combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) based on graphics processing unit (GPU) acceleration. In tests on a synthetic model and real data from Vinton Dome, we obtain improved results, and the improved inversion algorithm is shown to be effective and feasible. The parallel algorithm we designed performs better than other CUDA-based implementations, with a maximum speedup of more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger volumes of data, and the new analysis method is practical.
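A minimal sketch of the CUDA + OpenMP multi-GPU pattern described above (not the authors' code): one OpenMP host thread is bound to each GPU, and each device works on its own contiguous chunk of the model. forwardModel is a hypothetical placeholder kernel.

```cuda
#include <omp.h>
#include <cuda_runtime.h>

__global__ void forwardModel(const float* model, float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * model[i];   // placeholder physics
}

void multiGpuForward(const float* model, float* data, int nPerGpu) {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    #pragma omp parallel num_threads(nGpus)  // one host thread per device
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);                    // bind this thread to GPU g

        size_t bytes = (size_t)nPerGpu * sizeof(float);
        float *dModel, *dData;
        cudaMalloc(&dModel, bytes);
        cudaMalloc(&dData, bytes);
        cudaMemcpy(dModel, model + (size_t)g * nPerGpu, bytes,
                   cudaMemcpyHostToDevice);

        forwardModel<<<(nPerGpu + 255) / 256, 256>>>(dModel, dData, nPerGpu);

        cudaMemcpy(data + (size_t)g * nPerGpu, dData, bytes,
                   cudaMemcpyDeviceToHost);
        cudaFree(dModel); cudaFree(dData);
    }
}
```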
Cost-effective GPU-grid for genome-wide epistasis calculations.
Pütz, B; Kam-Thong, T; Karbalai, N; Altmann, A; Müller-Myhsok, B
2013-01-01
Until recently, genotype studies were limited to the investigation of single SNP effects due to the computational burden incurred when studying pairwise interactions of SNPs. However, some genetic effects as simple as coloring (in plants and animals) cannot be ascribed to a single locus but can only be understood when epistasis is taken into account [1]. It is expected that such effects are also found in complex diseases where many genes contribute to the clinical outcome of affected individuals. Only recently have such problems become computationally feasible. The inherently parallel structure of the problem makes it a perfect candidate for massive parallelization on either grid or cloud architectures. Since we are also dealing with confidential patient data, we were not able to consider a cloud-based solution but had to find a way to process the data in-house, and aimed to build a local GPU-based grid structure. Sequential epistasis calculations were ported to the GPU using CUDA at various levels. Parallelization on the CPU was compared to corresponding GPU counterparts with regard to performance and cost. A cost-effective solution was created by combining custom-built nodes equipped with relatively inexpensive consumer-level graphics cards carrying highly parallel GPUs in a local grid. The GPU method outperforms current cluster-based systems on a price/performance criterion, as a single GPU shows performance comparable to up to 200 CPU cores. The outlined approach will work for problems that easily lend themselves to massive parallelization. Code for various tasks has been made available, and ongoing development of tools will further ease the transition from sequential to parallel algorithms.
CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis
Directory of Open Access Journals (Sweden)
Federico Raimondo
2012-01-01
In recent years, Independent Component Analysis (ICA) has become a standard method for identifying relevant dimensions of data in neuroscience. ICA is a very reliable method for analyzing data, but it is computationally very costly; this cost makes online analysis of the data, as used in brain-computer interfaces, almost completely prohibitive. We show a roughly 25-fold speedup of ICA at almost no cost (a fast video card). EEG data, which consist of many repetitions of independent signals in multiple channels, are very suitable for processing using the vector processors included in graphics units. We profiled the implementation of this algorithm and detected the two main types of operations responsible for the processing bottleneck, taking almost 80% of computing time: vector-matrix and matrix-matrix multiplications. Simply replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by GPU manufacturers does not increase performance, due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase in performance for the ICA calculation.
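The kind of hand-written kernel that can beat a sequence of library calls once launch overhead dominates is a fused, shared-memory matrix multiply. The following is a generic tiled-GEMM sketch, not the CUDAICA source; for brevity it assumes square row-major matrices whose dimension is a multiple of the tile size.

```cuda
#define TILE 16

// Shared-memory tiled matrix multiply, C = A * B (row-major), n % TILE == 0.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Stage one tile of A and one tile of B in fast shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                   // tiles fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                   // done with these tiles
    }
    C[row * n + col] = acc;
}
```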
Energy Technology Data Exchange (ETDEWEB)
Liu, T.; Ding, A.; Ji, W.; Xu, X. G. [Nuclear Engineering and Engineering Physics, Rensselaer Polytechnic Inst., Troy, NY 12180 (United States); Carothers, C. D. [Dept. of Computer Science, Rensselaer Polytechnic Inst. RPI (United States); Brown, F. B. [Los Alamos National Laboratory (LANL) (United States)
2012-07-01
The Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on graphics processing units (GPUs), one of the latest parallel computing techniques under development. Porting a regular transport code to the GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation is different for eigenvalue calculations, which are performed on a generation-by-generation basis, so thread coordination must be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in the Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculations for simple geometries on a multi-GPU system. The specifics of the algorithm design, including thread organization and memory management, are described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system, respectively. The speedup factor was further increased by a factor of ∼2 on a dual-GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)
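The generation-by-generation structure mentioned above forces a host-side synchronization point between kernel launches, in contrast to a fixed-source run. The sketch below illustrates that control flow only; trackGeneration is a trivial placeholder, not the authors' transport kernel, and the source resampling step is elided.

```cuda
#include <cuda_runtime.h>

__global__ void trackGeneration(const float* src, float* fiss,
                                int* nFiss, int nHist) {
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    if (h >= nHist) return;
    // Placeholder physics: every history banks one fission site
    // (fiss is assumed sized to hold nHist sites).
    int slot = atomicAdd(nFiss, 1);
    fiss[slot] = src[h];
}

void powerIteration(float* srcSites, float* fissSites, int nHist, int nGen) {
    int* dNFiss;
    cudaMalloc(&dNFiss, sizeof(int));
    for (int gen = 0; gen < nGen; ++gen) {
        int zero = 0, nFiss = 0;
        cudaMemcpy(dNFiss, &zero, sizeof(int), cudaMemcpyHostToDevice);

        trackGeneration<<<(nHist + 127) / 128, 128>>>(srcSites, fissSites,
                                                      dNFiss, nHist);
        cudaDeviceSynchronize();               // generation barrier

        cudaMemcpy(&nFiss, dNFiss, sizeof(int), cudaMemcpyDeviceToHost);
        float kGen = (float)nFiss / (float)nHist;   // generation k estimate
        (void)kGen;
        // ... resample fissSites into srcSites for the next generation ...
    }
    cudaFree(dNFiss);
}
```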
A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG
Griffiths, M. K.; Fedun, V.; Erdélyi, R.
2015-03-01
Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.
Peng, Kuan; He, Ling; Zhu, Ziqiang; Tang, Jingtian; Xiao, Jiaying
2013-12-01
Compared with commonly used analytical reconstruction methods, the frequency-domain finite element method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented in the framework of the FEM-based reconstruction algorithm using a graphics-processing-unit parallel framework named the "compute unified device architecture" (CUDA). A series of simulation experiments is carried out to test the accuracy and accelerating effect of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced, by a factor of 38.9, with a GTX 580 graphics card using the improved method.
International Nuclear Information System (INIS)
Su, L.; Du, X.; Liu, T.; Xu, X. G.
2013-01-01
An electron-photon coupled Monte Carlo code ARCHER - Accelerated Radiation-transport Computations in Heterogeneous EnviRonments - is being developed at Rensselaer Polytechnic Institute as a software test-bed for emerging heterogeneous high-performance computers that utilize accelerators such as GPUs (graphics processing units). This paper presents the preliminary code development and testing involving radiation-dose-related problems. In particular, the paper discusses the electron transport simulations using the class-II condensed history method. The considered electron energies range from a few hundred keV to 30 MeV. As for the photon part, the photoelectric effect, Compton scattering and pair production were simulated. Voxelized geometry was supported. A serial CPU (central processing unit) code was first written in C++. The code was then ported to the GPU using the CUDA C 5.0 standards. The hardware involved a desktop PC with an Intel Xeon X5660 CPU and six NVIDIA Tesla M2090 GPUs. The code was tested for a case of a 20 MeV electron beam incident perpendicularly on a water-aluminum-water phantom. The depth and lateral dose profiles were found to agree with results obtained from well-tested MC codes. Using six GPU cards, 6×10^6 electron histories were simulated within 2 seconds. In comparison, the same case running the EGSnrc and MCNPX codes required 1645 seconds and 9213 seconds, respectively. Ongoing work continues to test the code for different medical applications such as radiotherapy and brachytherapy. (authors)
Cucheb: A GPU implementation of the filtered Lanczos procedure
Aurentz, Jared L.; Kalantzis, Vassilis; Saad, Yousef
2017-11-01
This paper describes the software package Cucheb, a GPU implementation of the filtered Lanczos procedure for the solution of large sparse symmetric eigenvalue problems. The filtered Lanczos procedure uses a carefully chosen polynomial spectral transformation to accelerate convergence of the Lanczos method when computing eigenvalues within a desired interval. This method has proven particularly effective for eigenvalue problems that arise in electronic structure calculations and density functional theory. We compare our implementation against an equivalent CPU implementation and show that using the GPU can reduce the computation time by more than a factor of 10.
Program summary:
Program title: Cucheb
Program files doi: http://dx.doi.org/10.17632/rjr9tzchmh.1
Licensing provisions: MIT
Programming language: CUDA C/C++
Nature of problem: Electronic structure calculations require the computation of all eigenvalue-eigenvector pairs of a symmetric matrix that lie inside a user-defined real interval.
Solution method: To compute all the eigenvalues within a given interval, a polynomial spectral transformation is constructed that maps the desired eigenvalues of the original matrix to the exterior of the spectrum of the transformed matrix. The Lanczos method is then used to compute the desired eigenvectors of the transformed matrix, which are then used to recover the desired eigenvalues of the original matrix. The bulk of the operations are executed in parallel using a graphics processing unit (GPU).
Runtime: Variable, depending on the number of eigenvalues sought and the size and sparsity of the matrix.
Additional comments: Cucheb is compatible with CUDA Toolkit v7.0 or greater.
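As a hedged illustration of how such a polynomial filter can be applied without ever forming p(A), the sketch below runs the Chebyshev three-term recurrence using only matrix-vector products. spmvDiag is a stand-in for a real sparse mat-vec (CSR kernel or cuSPARSE call), and A is assumed pre-scaled so that its spectrum lies in [-1, 1]; none of this is the Cucheb source.

```cuda
// Apply T_k(A)x via the recurrence T_{j+1} = 2 A T_j - T_{j-1}.
__global__ void spmvDiag(const float* diag, const float* x, float* Ax, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) Ax[i] = diag[i] * x[i];     // placeholder: A taken as diagonal
}

__global__ void chebStep(float* vm1, const float* Av, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) vm1[i] = 2.0f * Av[i] - vm1[i];   // T_{j+1} = 2 A T_j - T_{j-1}
}

// On entry: vm1 = T_0(A)x = x and v = T_1(A)x = A x. On exit: v = T_deg(A)x.
void chebApply(const float* diag, float*& v, float*& vm1, float* Av,
               int n, int degree) {
    int blocks = (n + 255) / 256;
    for (int j = 1; j < degree; ++j) {
        spmvDiag<<<blocks, 256>>>(diag, v, Av, n);
        chebStep<<<blocks, 256>>>(vm1, Av, n);
        float* t = vm1; vm1 = v; v = t;    // rotate: newest iterate now in v
    }
}
```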
Fast GPU-based spot extraction for energy-dispersive X-ray Laue diffraction
International Nuclear Information System (INIS)
Alghabi, F.; Schipper, U.; Kolb, A.; Send, S.; Abboud, A.; Pashniak, N.; Pietsch, U.
2014-01-01
This paper describes a novel method for fast online analysis of X-ray Laue spots taken by means of an energy-dispersive X-ray 2D detector. Current pnCCD detectors typically operate at some 100 Hz (up to a maximum of 400 Hz) and have a resolution of 384 × 384 pixels; future devices head for even higher pixel counts and frame rates. The proposed online data analysis is based on a computer utilizing multiple graphics processing units (GPUs), which allow for fast and parallel data processing. Our multi-GPU based algorithm is compliant with the rules of stream-based data processing, for which GPUs are optimized. The paper's main contribution is therefore an alternative algorithm for the determination of spot positions and energies over the full sequence of pnCCD data frames. Furthermore, an improved background suppression algorithm is presented. The resulting system is able to process data at the maximum acquisition rate of 400 Hz. We present a detailed analysis of the spot positions and energies deduced from a prior (single-core) CPU-based and the novel GPU-based data processing, showing that the parallel computed results using the GPU implementation are at least of the same quality as prior CPU-based results. Furthermore, the GPU-based algorithm is able to speed up the data processing by a factor of 7 (in comparison to the single-core CPU-based algorithm), which effectively makes the detector system more suitable for online data processing.
GPUs for event reconstruction in the FairRoot framework
International Nuclear Information System (INIS)
Al-Turany, M; Uhlig, F; Karabowicz, R
2010-01-01
FairRoot is the simulation and analysis framework used by the CBM and PANDA experiments at FAIR/GSI. The use of graphics processing units (GPUs) for event reconstruction in FairRoot will be presented. The fact that CUDA (Nvidia's Compute Unified Device Architecture) development tools work alongside the conventional C/C++ compiler makes it possible to mix GPU code with general-purpose code for the host CPU; based on this, some of the reconstruction tasks can be sent to the graphics cards. Moreover, tasks that run on the GPUs can also run in emulation mode on the host CPU, which has the advantage that the same code is used on both CPU and GPU.
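A minimal sketch of the point made above: in CUDA, a function can be compiled for both host and device, so the same helper serves the GPU path and the CPU fallback/emulation path. trackDistance is a hypothetical example, not FairRoot code.

```cuda
#include <cmath>

// Compiled for both host and device: one implementation, two execution paths.
__host__ __device__ float trackDistance(float x1, float y1,
                                        float x2, float y2) {
    float dx = x2 - x1, dy = y2 - y1;
    return sqrtf(dx * dx + dy * dy);
}

__global__ void distancesKernel(const float* x, const float* y,
                                float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) d[i] = trackDistance(x[i], y[i], x[i + 1], y[i + 1]);
}

void distancesHost(const float* x, const float* y, float* d, int n) {
    for (int i = 0; i < n - 1; ++i)        // same helper on the CPU path
        d[i] = trackDistance(x[i], y[i], x[i + 1], y[i + 1]);
}
```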
High performance MRI simulations of motion on multi-GPU systems.
Xanthis, Christos G; Venetis, Ioannis E; Aletras, Anthony H
2014-07-04
MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high-performance multi-GPU environment. Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times and to avoid spurious echo formation. The displacement of the anatomical model isochromats was calculated within the graphics processing unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were achieved through the introduction of software crushers, without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated almost linearly scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single-computer multi-GPU configuration. The incorporation
High-throughput GPU-based LDPC decoding
Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin
2010-08-01
Low-density parity-check (LDPC) code is a linear block code known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, Wi-Fi and 10GBASE-T. The need for reliable and flexible communication links across a wide variety of communication standards and configurations has inspired demand for high-performance, flexible computing. Accordingly, finding a fast and reconfigurable development platform for designing high-throughput LDPC decoders has become important, especially for rapidly changing communication standards and configurations. In this paper, a new graphics-processing-unit (GPU) LDPC decoding platform with asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved a 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.
AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics
Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.
2017-05-01
We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
Explicit integration with GPU acceleration for large kinetic networks
International Nuclear Information System (INIS)
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike
2015-01-01
We demonstrate the first implementation of recently developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions, coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve on the order of 100 realistic kinetic networks in parallel in the same time that standard implicit methods require to solve a single such network on a CPU. This orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
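A hedged sketch of the block-per-network mapping that lets on the order of 100 independent networks integrate concurrently is given below. The real algebraically stabilized explicit algorithms are far more involved; computeRates is a placeholder right-hand side, and a forward Euler step stands in for the actual update.

```cuda
#define NSPECIES 150

__device__ void computeRates(const float* y, float* dydt) {
    for (int s = 0; s < NSPECIES; ++s) dydt[s] = -y[s];  // placeholder kinetics
}

__global__ void integrateNetworks(float* Y, int nSteps, float dt) {
    float* y = Y + blockIdx.x * NSPECIES;  // one abundance vector per block
    __shared__ float dydt[NSPECIES];

    for (int step = 0; step < nSteps; ++step) {
        if (threadIdx.x == 0) computeRates(y, dydt);  // evaluate network RHS
        __syncthreads();
        for (int s = threadIdx.x; s < NSPECIES; s += blockDim.x)
            y[s] += dt * dydt[s];          // explicit update of this network
        __syncthreads();
    }
}
```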
Processing of the Neocognitron network for facial recognition in a high-performance GPU environment
Gustavo Poli Lameirão da Silva
2007-01-01
This work presents an implementation of the Neocognitron neural network using a high-performance computing architecture based on the GPU (Graphics Processing Unit). The Neocognitron is an artificial neural network, proposed by Fukushima and collaborators, consisting of several stages of neuron layers organized in two-dimensional matrices called cell planes. For the high-performance processing of the facial recognition application using the Neocognitron, the ...
SraTailor: graphical user interface software for processing and visualizing ChIP-seq data.
Oki, Shinya; Maehara, Kazumitsu; Ohkawa, Yasuyuki; Meno, Chikara
2014-12-01
Raw data from ChIP-seq (chromatin immunoprecipitation combined with massively parallel DNA sequencing) experiments are deposited in public databases as SRAs (Sequence Read Archives) that are publicly available to all researchers. However, to graphically visualize ChIP-seq data of interest, the corresponding SRAs must be downloaded and converted into BigWig format, a process that involves complicated command-line processing. This task requires users to possess skill with scripting languages and sequence data processing, a requirement that prevents a wide range of biologists from exploiting SRAs. To address these challenges, we developed SraTailor, a GUI (Graphical User Interface) software package that automatically converts an SRA into a BigWig-formatted file. Simplicity of use is one of the most notable features of SraTailor: entering an accession number of an SRA and clicking the mouse are the only steps required to obtain BigWig-formatted files and to graphically visualize the extents of reads at given loci. SraTailor is also able to make peak calls, generate files of other formats, process users' own data, and accept various command-line-like options. Therefore, this software makes ChIP-seq data fully exploitable by a wide range of biologists. SraTailor is freely available at http://www.devbio.med.kyushu-u.ac.jp/sra_tailor/, and runs on both Mac and Windows machines. © 2014 The Authors Genes to Cells © 2014 by the Molecular Biology Society of Japan and Wiley Publishing Asia Pty Ltd.
Systems Biology Graphical Notation: Process Description language Level 1 Version 1.3.
Moodie, Stuart; Le Novère, Nicolas; Demir, Emek; Mi, Huaiyu; Villéger, Alice
2015-09-04
The Systems Biology Graphical Notation (SBGN) is an international community effort for standardized graphical representations of biological pathways and networks. The goal of SBGN is to provide unambiguous pathway and network maps for readers with different scientific backgrounds, as well as to support efficient and accurate exchange of biological knowledge between different research communities, industry, and other players in systems biology. Three SBGN languages, Process Description (PD), Entity Relationship (ER) and Activity Flow (AF), allow for the representation of different aspects of biological and biochemical systems at different levels of detail. The SBGN Process Description language represents biological entities and the processes between these entities within a network. SBGN PD focuses on the mechanistic description and temporal dependencies of biological interactions and transformations. The nodes (elements) are split into entity nodes describing, e.g., metabolites, proteins, genes and complexes, and process nodes describing, e.g., reactions and associations. The edges (connections) provide descriptions of relationships (or influences) between the nodes, such as consumption, production, stimulation and inhibition. Among all three languages of SBGN, PD is the closest to the metabolic and regulatory pathways found in biological literature and textbooks, but its well-defined semantics offer superior precision in expressing biological knowledge.
Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S
2014-03-11
Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special-purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler graphics processing unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performance that can be well above that of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to GPU and Phi systems. Double-precision general matrix multiply operations are endemic in electronic structure calculations, especially in methods that include electron correlation, such as density functional theory, second-order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations occurring on the accelerator and/or the host. For data transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB, with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
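A hedged sketch of the size-based dispatch idea: small multiplies stay on the host, where PCI-e transfer would dominate, while large ones go to the GPU through CUBLAS. The crossover value and the naive host fallback are illustrative assumptions, not values from the paper.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Naive host fallback standing in for a tuned host BLAS call.
void hostDgemm(int n, const double* A, const double* B, double* C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k) s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

void dispatchDgemm(int n, const double* A, const double* B, double* C) {
    const int CROSSOVER = 1024;            // tunable, machine-dependent assumption
    if (n < CROSSOVER) { hostDgemm(n, A, B, C); return; }

    cublasHandle_t h;
    cublasCreate(&h);
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // CUBLAS is column-major; computing C^T = B^T * A^T by swapping the
    // operands yields the row-major product C = A * B.
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dB, n, dA, n, &beta, dC, n);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(h);
}
```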
GPU-accelerated FDTD modeling of radio-frequency field-tissue interactions in high-field MRI.
Chi, Jieru; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart
2011-06-01
The analysis of high-field RF field-tissue interactions requires high-performance finite-difference time-domain (FDTD) computing. Conventional CPU-based FDTD calculations offer limited computing performance in a PC environment. This study presents a graphics processing unit (GPU)-based parallel-computing framework, producing substantially boosted computing efficiency (a speedup of two orders of magnitude) at PC-level cost. Specific details of implementing the FDTD method on a GPU architecture are presented, and the new computational strategy has been successfully applied to the design of a novel 8-element transceive RF coil system at 9.4 T. Facilitated by the powerful GPU-FDTD computing, the new RF coil array offers optimized fields (averaging 25% improvement in sensitivity and 20% reduction in loop coupling compared with conventional array structures of the same size) for small-animal imaging with a robust RF configuration. The GPU-enabled acceleration paves the way for FDTD to be applied to both detailed forward modeling and inverse design of MRI coils, which were previously impractical.
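The per-cell parallelism such frameworks exploit can be sketched with a generic 2-D FDTD-style update, one thread per Yee-grid cell per time step. The TMz curl update and all names below are illustrative, not the authors' kernels.

```cuda
// Update the z-component of E from the curl of H on a 2-D grid.
__global__ void updateEz(float* Ez, const float* Hx, const float* Hy,
                         int nx, int ny, float c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx || j >= ny) return;

    int id = j * nx + i;
    // dEz/dt proportional to (dHy/dx - dHx/dy), discretized on the grid.
    Ez[id] += c * ((Hy[id] - Hy[id - 1]) - (Hx[id] - Hx[id - nx]));
}
```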
Correia, J. R. C. C. C.; Martins, C. J. A. P.
2017-10-01
Topological defects unavoidably form at symmetry breaking phase transitions in the early universe. To probe the parameter space of theoretical models and set tighter experimental constraints (exploiting the recent advances in astrophysical observations), one requires more and more demanding simulations, and therefore more hardware resources and computation time. Improving the speed and efficiency of existing codes is essential. Here we present a general purpose graphics-processing-unit implementation of the canonical Press-Ryden-Spergel algorithm for the evolution of cosmological domain wall networks. This is ported to the Open Computing Language standard, and as a consequence significant speedups are achieved both in two-dimensional (2D) and 3D simulations.
Solution of relativistic quantum optics problems using clusters of graphical processing units
Energy Technology Data Exchange (ETDEWEB)
Gordon, D.F., E-mail: daviel.gordon@nrl.navy.mil; Hafizi, B.; Helle, M.H.
2014-06-15
Numerical solution of relativistic quantum optics problems requires high-performance computing due to the rapid oscillations in a relativistic wavefunction. Clusters of graphics processing units are used to accelerate the computation of a time-dependent relativistic wavefunction in an arbitrary external potential. The stationary states in a Coulomb potential and uniform magnetic field are determined analytically and numerically, so that they can be used as initial conditions in fully time-dependent calculations. Relativistic energy levels in extreme magnetic fields are recovered as a means of validation. The relativistic ionization rate is computed for an ion illuminated by a laser field near the usual barrier-suppression threshold, and the ionizing wavefunction is displayed.
Pizette, Patrick; Govender, Nicolin; Wilke, Daniel N.; Abriak, Nor-Edine
2017-06-01
The use of the Discrete Element Method (DEM) for industrial civil engineering applications is currently limited due to the computational demands when large numbers of particles are considered. The graphics processing unit (GPU), with its highly parallelized hardware architecture, shows potential to enable the solution of civil engineering problems using discrete granular approaches. We demonstrate in this study the practical utility of a validated GPU-enabled DEM modeling environment for simulating industrial-scale granular problems. As an illustration, the flow discharge of storage silos using 8 and 17 million particles is considered. DEM simulations have been performed to investigate the influence of particle size (equivalent size for the 20/40-mesh gravel) and induced shear stress for two hopper shapes. The preliminary results indicate that the shape of the hopper significantly influences the discharge rates for the same material. Specifically, this work shows that GPU-enabled DEM modeling environments can model industrial-scale problems on a single portable computer within a day for 30 seconds of process time.
GPU accelerated flow solver for direct numerical simulation of turbulent flows
Energy Technology Data Exchange (ETDEWEB)
Salvadore, Francesco [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy); Bernardini, Matteo, E-mail: matteo.bernardini@uniroma1.it [Department of Mechanical and Aerospace Engineering, University of Rome ‘La Sapienza’ – via Eudossiana 18, 00184 Rome (Italy); Botti, Michela [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy)
2013-02-15
Graphics processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non-trivial efforts, even for high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier–Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that careful optimization work is crucial to getting the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing
Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.
2014-01-01
Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high-performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC, as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing for scaling bioinformatics applications as an alternative to owning large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond those of a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055
Energy Technology Data Exchange (ETDEWEB)
Loughry, Thomas A. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2015-02-01
As the volume of data acquired by space-based sensors increases, mission data compression/decompression and forward error correction code processing performance must likewise scale. This competency development effort was explored using the General Purpose Graphics Processing Unit (GPGPU) to accomplish high-rate Rice Decompression and high-rate Reed-Solomon (RS) decoding at the satellite mission ground station. Each algorithm was implemented and benchmarked on a single GPGPU. Distributed processing across one to four GPGPUs was also investigated. The results show that the GPGPU has considerable potential for performing satellite communication Data Signal Processing, with three times or better performance improvements and up to ten times reduction in cost over custom hardware, at least in the case of Rice Decompression and Reed-Solomon Decoding.
General purpose graphic processing unit implementation of adaptive pulse compression algorithms
Cai, Jingxiao; Zhang, Yan
2017-07-01
This study introduces a practical approach to implementing real-time signal processing algorithms for general surveillance radar based on NVIDIA graphics processing units (GPUs). The pulse compression algorithms are implemented using Compute Unified Device Architecture (CUDA) libraries such as the CUDA basic linear algebra subroutines and the CUDA fast Fourier transform library, which are adopted from open-source libraries and optimized for NVIDIA GPUs. For more advanced, adaptive processing algorithms such as adaptive pulse compression, customized kernel optimization is needed and is investigated here. A statistical optimization approach is developed for this purpose without needing much knowledge of the physical configurations of the kernels. It was found that the kernel optimization approach can significantly improve performance. Benchmark performance is compared with CPU performance in terms of processing acceleration. The proposed implementation framework can be used in various radar systems, including ground-based phased array radar, airborne sense-and-avoid radar, and aerospace surveillance radar.
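The frequency-domain core of pulse compression maps naturally onto the CUFFT library mentioned above: FFT the received signal, multiply by a precomputed matched-filter spectrum, inverse FFT. The sketch below is illustrative, not the paper's implementation; all names are assumptions.

```cuda
#include <cufft.h>

// Element-wise complex multiply of the signal spectrum by the filter spectrum.
__global__ void multiplySpectra(cufftComplex* sig, const cufftComplex* filt,
                                int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex a = sig[i], b = filt[i];  // complex multiply a * b
    sig[i].x = a.x * b.x - a.y * b.y;
    sig[i].y = a.x * b.y + a.y * b.x;
}

void pulseCompress(cufftComplex* dSignal, const cufftComplex* dFilterSpec,
                   int n) {
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);   // one 1-D complex transform

    cufftExecC2C(plan, dSignal, dSignal, CUFFT_FORWARD);
    multiplySpectra<<<(n + 255) / 256, 256>>>(dSignal, dFilterSpec, n);
    cufftExecC2C(plan, dSignal, dSignal, CUFFT_INVERSE);
    // Note: CUFFT's inverse transform is unnormalized; scale by 1/n if needed.
    cufftDestroy(plan);
}
```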
Accelerating epistasis analysis in human genetics with consumer graphics hardware.
Sinnott-Armstrong, Nicholas A; Greene, Casey S; Cancare, Fabio; Moore, Jason H
2009-07-24
Human geneticists are now capable of measuring more than one million DNA sequence variations from across the human genome. The new challenge is to develop computationally feasible methods capable of analyzing these data for associations with common human disease, particularly in the context of epistasis. Epistasis describes the situation where multiple genes interact in a complex non-linear manner to determine an individual's disease risk and is thought to be ubiquitous for common diseases. Multifactor Dimensionality Reduction (MDR) is an algorithm capable of detecting epistasis. An exhaustive analysis with MDR is often computationally expensive, particularly for high order interactions. This challenge has previously been met with parallel computation and expensive hardware. The option we examine here exploits commodity hardware designed for computer graphics. In modern computers Graphics Processing Units (GPUs) have more memory bandwidth and computational capability than Central Processing Units (CPUs) and are well suited to this problem. Advances in the video game industry have led to an economy of scale creating a situation where these powerful components are readily available at very low cost. Here we implement and evaluate the performance of the MDR algorithm on GPUs. Of primary interest are the time required for an epistasis analysis and the price to performance ratio of available solutions. We found that using MDR on GPUs consistently increased performance per machine over both a feature rich Java software package and a C++ cluster implementation. The performance of a GPU workstation running a GPU implementation reduces computation time by a factor of 160 compared to an 8-core workstation running the Java implementation on CPUs. This GPU workstation performs similarly to 150 cores running an optimized C++ implementation on a Beowulf cluster. Furthermore this GPU system provides extremely cost effective performance while leaving the CPU available for other
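For readers unfamiliar with MDR's inner loop, the dominant GPU work is tabulating case/control counts over genotype combinations for each SNP pair. Below is a minimal CUDA sketch of that counting step for a single pair, with genotypes coded 0/1/2 and phenotype 0/1; the data layout and names are assumptions, not the authors' implementation.

    #include <cuda_runtime.h>

    // Count cases and controls in each of the 3x3 genotype cells for one SNP
    // pair, the tabulation at the heart of an MDR evaluation. Genotypes are
    // coded 0/1/2 and phenotypes 0 (control) / 1 (case). counts has 18
    // entries, indexed as cell * 2 + phenotype.
    __global__ void mdr_count_pair(const unsigned char *snpA, const unsigned char *snpB,
                                   const unsigned char *phenotype, int nSubjects,
                                   unsigned int *counts) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nSubjects) {
            int cell = snpA[i] * 3 + snpB[i];
            atomicAdd(&counts[cell * 2 + phenotype[i]], 1u);
        }
    }

Accumulating into per-block shared-memory counters before one global atomicAdd per cell would cut atomic contention; the sketch omits this for brevity.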
International Nuclear Information System (INIS)
Huang Bormin; Mielikainen, Jarno; Oh, Hyunjong; Allen Huang, Hung-Lung
2011-01-01
Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles, as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is well suited to GPU implementation, since the radiances of many channels can be calculated in parallel. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI), launched in 2006 onboard the first of the European meteorological polar-orbiting satellites, METOP-A. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module, computing the regression predictors, takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs totalling 960 compute cores, delivering nearly 4 TFlops of theoretical peak performance. By massively parallelizing the second and third modules, we reached 364x
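The channel-parallel structure the authors exploit can be pictured with a minimal kernel in which each thread integrates the radiance for one of the 8461 IASI channels from precomputed level-to-space transmittances and Planck radiances. The simple two-term form (layer emission plus attenuated surface emission, no reflection or scattering), the data layout and all names are illustrative assumptions, not the model's actual code.

    #include <cuda_runtime.h>

    // One thread per spectral channel: accumulate the upwelling radiance from
    // per-layer Planck emission weighted by differences of level-to-space
    // transmittances, plus the attenuated surface term. tau and planck are
    // [nChan x nLev] channel-major arrays with layer 0 at the top.
    __global__ void radiance_per_channel(const float *tau, const float *planck,
                                         const float *surfPlanck, const float *emis,
                                         int nChan, int nLev, float *radiance) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= nChan) return;
        const float *t = tau + (size_t)c * nLev;
        const float *b = planck + (size_t)c * nLev;
        float rad = emis[c] * surfPlanck[c] * t[nLev - 1];  // surface emission seen from space
        float tAbove = 1.0f;                                // transmittance above the first layer
        for (int l = 0; l < nLev; ++l) {
            rad += b[l] * (tAbove - t[l]);                  // layer emission reaching space
            tAbove = t[l];
        }
        radiance[c] = rad;
    }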
Heckbert, Paul S
1994-01-01
Graphics Gems IV contains practical techniques for 2D and 3D modeling, animation, rendering, and image processing. The book presents articles on polygons and polyhedra; a mix of formulas, optimized algorithms, and tutorial information on the geometry of 2D, 3D, and n-D space; transformations; and parametric curves and surfaces. The text also includes articles on ray tracing; shading 3D models; and frame buffer techniques. Articles on image processing; algorithms for graphical layout; basic interpolation methods; and subroutine libraries for vector and matrix algebra are also demonstrated. Com
GPU-based RFA simulation for minimally invasive cancer treatment of liver tumours.
Mariappan, Panchatcharam; Weir, Phil; Flanagan, Ronan; Voglreiter, Philip; Alhonnoro, Tuomas; Pollari, Mika; Moche, Michael; Busse, Harald; Futterer, Jurgen; Portugaller, Horst Rupert; Sequeiros, Roberto Blanco; Kolesnik, Marina
2017-01-01
Radiofrequency ablation (RFA) is one of the most popular and well-standardized minimally invasive cancer treatments (MICT) for liver tumours, employed where surgical resection is contraindicated. Less-experienced interventional radiologists (IRs) require an appropriate planning tool for the treatment, to help avoid incomplete treatment and so reduce the tumour recurrence risk. Although a few tools are available to predict the ablation lesion geometry, the process is computationally expensive; in our implementation, patient-specific parameters are used to improve the accuracy of the lesion prediction. Advanced heterogeneous computing on personal computers, incorporating the graphics processing unit (GPU) and the central processing unit (CPU), is proposed to predict the ablation lesion geometry. Recent GPU technology is used to accelerate the finite element approximation of the Pennes bioheat equation and a three-state cell model. Patient-specific input parameters are used in the bioheat model to improve the accuracy of the predicted lesion. A fast GPU-based RFA solver is developed that performs most of the computational tasks on the GPU, while reserving the CPU for concurrent tasks such as lesion extraction based on the heat deposition at each finite element node. The solver takes less than 3 min for a treatment duration of 26 min. When the model receives patient-specific input parameters, the deviation between the real and predicted lesion is below 3 mm. A multi-centre retrospective study indicates that the fast RFA solver can provide the IR with the predicted lesion in the short period between clinical preparation of the patient and the start of the intervention.
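To make the parallelization concrete, here is a minimal sketch of one explicit update step of the Pennes bioheat equation on a uniform grid. The paper's solver is finite-element and far more elaborate, so the grid layout, coefficients and names here are assumptions for illustration only.

    #include <cuda_runtime.h>

    // One explicit finite-difference step of the Pennes bioheat equation,
    // rho*c*dT/dt = k*lap(T) + wb*cb*(Ta - T) + Q, on a uniform 3-D grid.
    // Boundary points are skipped (fixed-temperature boundary assumed).
    __global__ void pennes_step(const float *T, float *Tnew, const float *Q,
                                int nx, int ny, int nz, float dx, float dt,
                                float k, float rhoc, float wbcb, float Ta) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int l = blockIdx.z * blockDim.z + threadIdx.z;
        if (i <= 0 || j <= 0 || l <= 0 || i >= nx - 1 || j >= ny - 1 || l >= nz - 1)
            return;
        size_t id = ((size_t)l * ny + j) * nx + i;
        float lap = (T[id - 1] + T[id + 1] + T[id - nx] + T[id + nx]
                   + T[id - (size_t)nx * ny] + T[id + (size_t)nx * ny]
                   - 6.0f * T[id]) / (dx * dx);
        Tnew[id] = T[id] + dt / rhoc * (k * lap + wbcb * (Ta - T[id]) + Q[id]);
    }

The host code would ping-pong the T and Tnew buffers each step, with the lesion-extraction pass running concurrently on the CPU as the abstract describes.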
A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit
Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.
2013-12-01
Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed spectral imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially-available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a
Fast Gridding on Commodity Graphics Hardware
DEFF Research Database (Denmark)
Sørensen, Thomas Sangild; Schaeffter, Tobias; Noe, Karsten Østergaard
2007-01-01
The convolution step is by far the most time-consuming of the three steps (Table 1). Modern graphics cards (GPUs) can be utilised as fast parallel processors provided that algorithms are reformulated as parallel solutions. The purpose of this work is to test the hypothesis that a non-Cartesian reconstruction can be efficiently implemented on graphics hardware, giving a significant speedup compared to CPU-based alternatives. We present a novel GPU implementation of the convolution step that overcomes the memory bandwidth problems that have limited the speed of previous GPU gridding algorithms [2].
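For orientation, the gridding (convolution) step scatters each non-Cartesian sample onto nearby Cartesian grid points. A present-day CUDA sketch of that operation follows. Note that the paper predates floating-point atomics and reorganizes the computation as a gather to avoid write conflicts, so this scatter formulation is an illustrative assumption, as is the stand-in Gaussian kernel.

    #include <cuda_runtime.h>
    #include <math.h>

    // Scatter one non-Cartesian k-space sample onto the nearby Cartesian grid
    // points, weighted by a separable convolution kernel of half-width W.
    // atomicAdd resolves write conflicts when several samples hit one cell.
    __global__ void grid_samples(const float2 *pos, const float2 *data, int nSamples,
                                 float *gridRe, float *gridIm, int gridSize, int W) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nSamples) return;
        float kx = pos[s].x, ky = pos[s].y;   // sample position in grid units
        int x0 = max(0, (int)ceilf(kx - W)), x1 = min(gridSize - 1, (int)floorf(kx + W));
        int y0 = max(0, (int)ceilf(ky - W)), y1 = min(gridSize - 1, (int)floorf(ky + W));
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                float dx = x - kx, dy = y - ky;
                float w = expf(-(dx * dx + dy * dy));  // stand-in for Kaiser-Bessel
                atomicAdd(&gridRe[y * gridSize + x], w * data[s].x);
                atomicAdd(&gridIm[y * gridSize + x], w * data[s].y);
            }
    }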
International Nuclear Information System (INIS)
Liu Yuan; Du Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen Linqing
2012-01-01
We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of memory and hardware resources. Our code is tested on the CUDA ‘Fermi’ architecture in a GTX 480 graphics card and its performance is compared with a single core of an Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors, which in the past required more than 100 CPUs.
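The SPIIR technique approximates a matched filter by a sum of many first-order IIR filters. A minimal CUDA sketch of the per-filter recursion is given below, with one thread per filter; the coefficient layout, the per-filter delays and the deferred cross-filter summation are assumptions for illustration, not the authors' batched implementation.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    // Each thread runs one first-order IIR filter over the input series:
    // y[t] = a1 * y[t-1] + b0 * x[t - d], the building block that SPIIR sums
    // over many filters. Summation across filters is left to a later reduction.
    __global__ void spiir_filters(const float *x, int nSamples,
                                  const cuFloatComplex *a1, const cuFloatComplex *b0,
                                  const int *delay, int nFilters, cuFloatComplex *y) {
        int f = blockIdx.x * blockDim.x + threadIdx.x;
        if (f >= nFilters) return;
        cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
        for (int t = 0; t < nSamples; ++t) {
            int src = t - delay[f];
            float xt = (src >= 0) ? x[src] : 0.0f;
            // acc = a1[f] * acc + b0[f] * x[t - d]
            acc = cuCaddf(cuCmulf(a1[f], acc),
                          make_cuFloatComplex(b0[f].x * xt, b0[f].y * xt));
            y[(size_t)f * nSamples + t] = acc;
        }
    }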
Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.
Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji
2015-12-01
A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of the human body exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm reduces the solving time by a factor of 3.5 and the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using an asynchronous concurrent execution design. The communication overhead is well hidden, so that the acceleration scales nearly linearly with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards is up to 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large-dimensional problems efficiently: a whole adult body discretized at 1-mm resolution can be solved in just a few minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases at a resolution that meets the requirements of international dosimetry guidelines.
An implementation of the direct-forcing immersed boundary method using GPU power
Directory of Open Access Journals (Sweden)
Bulent Tutkun
2017-01-01
A graphics processing unit (GPU) is utilized to apply the direct-forcing immersed boundary method. The code running on the GPU is generated with the help of the Compute Unified Device Architecture (CUDA). The first and second spatial derivatives of the incompressible Navier-Stokes equations are discretized by sixth-order central compact finite-difference schemes. Two flow fields are simulated. The first test case is the flow around a square cylinder, for which the results provide good estimates of the wake-region mechanics and vortex shedding. The second test case, the flow around a circular cylinder, was selected to better understand the effect of sharp corners on the force coefficients; no difficulties in estimating the force coefficients were observed for the circular cylinder. Additionally, the calculation times for the solution of the Poisson equation are compared with values for other CPUs and GPUs from the literature. Approximately 3× and 20× speedups are achieved in comparison with a GPU implementation based on the CUSP library and with a CPU implementation, respectively. CUSP is an open-source library for sparse linear algebra and graph computations on CUDA.
A GPU Implementation of Local Search Operators for Symmetric Travelling Salesman Problem
Directory of Open Access Journals (Sweden)
Juraj Fosin
2013-06-01
The Travelling Salesman Problem (TSP) is one of the most studied combinatorial optimization problems and is significant in many practical transportation applications. The TSP is NP-hard and requires large computational power to be solved by exact algorithms. In the past few years, the fast development of general-purpose Graphics Processing Units (GPUs) has brought huge improvements in decreasing application execution times. In this paper, we implement 2-opt and 3-opt local search operators for solving the TSP on the GPU using CUDA. The novelty presented in this paper is a new parallel iterated local search approach with 2-opt and 3-opt operators for the symmetric TSP, optimized for execution on GPUs. With our implementation, large TSP problems (up to 85,900 cities) can be solved using the GPU. We show that our GPU implementation can be up to 20x faster without loss of solution quality for all TSPlib problems as well as for our CRO TSP problem.
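The key data-parallel step in a GPU 2-opt search is evaluating the tour-length change of every candidate edge exchange. A minimal CUDA sketch follows, one thread per (i, j) pair on a 2-D grid; the tour representation and the separate reduction that picks the best (most negative) delta are illustrative assumptions.

    #include <cuda_runtime.h>
    #include <math.h>

    __device__ float dist(float2 a, float2 b) {
        return sqrtf((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
    }

    // Tour-length change of a 2-opt move that reverses the segment between
    // positions i+1 and j: new edges (i,j) and (i+1,j+1) replace edges
    // (i,i+1) and (j,j+1). delta is an n x n scratch matrix.
    __global__ void two_opt_delta(const float2 *city, const int *tour, int n,
                                  float *delta) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= n - 2 || j >= n - 1 || j <= i + 1) return;  // non-adjacent edges only
        float2 a = city[tour[i]], b = city[tour[i + 1]];
        float2 c = city[tour[j]], d = city[tour[(j + 1) % n]];
        delta[(size_t)i * n + j] = dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d);
    }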
Printing--Graphic Arts--Graphic Communications
Hauenstein, A. Dean
1975-01-01
Recently, "graphic arts" has shifted from printing skills to a conceptual approach of production processes. "Graphic communications" must embrace the total system of communication through graphic media, to serve broad career education purposes; students taught concepts and principles can be flexible and adaptive. The author…
A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions
Liang, Yihao; Xing, Xiangjun; Li, Yaohang
2017-06-01
In this work we present an efficient implementation of canonical Monte Carlo simulation for Coulomb many-body systems on graphics processing units (GPUs). Our method takes advantage of the GPU's Single Instruction, Multiple Data (SIMD) architecture and adopts the sequential updating scheme of the Metropolis algorithm. It makes no approximation in the computation of energy and reaches a remarkable 440-fold speedup compared with the serial implementation on a CPU. We further use this method to simulate primitive-model electrolytes and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of the constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
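In a sequential-updating Metropolis scheme for Coulomb systems, the expensive step is the O(N) energy change of each single-particle trial move, which maps naturally onto a parallel reduction. The following minimal CUDA sketch (open boundaries, unscreened Coulomb, illustrative names; launch with a power-of-two block size and blockDim.x * sizeof(float) shared bytes) computes per-block partial sums of that energy change; the host sums the partials and applies the acceptance test.

    #include <cuda_runtime.h>
    #include <math.h>

    // Energy change for moving particle k to newPos: each thread accumulates
    // the moved particle's interaction with a strided subset of the others,
    // followed by a shared-memory tree reduction. p holds (x, y, z, charge).
    __global__ void trial_move_dE(const float4 *p, int n, int k, float3 newPos,
                                  float *blockSum) {
        extern __shared__ float sh[];
        float4 pk = p[k];
        float local = 0.0f;
        for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n;
             j += gridDim.x * blockDim.x) {
            if (j == k) continue;
            float4 pj = p[j];
            float dxo = pk.x - pj.x, dyo = pk.y - pj.y, dzo = pk.z - pj.z;
            float dxn = newPos.x - pj.x, dyn = newPos.y - pj.y, dzn = newPos.z - pj.z;
            local += pk.w * pj.w * (rsqrtf(dxn * dxn + dyn * dyn + dzn * dzn)
                                  - rsqrtf(dxo * dxo + dyo * dyo + dzo * dzo));
        }
        sh[threadIdx.x] = local;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
            if (threadIdx.x < s) sh[threadIdx.x] += sh[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) blockSum[blockIdx.x] = sh[0];
    }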
SU-E-P-59: A Graphical Interface for XCAT Phantom Configuration, Generation and Processing
Energy Technology Data Exchange (ETDEWEB)
Myronakis, M; Cai, W; Dhou, S; Cifter, F; Lewis, J [Brigham and Women’s Hospital, Boston, MA (United States); Hurwitz, M [Newton, MA (United States)
2015-06-15
Purpose: To design a comprehensive open-source, publicly available, graphical user interface (GUI) to facilitate the configuration, generation, processing and use of the 4D Extended Cardiac-Torso (XCAT) phantom. Methods: The XCAT phantom includes over 9000 anatomical objects as well as respiratory, cardiac and tumor motion. It is widely used for research studies in medical imaging and radiotherapy. The phantom generation process involves the configuration of a text script to parameterize the geometry, motion, and composition of the whole body and objects within it, and to generate simulated PET or CT images. To avoid the need for manual editing or script writing, our MATLAB-based GUI uses slider controls, drop-down lists, buttons and graphical text input to parameterize and process the phantom. Results: Our GUI can be used to: a) generate parameter files; b) generate the voxelized phantom; c) combine the phantom with a lesion; d) display the phantom; e) produce average and maximum intensity images from the phantom output files; f) incorporate irregular patient breathing patterns; and g) generate DICOM files containing phantom images. The GUI provides local help information using tool-tip strings on the currently selected phantom, minimizing the need for external documentation. The DICOM generation feature is intended to simplify the process of importing the phantom images into radiotherapy treatment planning systems or other clinical software. Conclusion: The GUI simplifies and automates the use of the XCAT phantom for imaging-based research projects in medical imaging or radiotherapy. This has the potential to accelerate research conducted with the XCAT phantom, or to ease the learning curve for new users. This tool does not include the XCAT phantom software itself. We would like to acknowledge funding from MRA, Varian Medical Systems Inc.
GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES
Energy Technology Data Exchange (ETDEWEB)
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal [Department of Astronomy, University of Arizona, 933 N. Cherry Ave., Tucson, AZ 85721 (United States)
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOPS (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae for their dependence on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way and in M87 with the Event Horizon Telescope.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces a possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using synthetic data. Experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation at the group-level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screening tool to select the most likely model, and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
Real-time colouring and filtering with graphics shaders
Vohl, D.; Fluke, C. J.; Barnes, D. G.; Hassan, A. H.
2017-11-01
Despite the popularity of the Graphics Processing Unit (GPU) for general purpose computing, one should not forget about the practicality of the GPU for fast scientific visualization. As astronomers have increasing access to three-dimensional (3D) data from instruments and facilities like integral field units and radio interferometers, visualization techniques such as volume rendering offer means to quickly explore spectral cubes as a whole. As most 3D visualization techniques have been developed in fields of research like medical imaging and fluid dynamics, many transfer functions are not optimal for astronomical data. We demonstrate how transfer functions and graphics shaders can be exploited to provide new astronomy-specific explorative colouring methods. We present 12 shaders, including four novel transfer functions specifically designed to produce intuitive and informative 3D visualizations of spectral cube data. We compare their utility to classic colour mapping. The remaining shaders highlight how common computation like filtering, smoothing and line ratio algorithms can be integrated as part of the graphics pipeline. We discuss how this can be achieved by utilizing the parallelism of modern GPUs along with a shading language, letting astronomers apply these new techniques at interactive frame rates. All shaders investigated in this work are included in the open source software shwirl (Vohl 2017).
International Nuclear Information System (INIS)
Gao Wenhuan; Fu Changqing; Kang Kejun
1993-01-01
X Window is a network-oriented and network-transparent windowing system, now dominant in the Unix domain. Object-oriented programming technology can remarkably improve the extensibility of a software system. An introduction to graphical user interfaces is given, and the development of an application-independent graphical user interface for a radiation information processing system, based on X Window and object-oriented programming technology, is briefly described.
Accelerating Pseudo-Random Number Generator for MCNP on GPU
Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu
2010-09-01
Pseudo-random number generators (PRNGs) are used intensively in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computations. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) requires a long period, high quality, flexible jump-ahead and high speed. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processing Units (GPUs) using the CUDA programming model. Results show that speedups of 3.80 to 8.10 times are achieved compared with 4- to 6-core CPUs, and that more than 679.18 million double-precision random numbers can be generated per second on the GPU.
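A flexible jump (skip-ahead) is what lets every GPU thread draw from its own disjoint subsequence of a linear congruential generator. Below is a minimal sketch of the standard O(log n) skip-ahead for a 64-bit LCG; the constants shown are Knuth's MMIX multiplier and increment, used here for illustration rather than MCNP's actual generator parameters.

    #include <stdint.h>

    // Skip-ahead for a 64-bit LCG x_{k+1} = A*x_k + C (mod 2^64): advance the
    // state by n steps in O(log n) via the doubling recurrence, so each GPU
    // thread can jump directly to its own subsequence. Illustrative constants
    // (Knuth's MMIX LCG): A = 6364136223846793005ULL, C = 1442695040888963407ULL.
    __host__ __device__ uint64_t lcg_skip(uint64_t x, uint64_t n, uint64_t A, uint64_t C) {
        uint64_t G = 1, I = 0;       // composed multiplier and increment for n steps
        while (n) {
            if (n & 1) { G = G * A; I = I * A + C; }
            C = (A + 1) * C;         // increment for a doubled step count
            A = A * A;               // multiplier for a doubled step count
            n >>= 1;
        }
        return G * x + I;            // unsigned overflow gives mod 2^64 for free
    }

Thread t then seeds its private stream with lcg_skip(seed, (uint64_t)t * samplesPerThread, A, C) and iterates the plain recurrence locally.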
Event- and Time-Driven Techniques Using Parallel CPU-GPU Co-processing for Spiking Neural Networks.
Naveros, Francisco; Garrido, Jesus A; Carrillo, Richard R; Ros, Eduardo; Luque, Niceto R
2017-01-01
Modeling and simulating the neural structures which make up our central neural system is instrumental for deciphering the computational neural cues beneath. Higher levels of biological plausibility usually impose higher levels of complexity in mathematical modeling, from neural to behavioral levels. This paper focuses on overcoming the simulation problems (accuracy and performance) derived from using higher levels of mathematical complexity at a neural level. This study proposes different techniques for simulating neural models that hold incremental levels of mathematical complexity: leaky integrate-and-fire (LIF), adaptive exponential integrate-and-fire (AdEx), and Hodgkin-Huxley (HH) neural models (ranged from low to high neural complexity). The studied techniques are classified into two main families depending on how the neural-model dynamic evaluation is computed: the event-driven or the time-driven families. Whilst event-driven techniques pre-compile and store the neural dynamics within look-up tables, time-driven techniques compute the neural dynamics iteratively during the simulation time. We propose two modifications for the event-driven family: a look-up table recombination to better cope with the incremental neural complexity together with a better handling of the synchronous input activity. Regarding the time-driven family, we propose a modification in computing the neural dynamics: the bi-fixed-step integration method. This method automatically adjusts the simulation step size to better cope with the stiffness of the neural model dynamics running in CPU platforms. One version of this method is also implemented for hybrid CPU-GPU platforms. Finally, we analyze how the performance and accuracy of these modifications evolve with increasing levels of neural complexity. We also demonstrate how the proposed modifications which constitute the main contribution of this study systematically outperform the traditional event- and time-driven techniques under
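As a minimal illustration of the time-driven family for the simplest model in the study, the kernel below advances a population of LIF neurons by one fixed step of exponential Euler, one thread per neuron. Parameter names and the spike-flag output are assumptions, and the bi-fixed-step method the authors propose would additionally adapt the step size.

    #include <cuda_runtime.h>
    #include <math.h>

    // Time-driven update of leaky integrate-and-fire neurons: advance the
    // membrane potential by one step of tau*dV/dt = -(V - Vrest) + R*I (exact
    // for piecewise-constant input), then apply threshold and reset. The
    // event-driven alternative would replace this with a look-up table.
    __global__ void lif_step(float *V, const float *I, unsigned char *spiked, int n,
                             float dt, float tau, float Vrest, float R,
                             float Vth, float Vreset) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float vInf = Vrest + R * I[i];                    // steady state for constant input
        float v = vInf + (V[i] - vInf) * expf(-dt / tau);
        spiked[i] = (v >= Vth);
        V[i] = spiked[i] ? Vreset : v;
    }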
Accelerated fluctuation analysis by graphic cards and complex pattern formation in financial markets
International Nuclear Information System (INIS)
Preis, Tobias; Virnau, Peter; Paul, Wolfgang; Schneider, Johannes J
2009-01-01
The compute unified device architecture is an almost conventional programming approach for managing computations on a graphics processing unit (GPU) as a data-parallel computing device. With a maximum number of 240 cores in combination with a high memory bandwidth, a recent GPU offers resources for computational physics. We apply this technology to methods of fluctuation analysis, which includes determination of the scaling behavior of a stochastic process and the equilibrium autocorrelation function. Additionally, the recently introduced pattern formation conformity (Preis T et al 2008 Europhys. Lett. 82 68005), which quantifies pattern-based complex short-time correlations of a time series, is calculated on a GPU and analyzed in detail. Results are obtained up to 84 times faster than on a current central processing unit core. When we apply this method to high-frequency time series of the German BUND future, we find significant pattern-based correlations on short time scales. Furthermore, an anti-persistent behavior can be found on short time scales. Additionally, we compare the recent GPU generation, which provides a theoretical peak performance of up to roughly 10^12 floating point operations per second, with the previous one.
A GPU-based incompressible Navier-Stokes solver on moving overset grids
Chandar, Dominic D. J.; Sitaraman, Jayanarayanan; Mavriplis, Dimitri J.
2013-07-01
In pursuit of obtaining high-fidelity solutions to the fluid flow equations in a short span of time, graphics processing units (GPUs), which were originally intended for gaming applications, are currently being used to accelerate computational fluid dynamics (CFD) codes. With a high peak throughput of about 1 TFLOPS on a PC, GPUs seem to be favourable for many high-resolution computations. One such computation that involves a lot of number crunching is computing time-accurate flow solutions past moving bodies. The aim of the present paper is thus to discuss the development of a flow solver on unstructured and overset grids and its implementation on GPUs. In its present form, the flow solver solves the incompressible fluid flow equations on unstructured/hybrid/overset grids using a fully implicit projection method. The resulting discretised equations are solved using a matrix-free Krylov solver built on several GPU kernels such as gradient, Laplacian and reduction. Some of the simple arithmetic vector calculations are implemented using the CU++ approach ("CU++: An Object Oriented Framework for Computational Fluid Dynamics Applications using Graphics Processing Units", Journal of Supercomputing, 2013, doi:10.1007/s11227-013-0985-9), where GPU kernels are automatically generated at compile time. Results are presented for two- and three-dimensional computations on static and moving grids.
High energy electromagnetic particle transportation on the GPU
Energy Technology Data Exchange (ETDEWEB)
Canal, P. [Fermilab; Elvira, D. [Fermilab; Jun, S. Y. [Fermilab; Kowalkowski, J. [Fermilab; Paterno, M. [Fermilab; Apostolakis, J. [CERN
2014-01-01
We present massively parallel high-energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as the physics processes applied to the particles copiously produced by primary collisions and secondary interactions. The recent advent of many-core and accelerated processor architectures provides a variety of concurrent programming models applicable not only to high-performance parallel computing, but also to conventional compute-intensive applications such as HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner, optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for GPU execution. Program kernels written in C/C++ are designed to be compatible with both CUDA and OpenCL, with the aim of being generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating-point accuracy, random number generation, data structures, kernel divergence and register spills are also considered. A performance evaluation of the relative speedup compared to the corresponding sequential execution on a CPU is presented as well.
Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I
2015-01-01
This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the computing power of current commodity graphics processors to accelerate the generation of high-resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. Benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048^2 and 4096^2 at interactive and semi-interactive frame rates, respectively, using an NVIDIA GeForce GTX 970 device.
GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.
Directory of Open Access Journals (Sweden)
Ahmed Shamsul Arefin
BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data call for extreme computational power and special computing facilities (i.e. supercomputers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device's internal memory can pose a new problem of scalability. Efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Being very simple and straightforward, the performance of the kNN search degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA-enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with a CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest-neighbour computation in large-scale networks. Source code and the software tool are available under the GNU Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
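The core data-parallel step in GPU-FS-kNN-style searches is filling a chunk of the query-to-reference distance matrix. A minimal CUDA sketch follows (one thread per pair, squared Euclidean distance); the chunk sizes, layouts and the subsequent k-selection pass are illustrative assumptions.

    #include <cuda_runtime.h>

    // Fill one chunk of the query x reference squared-Euclidean distance
    // matrix, one thread per (query, reference) pair. Selecting the k smallest
    // entries per row is a separate pass; chunking keeps the matrix within
    // device memory, which is the scalability issue the paper targets.
    __global__ void chunk_distances(const float *query, const float *ref,
                                    int nQuery, int nRef, int dim, float *dist) {
        int q = blockIdx.y * blockDim.y + threadIdx.y;
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= nQuery || r >= nRef) return;
        float acc = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = query[(size_t)q * dim + d] - ref[(size_t)r * dim + d];
            acc += diff * diff;
        }
        dist[(size_t)q * nRef + r] = acc;
    }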
Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.
Zheng, Mo; Li, Xiaoxia; Guo, Li
2013-04-01
Reactive force field (ReaxFF), a recent and novel bond-order potential, allows reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computationally intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its time-step being one order of magnitude smaller than that of classical MD, all of which pose significant computational challenges to reaching spatio-temporal scales of nanometers and nanoseconds. The very recent advances in graphics processing units (GPUs) provide not only highly favorable performance for GPU-enabled MD programs compared with CPU implementations, but also an opportunity to cope with the computing-power- and memory-demanding nature of ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU-enabled ReaxFF MD program, with performance significantly surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with an NVIDIA C2050 GPU for coal pyrolysis simulation systems with atom counts ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times over Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times over Lammps' C codes based on PuReMD, in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could serve as a new and efficient computational tool for exploring very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.
GPU-accelerated depth map generation for X-ray simulations of complex CAD geometries
Grandin, Robert J.; Young, Gavin; Holland, Stephen D.; Krishnamurthy, Adarsh
2018-04-01
Interactive x-ray simulations of complex computer-aided design (CAD) models can provide valuable insights for better interpretation of defect signatures such as porosity in x-ray CT images. Generating the depth map along a particular direction for a given CAD geometry is the most compute-intensive step in x-ray simulations. We have developed a GPU-accelerated method for real-time generation of depth maps of complex CAD geometries. We preprocess complex components designed using commercial CAD systems with a custom CAD module and convert them into a fine user-defined surface tessellation. Our CAD module can be used by different simulators and can handle complex geometries, including those that arise from complex castings and composite structures. We then use a parallel algorithm running on a graphics processing unit (GPU) to convert the finely tessellated CAD model into a voxelized representation. The voxelized representation enables heterogeneous modeling of the volume enclosed by the CAD model, by assigning heterogeneous material properties to specific regions. The depth maps are generated from this voxelized representation with the help of a GPU-accelerated ray-casting algorithm. The GPU-accelerated ray-casting method enables interactive (>60 frames per second) generation of the depth maps of complex CAD geometries. This enables arbitrary rotation and slicing of the CAD model, leading to better interpretation of the x-ray images by the user. In addition, the depth maps can be used to aid directly in CT reconstruction algorithms.
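A minimal CUDA sketch of the depth-map generation pattern is shown below: one thread per detector pixel marches a ray through the voxelized model and accumulates the path length inside material. The fixed-step sampling, binary occupancy grid and detector parameterization are simplifying assumptions; the authors' ray caster is more sophisticated (a 3-D DDA-style traversal would avoid the fixed step).

    #include <cuda_runtime.h>
    #include <math.h>

    // One thread per detector pixel: step along the ray in voxel units and
    // accumulate the thickness of occupied material. origin/du/dv define the
    // detector plane, dir the (unit) ray direction, all in voxel coordinates.
    __global__ void depth_map(const unsigned char *vox, int nx, int ny, int nz,
                              float3 origin, float3 du, float3 dv, float3 dir,
                              int width, int height, float step, int nSteps,
                              float *depth) {
        int u = blockIdx.x * blockDim.x + threadIdx.x;
        int v = blockIdx.y * blockDim.y + threadIdx.y;
        if (u >= width || v >= height) return;
        float3 p = make_float3(origin.x + u * du.x + v * dv.x,
                               origin.y + u * du.y + v * dv.y,
                               origin.z + u * du.z + v * dv.z);
        float thickness = 0.0f;
        for (int s = 0; s < nSteps; ++s) {
            int i = (int)floorf(p.x), j = (int)floorf(p.y), k = (int)floorf(p.z);
            if (i >= 0 && j >= 0 && k >= 0 && i < nx && j < ny && k < nz &&
                vox[((size_t)k * ny + j) * nx + i])
                thickness += step;                       // inside material
            p.x += dir.x * step; p.y += dir.y * step; p.z += dir.z * step;
        }
        depth[(size_t)v * width + u] = thickness;
    }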
GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening
Maureira-Fredes, Cristián; Amaro-Seoane, Pau
2018-01-01
A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct summation of N gravitational forces is a complex problem with no analytical solution that can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method; with different numerical techniques and special-purpose hardware it can be used to speed up the calculations, but these methods tend to be computationally slow and cumbersome to work with. We present a new graphics-processing-unit (GPU) direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY is highly modular, allowing users to readily introduce new physics; it exploits available computational resources and will be maintained with regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.
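The computational core of such an integrator is the direct summation of softened gravitational accelerations, sketched below in CUDA with one thread per particle; with a positive softening eps2 the self-term vanishes naturally. This plain O(N^2) loop omits the shared-memory tiling and the Hermite jerk evaluation a real implementation would add.

    #include <cuda_runtime.h>
    #include <math.h>

    // Direct-summation gravitational acceleration with Plummer softening:
    // thread i sums the contributions of all N bodies. pos holds (x, y, z, mass);
    // for j == i the displacement is zero, so the self-term contributes nothing
    // as long as eps2 > 0.
    __global__ void nbody_acc(const float4 *pos, float3 *acc, int n, float eps2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 pi = pos[i];
        float3 a = make_float3(0.0f, 0.0f, 0.0f);
        for (int j = 0; j < n; ++j) {
            float4 pj = pos[j];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + eps2;  // softened distance^2
            float inv = rsqrtf(r2);
            float w = pj.w * inv * inv * inv;               // m_j / r^3
            a.x += w * dx; a.y += w * dy; a.z += w * dz;
        }
        acc[i] = a;
    }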
GPU-based simulation of optical propagation through turbulence for active and passive imaging
Monnier, Goulven; Duval, François-Régis; Amram, Solène
2014-10-01
IMOTEP is a GPU-based (Graphical Processing Units) software package relying on a fast parallel implementation of Fresnel diffraction through successive phase screens. Its applications include active imaging, laser telemetry and passive imaging through turbulence with anisoplanatic spatial and temporal fluctuations. Thanks to the parallel implementation on GPU, speedups ranging from 40X to 70X are achieved. The present paper gives a brief overview of the IMOTEP models, algorithms, implementation and user interface. It then focuses on major improvements recently brought to the anisoplanatic imaging simulation method. Previously, we took advantage of the computational power offered by the GPU to develop a simulation method based on large series of deterministic realisations of the PSF distorted by turbulence. The phase screen propagation algorithm, by reproducing higher moments of the incident wavefront distortion, provides realistic PSFs. However, we first used a coarse Gaussian model to fit the numerical PSFs and characterise their spatial statistics through only 3 parameters (the two-dimensional displacement of the centroid and the width). This approach was unable to reproduce effects related to the details of the PSF structure, especially the "speckles" that lead to prominent high-frequency content in short-exposure images. To overcome this limitation, we recently implemented a new empirical model of the PSF, based on Principal Components Analysis (PCA), intended to capture most of the PSF complexity. The GPU implementation allows the numerous (up to several hundred) principal components typically required under the strong turbulence regime to be estimated and handled efficiently. A first, demanding computational step involves PCA, phase screen propagation and covariance estimation. In a second step, realistic instantaneous images, fully accounting for anisoplanatic effects, are generated quickly. Preliminary results are presented.
Lee, Kenneth K C; Mariampillai, Adrian; Yu, Joe X Z; Cadotte, David W; Wilson, Brian C; Standish, Beau A; Yang, Victor X D
2012-07-01
Advances in swept-source laser technology continue to increase the imaging speed of swept-source optical coherence tomography (SS-OCT) systems. These fast imaging speeds are ideal for microvascular detection schemes, such as speckle variance (SV), where interframe motion can cause severe imaging artifacts and loss of vascular contrast. However, full utilization of the laser scan speed has been hindered by the computationally intensive signal processing required by SS-OCT and SV calculations. Using a commercial graphics processing unit that has been optimized for parallel data processing, we report a complete high-speed SS-OCT platform capable of real-time data acquisition, processing, display, and saving at 108,000 lines per second. Subpixel image registration of structural images was performed in real time prior to SV calculations, in order to reduce decorrelation from stationary structures induced by bulk tissue motion. The viability of the system was successfully demonstrated in a high-bulk-tissue-motion scenario, imaging the human fingernail root, where SV images (512 × 512 pixels, n = 4) were displayed at 54 frames per second.
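The speckle variance computation itself is embarrassingly parallel: for each pixel, the intensity variance across a small gate of registered structural frames. A minimal CUDA sketch follows; the frame-major layout and gate size are assumptions (the paper uses N = 4), and registration is assumed to have been applied already.

    #include <cuda_runtime.h>

    // Interframe speckle variance: for each pixel, the variance of intensity
    // across a gate of nFrames registered structural frames, stored
    // contiguously frame by frame.
    __global__ void speckle_variance(const float *frames, int nFrames,
                                     int nPix, float *sv) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nPix) return;
        float mean = 0.0f;
        for (int f = 0; f < nFrames; ++f) mean += frames[(size_t)f * nPix + p];
        mean /= nFrames;
        float var = 0.0f;
        for (int f = 0; f < nFrames; ++f) {
            float d = frames[(size_t)f * nPix + p] - mean;
            var += d * d;
        }
        sv[p] = var / nFrames;
    }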
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF
Mielikainen, J.; Huang, B.; Huang, A.
2011-12-01
The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within WRF, with different numbers of simulated hydrometeor classes and methods for estimating their sizes, fall speeds, distributions and densities. The Stony Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever-increasing transistor count to add more processor cores, and are therefore well suited to massively data-parallel processing with high floating-point arithmetic intensity. It is thus imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language that offers direct programming of GPUs; it is designed so that its constructs allow natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of the CPU. CUDA takes a bottom-up view of parallelism in which the thread is the atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set at 12 km resolution for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has
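The three computational phases described above can be made concrete with a tiny host-side sketch; saturation_adjust is a hypothetical stand-in kernel name, as the real SBU-YLIN port is far larger.

    #include <cuda_runtime.h>

    // Placeholder microphysics kernel: clamp mixing ratios to be non-negative.
    __global__ void saturation_adjust(float *q, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) q[i] = fmaxf(q[i], 0.0f);
    }

    // The three phases: copy input to GPU global memory, run the kernel,
    // copy results back to CPU memory.
    void run_on_gpu(const float *hostIn, float *hostOut, int n) {
        float *d = nullptr;
        size_t bytes = (size_t)n * sizeof(float);
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, hostIn, bytes, cudaMemcpyHostToDevice);   // phase 1
        saturation_adjust<<<(n + 255) / 256, 256>>>(d, n);      // phase 2
        cudaMemcpy(hostOut, d, bytes, cudaMemcpyDeviceToHost);  // phase 3
        cudaFree(d);
    }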
García Sánchez, Jesús N; Rodríguez Pérez, Celestino
2007-05-01
An experimental study of the influence of the recording interval and a graphic organizer on the processes of written composition and on the final product is presented. We studied 326 participants, aged 10 to 16 years, by means of a nested design. Two groups were compared: one group was aided in the writing process with a graphic organizer and the other was not. Each group was subdivided into two further groups: one with a mean recording interval of 45 seconds and the other with an approximately 90-second recording interval in a writing log. The results showed that the group aided by the graphic organizer obtained better results in both the writing processes and the written product, and that the groups assessed with an average interval of 45 seconds obtained worse results. Implications for educational practice are discussed, and limitations and future perspectives are commented on.
ODYSSEY: A PUBLIC GPU-BASED CODE FOR GENERAL RELATIVISTIC RADIATIVE TRANSFER IN KERR SPACETIME
Energy Technology Data Exchange (ETDEWEB)
Pu, Hung-Yi [Institute of Astronomy and Astrophysics, Academia Sinica, 11F of Astronomy-Mathematics Building, AS/NTU No. 1, Taipei 10617, Taiwan (China); Yun, Kiyun; Yoon, Suk-Jin [Department of Astronomy and Center for Galaxy Evolution Research, Yonsei University, Seoul 120-749 (Korea, Republic of); Younsi, Ziri [Institut für Theoretische Physik, Max-von-Laue-Straße 1, D-60438 Frankfurt am Main (Germany)
2016-04-01
General relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime are an essential tool for determining the images, spectra, and light curves from matter in the vicinity of black holes. Such studies are especially important for ongoing and upcoming millimeter/submillimeter very long baseline interferometry observations of the supermassive black holes at the center of the Milky Way (Sgr A*) and in M87. To this end we introduce Odyssey, a graphics processing unit (GPU) based code for ray tracing and radiative transfer in the Kerr spacetime. On a single GPU, the performance of Odyssey can exceed 1 ns per photon, per Runge–Kutta integration step. Odyssey is publicly available, fast, accurate, and flexible enough to be modified to suit the specific needs of new users. Along with a Graphical User Interface powered by a video-accelerated display architecture, we also present an educational software tool, Odyssey-Edu, for showing in real time how null geodesics around a Kerr black hole vary as a function of black hole spin and angle of incidence onto the black hole.
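The per-photon Runge-Kutta integration pattern behind the quoted "1 ns per photon, per step" figure can be sketched as one thread advancing one photon state per kernel launch. The right-hand side below is a deliberately trivial placeholder, not the Kerr geodesic equations, and the 4-component state is a simplification for illustration.

    #include <cuda_runtime.h>

    struct State { float x, y, z, w; };

    // Placeholder dynamics: a simple rotation in the (x, y) plane stands in
    // for the coupled position/momentum geodesic equations.
    __device__ State rhs(const State &s) {
        State d = { -s.y, s.x, 0.0f, 0.0f };
        return d;
    }

    __device__ State axpy(const State &a, float h, const State &b) {
        State r = { a.x + h * b.x, a.y + h * b.y, a.z + h * b.z, a.w + h * b.w };
        return r;
    }

    // One classical fourth-order Runge-Kutta step per photon, one thread each.
    __global__ void rk4_step(State *photon, int n, float h) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        State s = photon[i];
        State k1 = rhs(s);
        State k2 = rhs(axpy(s, 0.5f * h, k1));
        State k3 = rhs(axpy(s, 0.5f * h, k2));
        State k4 = rhs(axpy(s, h, k3));
        photon[i].x = s.x + h / 6.0f * (k1.x + 2.0f * k2.x + 2.0f * k3.x + k4.x);
        photon[i].y = s.y + h / 6.0f * (k1.y + 2.0f * k2.y + 2.0f * k3.y + k4.y);
        photon[i].z = s.z + h / 6.0f * (k1.z + 2.0f * k2.z + 2.0f * k3.z + k4.z);
        photon[i].w = s.w + h / 6.0f * (k1.w + 2.0f * k2.w + 2.0f * k3.w + k4.w);
    }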
Hammitzsch, M.; Spazier, J.; Reißland, S.
2014-12-01
Usually, tsunami early warning and mitigation systems (TWS or TEWS) are based on several software components deployed in a client-server based infrastructure. The vast majority of systems importantly include desktop-based clients with a graphical user interface (GUI) for the operators in early warning centers. However, in times of cloud computing and ubiquitous computing, the use of concepts and paradigms introduced by continuously evolving approaches in information and communications technology (ICT) has to be considered even for early warning systems (EWS). Based on the experience and knowledge gained in three research projects - 'German Indonesian Tsunami Early Warning System' (GITEWS), 'Distant Early Warning System' (DEWS), and 'Collaborative, Complex, and Critical Decision-Support in Evolving Crises' (TRIDEC) - new technologies are exploited to implement a cloud-based and web-based prototype that opens up new prospects for EWS. This prototype, named 'TRIDEC Cloud', merges several complementary external and in-house cloud-based services into one platform for automated background computation with graphics processing units (GPUs), for web-mapping of hazard-specific geospatial data, and for serving relevant functionality to handle, share, and communicate threat-specific information in a collaborative and distributed environment. The prototype in its current version addresses tsunami early warning and mitigation. The integration of GPU-accelerated tsunami simulation computations has been an integral part of this prototype, to foster early warning with on-demand tsunami predictions based on actual source parameters. However, the platform is meant for researchers around the world to make use of the cloud-based GPU computation to analyze other types of geohazards and natural hazards, and to react upon the computed situation picture with a web-based GUI in a web browser at remote sites. The current website is an early alpha version for demonstration purposes to give the
cellGPU: Massively parallel simulations of dynamic vertex models
Sussman, Daniel M.
2017-10-01
Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations.
Program Files doi: http://dx.doi.org/10.17632/6j2cj29t3r.1
Licensing provisions: MIT
Programming language: CUDA/C++
Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate.
Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU.
Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun
2015-09-01
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
GPU Nuclear Corporation's radiation exposure management system
International Nuclear Information System (INIS)
Slobodien, M.J.; Bovino, A.A.; Perry, O.R.; Hildebrand, J.E.
1984-01-01
GPU Nuclear Corporation has developed a central main frame (IBM 3081) based radiation exposure management system which provides real time and batch transactions for three separate reactor facilities. The structure and function of the data base are discussed. The system's main features include real time on-line radiation work permit generation and personnel exposure tracking; dose accountability as a function of system and component, job type, worker classification, and work location; and personnel dosemeter (TLD and self-reading pocket dosemeter) data processing. The system also carries the qualifications of all radiation workers, including RWP training, respiratory protection training, and the results of respirator fit tests and medical exams. A warning system is used to prevent non-qualified persons from entering controlled areas. The main frame system is interfaced with a variety of mini and micro computer systems for dosemetry, statistical and graphics applications. These are discussed. Some unique dosemetry features which are discussed include assessment of dose for up to 140 parts of the body, with dose evaluations at 7, 300 and 1000 mg/cm² for each part; tracking of MPC hours on a 7-day rolling schedule; automatic pairing of TLD and self-reading pocket dosemeter values; and creation and updating of NRC Forms 4 and 5, with generation of the NRC-required 20.407 and Reg Guide 1.16 reports. As of July 1983, over 20 remote on-line stations were in use, with plans to add 20-30 more by May 1984. The system provides response times for on-line activities of 2-7 seconds and 23 1/2 hours per day of "up time". Examples of the various on-line and batch transactions are described.
Parallel fuzzy connected image segmentation on GPU.
Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W
2011-07-01
Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implemented on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on the GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 24.4x, 18.1x, and 10.3x, respectively, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, respectively, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on NVIDIA GPUs, which are far more cost- and speed-effective than both clusters of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
Resolution of the Vlasov-Maxwell system by PIC discontinuous Galerkin method on GPU with OpenCL
Directory of Open Access Journals (Sweden)
Crestetto Anaïs
2013-01-01
We present an implementation of a Vlasov-Maxwell solver for multicore processors. The Vlasov equation describes the evolution of charged particles in an electromagnetic field, the solution of the Maxwell equations. The Vlasov equation is solved by a Particle-In-Cell (PIC) method, while the Maxwell system is computed by a Discontinuous Galerkin method. We use the OpenCL framework, which allows our code to run on multicore processors or recent Graphic Processing Units (GPUs). We present several numerical applications to two-dimensional test cases.
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
Modern graphics processing units (GPUs) can be effectively used to solve physical systems. To use the GPU optimally, the discretization of the physical system is often restricted to a regular grid. When grid values represent spatial positions, a direct visualization can result in a jagged appearance. ... In this paper we propose to decouple computation and visualization of such systems. We define mappings that enable the deformation of a high-resolution surface based on a physical simulation on a lower resolution uniform grid. More specifically we investigate new approaches for the visualization of a GPU based...
Fast ray-tracing of human eye optics on Graphics Processing Units.
Wei, Qi; Patkar, Saket; Pai, Dinesh K
2014-05-01
We present a new technique for simulating retinal image formation by tracing a large number of rays from objects in three dimensions as they pass through the optic apparatus of the eye to the retina. Simulating human optics is useful for understanding basic questions of vision science and for studying vision defects and their corrections. Because of the complexity of computing such simulations accurately, most previous efforts used simplified analytical models of the normal eye. This makes them less effective in modeling vision disorders associated with abnormal shapes of the ocular structures, which are hard to represent precisely by analytical surfaces. We have developed a computer simulator that can simulate ocular structures of arbitrary shapes, for instance represented by polygon meshes. Topographic and geometric measurements of the cornea, lens, and retina from keratometer or medical imaging data can be integrated for individualized examination. We utilize parallel processing on modern Graphics Processing Units (GPUs) to efficiently compute retinal images by tracing millions of rays. A stable retinal image can be generated within minutes. We simulated depth-of-field, accommodation, chromatic aberrations, as well as astigmatism and its correction. We also show application of the technique to patient-specific vision correction by incorporating geometric models of the orbit reconstructed from clinical medical images.
International Nuclear Information System (INIS)
Liang, Yicheng; Peng, Hao
2015-01-01
Depth-of-interaction (DOI) poses a major challenge for a PET system to achieve uniform spatial resolution across the field-of-view, particularly for small animal and organ-dedicated PET systems. In this work, we implemented an analytical method to model system matrix for resolution recovery, which was then incorporated in PET image reconstruction on a graphical processing unit platform, due to its parallel processing capacity. The method utilizes the concepts of virtual DOI layers and multi-ray tracing to calculate the coincidence detection response function for a given line-of-response. The accuracy of the proposed method was validated for a small-bore PET insert to be used for simultaneous PET/MR breast imaging. In addition, the performance comparisons were studied among the following three cases: 1) no physical DOI and no resolution modeling; 2) two physical DOI layers and no resolution modeling; and 3) no physical DOI design but with a different number of virtual DOI layers. The image quality was quantitatively evaluated in terms of spatial resolution (full-width-half-maximum and position offset), contrast recovery coefficient and noise. The results indicate that the proposed method has the potential to be used as an alternative to other physical DOI designs and achieve comparable imaging performances, while reducing detector/system design cost and complexity. (paper)
VACTIV: A graphical dialog based program for an automatic processing of line and band spectra
Zlokazov, V. B.
2013-05-01
The program VACTIV (Visual ACTIV) has been developed for the automatic analysis of spectrum-like distributions, in particular gamma-ray or alpha spectra, and is a standard graphical dialog based Windows XX application, driven by menu, mouse and keyboard. On the one hand, it was a conversion of an existing Fortran program ACTIV [1] to the DELPHI language; on the other hand, it is a transformation of the sequential syntax of Fortran programming to a new object-oriented style, based on the organization of event interactions. New features implemented in the algorithms of both versions consisted in the following: as the peak model, both an analytical function and a graphical curve could be used; the peak search algorithm was able to recognize not only Gauss peaks but also peaks with an irregular form, both narrow peaks (2-4 channels) and broad ones (50-100 channels); and the regularization technique in the fitting guaranteed a stable solution in the most complicated cases of strongly overlapping or weak peaks. The graphical dialog interface of VACTIV is much more convenient than the batch mode of ACTIV. [1] V.B. Zlokazov, Computer Physics Communications, 28 (1982) 27-37. NEW VERSION PROGRAM SUMMARY Program Title: VACTIV. Catalogue identifier: ABAC_v2_0. Licensing provisions: no. Programming language: DELPHI 5-7 Pascal. Computer: IBM PC series. Operating system: Windows XX. RAM: 1 MB. Keywords: Nuclear physics, spectrum decomposition, least squares analysis, graphical dialog, object-oriented programming. Classification: 17.6. Catalogue identifier of previous version: ABAC_v1_0. Journal reference of previous version: Comput. Phys. Commun. 28 (1982) 27. Does the new version supersede the previous version?: Yes. Nature of problem: Program VACTIV is intended for precise analysis of arbitrary spectrum-like distributions, e.g. gamma-ray and X-ray spectra, and allows the user to carry out the full cycle of automatic processing of such spectra, i.e. calibration, automatic peak search
Directory of Open Access Journals (Sweden)
Wang Kai
2011-05-01
Abstract Background Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have recently been used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings Here we present a novel software package, GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
INTLIB-6, Graphic Device Interface Library for ENDF/B Processing Codes
International Nuclear Information System (INIS)
Dunford, L.
1999-01-01
1 - Description of program or function: The graphic subroutine libraries DISSPLA and GRALIB (USCD1211) generally produce output which is independent of the output graphic device. A set of device dependent interface routines is required to translate the device independent output to the form required for each graphic device available. The interface library INTLIB provides interface routines for the following output formats: TEKTRONIX: LN03 PLUS, video display terminal; POSTSCRIPT: LN03 PLUS with PostScript, LaserJet III in PostScript mode, video display terminal; REGIS: VT240 and VT1200; HPGL: LaserJet III in HPGL mode; FR80: COMP80 film, fiche and hard copy.
A method of gravity and seismic sequential inversion and its GPU implementation
Liu, G.; Meng, X.
2011-12-01
In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on a correlation imaging algorithm; for the seismic inversion, we use full waveform inversion. The link between density and velocity is an empirical formula called the Gardner equation, and for large volumes of data we use the GPU to accelerate the computation. The gravity inversion method is iterative. First, we calculate the correlation imaging of the observed gravity anomaly, which takes values between -1 and +1, and multiply it by a small density increment; this becomes the initial density model. We then compute a forward result with this initial model, calculate the correlation imaging of the misfit between the observed and forward data, multiply that correlation image by a small density increment, and add it to the current model. Repeating this procedure yields the final inverted density model. The seismic inversion method is based on the linearity of the acoustic wave equation written in the frequency domain; starting from an initial velocity model, we can obtain a good velocity result. The sequential inversion of gravity and seismic data requires a link formula to convert between density and velocity; in our method, we use the Gardner equation. Driven by the insatiable market demand for real-time, high-definition 3D images, the programmable NVIDIA Graphic Processing Unit (GPU) has been developed as a co-processor of the CPU for high performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA, designed to overcome the challenge of traditional general purpose GPU programming while maintaining a low learning curve for programmers familiar with standard programming languages such as C. In our inversion processing...
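The gravity step described above reduces, per iteration, to a scaled accumulation: multiply the correlation image of the current misfit by a small density increment and add it to the model. A hypothetical CUDA sketch of just that update (names are illustrative; the forward modelling and the correlation imaging itself are assumed to be computed elsewhere):

// One thread per model cell: rho += deltaRho * corr, where corr is the
// correlation image of the misfit between observed and forward data, in [-1, 1].
__global__ void densityUpdateKernel(float* rho, const float* corrImage,
                                    float deltaRho, int nCells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nCells)
        rho[i] += deltaRho * corrImage[i];
}
// Launch example:
// densityUpdateKernel<<<(nCells + 255) / 256, 256>>>(dRho, dCorr, 0.01f, nCells);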
Achieving high integrity of process-control software by graphical design and formal verification
HALANG, WA; Kramer, B.J.
The International Electrotechnical Commission is currently standardising four compatible languages for designing and implementing programmable logic controllers (PLCs). The language family includes a diagrammatic notation that supports the idea of software ICs to encourage graphical design
Atluri, Sravya; Frehlich, Matthew; Mei, Ye; Garcia Dominguez, Luis; Rogasch, Nigel C.; Wong, Willy; Daskalakis, Zafiris J.; Farzan, Faranak
2016-01-01
Concurrent recording of electroencephalography (EEG) during transcranial magnetic stimulation (TMS) is an emerging and powerful tool for studying brain health and function. Despite a growing interest in adaptation of TMS-EEG across neuroscience disciplines, its widespread utility is limited by signal processing challenges. These challenges arise due to the nature of TMS and the sensitivity of EEG to artifacts that often mask TMS-evoked potentials (TEPs). With an increase in the complexity of data processing methods and a growing interest in multi-site data integration, analysis of TMS-EEG data requires the development of a standardized method to recover TEPs from various sources of artifacts. This article introduces TMSEEG, an open-source MATLAB application comprising multiple algorithms organized to facilitate a step-by-step procedure for TMS-EEG signal processing. Using a modular design and an interactive graphical user interface (GUI), this toolbox aims to streamline TMS-EEG signal processing for both novice and experienced users. Specifically, TMSEEG provides: (i) targeted removal of TMS-induced and general EEG artifacts; (ii) a step-by-step modular workflow with flexibility to modify existing algorithms and add customized algorithms; (iii) a comprehensive display and quantification of artifacts; (iv) quality control check points with visual feedback of TEPs throughout the data processing workflow; and (v) capability to label and store a database of artifacts. In addition to these features, the software architecture of TMSEEG ensures minimal user effort in the initial setup and configuration of parameters for each processing step. This is partly accomplished through a close integration with EEGLAB, a widely used open-source toolbox for EEG signal processing. In this article, we introduce TMSEEG, validate its features and demonstrate its application in extracting TEPs across several single- and multi-pulse TMS protocols. As the first open-source GUI-based pipeline
International Nuclear Information System (INIS)
Sharp, G C; Kandasamy, N; Singh, H; Folkert, M
2007-01-01
This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup: up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
The application of projected conjugate gradient solvers on graphical processing units
International Nuclear Information System (INIS)
Lin, Youzuo; Renaut, Rosemary
2011-01-01
Graphical processing units introduce the capability for large scale computation at the desktop. The numerical results presented verify that the efficiency and accuracy of basic linear algebra subroutines of all levels are comparable when implemented in CUDA and Jacket. Experimental results demonstrate, however, that the level-three basic linear algebra subroutines offer the greatest potential for improving the efficiency of basic numerical algorithms. We consider the solution of sets of linear equations with multiple right hand sides using Krylov subspace-based solvers. For the multiple right hand side case, it is more efficient to make use of a block implementation of the conjugate gradient algorithm rather than to solve each system independently. Jacket is used for the implementation. Furthermore, including projection from one system to another improves efficiency. A relevant example, for which simulated results are provided, is the reconstruction of a three dimensional medical image volume acquired from a positron emission tomography scanner. Efficiency of the reconstruction is improved by using projection across nearby slices.
Efficient molecular dynamics simulations with many-body potentials on graphics processing units
Fan, Zheyong; Chen, Wei; Vierimaa, Ville; Harju, Ari
2017-09-01
Graphics processing units have been extensively used to accelerate classical molecular dynamics simulations. However, there has been much less progress on the acceleration of force evaluations for many-body potentials compared to pairwise ones. In the conventional force evaluation algorithm for many-body potentials, the force, virial stress, and heat current for a given atom are accumulated within different loops, which could result in write conflicts between different threads in a CUDA kernel. In this work, we provide a new force evaluation algorithm, which is based on an explicit pairwise force expression for many-body potentials derived recently (Fan et al., 2015). In our algorithm, the force, virial stress, and heat current for a given atom can be accumulated within a single thread, free of write conflicts. We discuss the formulations and algorithms and evaluate their performance. A new open-source code, GPUMD, is developed based on the proposed formulations. For the Tersoff many-body potential, the double precision performance of GPUMD using a Tesla K40 card is equivalent to that of the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) molecular dynamics code running with about 100 CPU cores (Intel Xeon CPU X5670 @ 2.93 GHz).
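The essence of the write-conflict-free scheme is ownership: thread i accumulates the force on atom i into a local register and is the only writer of force[i], so no atomic operations are needed. A simplified CUDA sketch of this pattern, with a Lennard-Jones pair force standing in for the per-pair terms of the many-body expression (GPUMD's actual kernels differ):

// One thread per atom; neighbor lists are assumed precomputed and stored
// row-major with a fixed maximum number of neighbors per atom.
__global__ void forceKernel(const float3* pos, const int* neighbors,
                            const int* numNeighbors, int maxNeighbors,
                            float3* force, int nAtoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;
    float3 fi = make_float3(0.0f, 0.0f, 0.0f);
    float3 pi = pos[i];
    for (int k = 0; k < numNeighbors[i]; ++k) {
        int j = neighbors[i * maxNeighbors + k];
        float dx = pos[j].x - pi.x, dy = pos[j].y - pi.y, dz = pos[j].z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float inv2 = 1.0f / r2, inv6 = inv2 * inv2 * inv2;
        float fr = 24.0f * inv6 * (2.0f * inv6 - 1.0f) * inv2;  // LJ, eps = sigma = 1
        fi.x -= fr * dx; fi.y -= fr * dy; fi.z -= fr * dz;
    }
    force[i] = fi;  // the single owning thread writes: no write conflict
}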
Zubair, Mohammad; Nielsen, Eric; Luitjens, Justin; Hammond, Dana
2016-01-01
In the field of computational fluid dynamics, the Navier-Stokes equations are often solved using an unstructured-grid approach to accommodate geometric complexity. Implicit solution methodologies for such spatial discretizations generally require frequent solution of large tightly-coupled systems of block-sparse linear equations. The multicolor point-implicit solver used in the current work typically requires a significant fraction of the overall application run time. In this work, an efficient implementation of the solver for graphics processing units is proposed. Several factors present unique challenges to achieving an efficient implementation in this environment. These include the variable amount of parallelism available in different kernel calls, indirect memory access patterns, low arithmetic intensity, and the requirement to support variable block sizes. In this work, the solver is reformulated to use standard sparse and dense Basic Linear Algebra Subprograms (BLAS) functions. However, numerical experiments show that the performance of the BLAS functions available in existing CUDA libraries is suboptimal for matrices representative of those encountered in actual simulations. Instead, optimized versions of these functions are developed. Depending on block size, the new implementations show performance gains of up to 7x over the existing CUDA library functions.
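For illustration, the core data structure in such a solver is a block-sparse matrix, commonly stored in block-CSR form. A simplified matrix-vector product kernel with a fixed 4x4 block size and one thread per block row is sketched below; the optimized kernels described above additionally support variable block sizes and tune the memory access pattern, so this only shows the layout.

#define NB 4  // block size, fixed here for simplicity

// y = A * x for a block-CSR matrix: rowPtr/colIdx index NBxNB dense blocks.
__global__ void bsrMatVec(const int* rowPtr, const int* colIdx,
                          const float* blocks, const float* x,
                          float* y, int nBlockRows)
{
    int br = blockIdx.x * blockDim.x + threadIdx.x;
    if (br >= nBlockRows) return;
    float acc[NB] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int b = rowPtr[br]; b < rowPtr[br + 1]; ++b) {
        const float* blk = blocks + (size_t)b * NB * NB;  // row-major NBxNB block
        int bc = colIdx[b];
        for (int r = 0; r < NB; ++r)
            for (int c = 0; c < NB; ++c)
                acc[r] += blk[r * NB + c] * x[bc * NB + c];
    }
    for (int r = 0; r < NB; ++r) y[br * NB + r] = acc[r];
}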
Paardekooper, S.-J.
2017-08-01
We present a new method for numerical hydrodynamics which uses a multidimensional generalization of the Roe solver and operates on an unstructured triangular mesh. The main advantage over traditional methods based on Riemann solvers, which commonly use one-dimensional flux estimates as building blocks for a multidimensional integration, is its inherently multidimensional nature, and as a consequence its ability to recognize multidimensional stationary states that are not hydrostatic. A second novelty is the focus on graphics processing units (GPUs). By tailoring the algorithms specifically to GPUs, we are able to get speedups of 100-250 compared to a desktop machine. We compare the multidimensional upwind scheme to a traditional, dimensionally split implementation of the Roe solver on several test problems, and we find that the new method significantly outperforms the Roe solver in almost all cases. This comes with increased computational costs per time-step, which makes the new method approximately a factor of 2 slower than a dimensionally split scheme acting on a structured grid.
Parallelization and checkpointing of GPU applications through program transformation
Energy Technology Data Exchange (ETDEWEB)
Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to overcome the GPU memory limitation for applications with large memory footprints. Parallelizing single-GPU applications has been approached with libraries that distribute the workload at runtime; however, these impose execution overhead and are not portable. On traditional CPU systems, on the other hand, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at the application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine today, reliability is also a concern for GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
Zhang, Airong; Zhang, Song; Bian, Cuirong
2018-02-01
Cortical bone provides the main form of support in humans and other vertebrates against various forces. Thus, capturing its mechanical properties is important. In this study, the mechanical properties of cortical bone were investigated by using automated ball indentation and graphics processing at both the macroscopic and microstructural levels under dry conditions. First, all polished samples were photographed under a metallographic microscope, and the area ratio of the circumferential lamellae and osteons was calculated through the graphics processing method. Second, fully-computer-controlled automated ball indentation (ABI) tests were performed to explore the micro-mechanical properties of the cortical bone at room temperature and a constant indenter speed. The indentation defects were examined with a scanning electron microscope. Finally, the macroscopic mechanical properties of the cortical bone were estimated with the graphics processing method and the mixture rule. Combining ABI and graphics processing proved to be an effective tool for obtaining the mechanical properties of the cortical bone, and the indenter size had a significant effect on the measurement. The methods presented in this paper provide an innovative approach to acquiring the macroscopic mechanical properties of cortical bone in a nondestructive manner.
A New Parallel Approach for Accelerating the GPU-Based Execution of Edge Detection Algorithms.
Emrani, Zahra; Bateni, Soroosh; Rabbani, Hossein
2017-01-01
Real-time image processing is used in a wide variety of applications, such as medical care and industrial processes. In medical care, this technique can display important patient information graphically, which can supplement and support the treatment process. Medical decisions made based on real-time images are more accurate and reliable. According to recent research, graphics processing unit (GPU) programming is a useful method for improving the speed and quality of medical image processing, and is one way of achieving real-time image processing. Edge detection is an early stage in most image processing methods for the extraction of features and object segments from a raw image. The Canny method, Sobel and Prewitt filters, and the Roberts' Cross technique are some examples of edge detection algorithms that are widely used in image processing and machine vision. In this work, these algorithms are implemented using the Compute Unified Device Architecture (CUDA), Open Source Computer Vision (OpenCV), and Matrix Laboratory (MATLAB) platforms. An existing parallel method for the Canny approach has been modified further to run in a fully parallel manner, by replacing the breadth-first search procedure with a parallel method. These algorithms have been compared by testing them on a database of optical coherence tomography images. The comparison of results shows that the proposed implementation of the Canny method on the GPU using the CUDA platform improves the speed of execution by 2-100× compared to the central processing unit-based implementation using the OpenCV and MATLAB platforms.
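As a flavor of the per-pixel parallelism these implementations exploit, below is a minimal CUDA Sobel gradient-magnitude kernel; the full Canny pipeline described above adds Gaussian smoothing, non-maximum suppression and hysteresis thresholding on top of such a stage.

// One thread per interior pixel; border pixels are left untouched.
__global__ void sobelKernel(const unsigned char* in, unsigned char* out,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;
    int idx = y * width + x;
    int gx = -in[idx - width - 1] + in[idx - width + 1]
             - 2 * in[idx - 1] + 2 * in[idx + 1]
             - in[idx + width - 1] + in[idx + width + 1];
    int gy = -in[idx - width - 1] - 2 * in[idx - width] - in[idx - width + 1]
             + in[idx + width - 1] + 2 * in[idx + width] + in[idx + width + 1];
    int mag = abs(gx) + abs(gy);  // L1 approximation of the gradient magnitude
    out[idx] = (unsigned char)(mag > 255 ? 255 : mag);
}
// Launch example:
// sobelKernel<<<dim3((w + 15) / 16, (h + 15) / 16), dim3(16, 16)>>>(dIn, dOut, w, h);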
Madanecki, Piotr; Bałut, Magdalena; Buckley, Patrick G; Ochocka, J Renata; Bartoszewski, Rafał; Crossman, David K; Messiaen, Ludwine M; Piotrowski, Arkadiusz
2018-01-01
High-throughput technologies generate considerable amounts of data which often require bioinformatic expertise to analyze. Here we present High-Throughput Tabular Data Processor (HTDP), a platform independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering and converting of data that is produced in the course of high-throughput experiments. HTDP can also utilize itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets using a graphical user interface (GUI). Therefore, no prior expertise in programming, regular expressions, or command line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks including microarray and massively parallel sequencing, i.e. identification of disease predisposing variants in next generation sequencing data as well as comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merging, reduction and filtering with external criteria files. HTDP was developed to address functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Flexibility in terms of input file handling provides long-term potential functionality in high-throughput analysis pipelines, as the program is not limited by the currently existing applications and data formats. HTDP is available as Open Source software (https://github.com/pmadanecki/htdp).
Directory of Open Access Journals (Sweden)
Kateryna P. Osadcha
2017-12-01
The article is devoted to some aspects of the formation of future bachelors' graphic competence in computer sciences while teaching the fundamentals of working with three-dimensional modelling means. An analysis, classification and systematization of three-dimensional modelling means are given. The aim of the research is to investigate the set of instruments and the classification of three-dimensional modelling means, and to correlate the skills being formed with those required at the labour market, in order to use them further in the process of forming graphic competence during the training of future bachelors in computer sciences. The peculiarities of the process of forming future bachelors' graphic competence in computer sciences, by means of revealing, analyzing and systematizing three-dimensional modelling means and types of three-dimensional graphics at the present stage of the development of information technologies, are outlined. The result of the research is a software choice in three-dimensional modelling for the process of training future bachelors in computer sciences.
Proton Testing of nVidia GTX 1050 GPU
Wyrwas, E. J.
2017-01-01
Single-Event Effects (SEE) testing was conducted on the nVidia GTX 1050 Graphics Processing Unit (GPU), herein referred to as the device under test (DUT). Testing was conducted at Massachusetts General Hospital's (MGH) Francis H. Burr Proton Therapy Center on April 9th, 2017 using 200-MeV protons. The purpose of this test campaign was to provide a baseline assessment of the radiation susceptibility of the DUT, as no previous testing had been conducted on this component.
GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU.
Braak, van den G.J.W.; Nugteren, C.; Mesman, B.; Corporaal, H.; Kaklamanis, C.; Papatheodorou, T.; Spirakis, P.G.
2012-01-01
Voting algorithms, such as histogram and Hough transforms, are frequently used in various domains, such as statistics and image processing. Algorithms in these domains may be accelerated using GPUs. Implementing voting algorithms efficiently on a GPU, however, is far from trivial due to
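The canonical GPU voting pattern, which frameworks like GPU-Vote generalize, is a two-level histogram: threads vote into a fast shared-memory accumulator within each block, and the per-block results are then merged into the global tally. A minimal CUDA sketch follows (a Hough accumulator uses the same pattern, with the bin computed from the (rho, theta) parameterization):

#define NBINS 256  // one bin per 8-bit input value

__global__ void histogramKernel(const unsigned char* data, int n,
                                unsigned int* hist)
{
    __shared__ unsigned int local[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x) local[b] = 0;
    __syncthreads();
    // Grid-stride loop: each thread casts votes for many input elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);  // vote in fast shared memory
    __syncthreads();
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);   // merge into the global histogram
}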
Fast parallel tandem mass spectral library searching using GPU hardware acceleration.
Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B
2011-06-03
Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.
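The rate-limiting comparison described above, a dot product of the acquired spectrum against every library candidate, maps naturally onto one thread block per candidate with a shared-memory reduction. A minimal CUDA sketch of that pattern (illustrative only, not FastPaSS's actual code; the block size must be a power of two):

// scores[s] = dot(query, library[s]); launch one block per library spectrum:
// spectralDotKernel<<<nSpectra, 256, 256 * sizeof(float)>>>(...)
__global__ void spectralDotKernel(const float* query, const float* library,
                                  float* scores, int vecLen)
{
    extern __shared__ float partial[];
    const float* spectrum = library + (size_t)blockIdx.x * vecLen;
    float sum = 0.0f;
    for (int i = threadIdx.x; i < vecLen; i += blockDim.x)
        sum += query[i] * spectrum[i];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) scores[blockIdx.x] = partial[0];
}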
Directory of Open Access Journals (Sweden)
Yu Liu
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted graphics cards with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, using the GPU capability only to do the SW computations one by one. Hence, in this paper, we propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on the GPU, a procedure is applied on the CPU, using the frequency distance filtration scheme (FDFS), to eliminate unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
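For reference, the cell update that both parallelization strategies distribute is the SW recurrence, shown here with a linear gap penalty g and substitution score s for simplicity:

\[ H_{i,j} = \max\left\{0,\; H_{i-1,j-1} + s(a_i, b_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g\right\}, \qquad H_{i,0} = H_{0,j} = 0. \]

All cells on the same anti-diagonal of H are mutually independent, which is what the intratask scheme exploits: each synchronized step fills an entire anti-diagonal in parallel.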
Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang
2018-04-01
The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.
A. Agbaria (Adnan); D. Minor (David); N. Peterfreund (Natan); E. Rozenberg (Eyal); O. Rosenberg (Ofer); Huawei Research
2016-01-01
Existing work on accelerating analytic DB query processing with (discrete) GPUs fails to fully realize their potential for speedup through parallelism: published results do not achieve significant speedup over more performant CPU-only DBMSes when processing complete queries. This
Parallel, distributed and GPU computing technologies in single-particle electron microscopy.
Schmeisser, Martin; Heisen, Burkhard C; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger
2009-07-01
Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today's technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined.
International Nuclear Information System (INIS)
Wan, H; Tseung, Chan; Beltran, C
2016-01-01
Purpose: To demonstrate fast and accurate Monte Carlo (MC) calculations of proton dose-averaged linear energy transfer (LETd) and biological dose (BD) on a Graphics Processing Unit (GPU) card. Methods: A previously validated GPU-based MC simulation of proton transport was used to rapidly generate LETd distributions for proton treatment plans. Since this MC handles proton-nuclei interactions on an event-by-event basis using a Bertini intranuclear cascade-evaporation model, secondary protons were taken into account. The smaller contributions of secondary neutrons and recoil nuclei were ignored. Recent work has shown that LETd values are sensitive to the scoring method. The GPU-based LETd calculations were verified by comparing with a TOPAS custom scorer that uses tabulated stopping powers, following recommendations by other authors. Comparisons were made for prostate and head-and-neck patients. A python script is used to convert the MC-generated LETd distributions to BD using a variety of published linear quadratic models, and to export the BD in DICOM format for subsequent evaluation. Results: Very good agreement is obtained between TOPAS and our GPU MC. Given a complex head-and-neck plan with 1 mm voxel spacing, the physical dose, LETd and BD calculations for 10⁸ proton histories can be completed in ∼5 minutes using an NVIDIA Titan X card. The rapid turnover means that MC feedback can be obtained on dosimetric plan accuracy as well as BD hotspot locations, particularly in regard to their proximity to critical structures. In our institution the GPU MC-generated dose, LETd and BD maps are used to assess plan quality for all patients undergoing treatment. Conclusion: Fast and accurate MC-based LETd calculations can be performed on the GPU. The resulting BD maps provide valuable feedback during treatment plan review. Partially funded by Varian Medical Systems.
Fykse, Egil
2013-01-01
The objective of this thesis is to compare the suitability of FPGAs, GPUs and DSPs for digital image processing applications. Normalized cross-correlation is used as a benchmark, because this algorithm includes convolution, a common operation in image processing and elsewhere. Normalized cross-correlation is a template matching algorithm that is used to locate predefined objects in a scene image. Because the throughput of DSPs is low for efficient calculation of normalized cross-correlation, ...
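For reference, the benchmark quantity is the normalized cross-correlation surface; with scene image f, template t, template mean \bar{t}, and local scene mean \bar{f}_{u,v} under the template at offset (u,v):

\[ \gamma(u,v) = \frac{\sum_{x,y} \left[f(x,y) - \bar{f}_{u,v}\right]\left[t(x-u,\,y-v) - \bar{t}\right]}{\sqrt{\sum_{x,y} \left[f(x,y) - \bar{f}_{u,v}\right]^2 \,\sum_{x,y} \left[t(x-u,\,y-v) - \bar{t}\right]^2}}. \]

The numerator is a convolution-like sliding sum, which is why the benchmark exercises the same hardware capabilities as general image filtering.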
Directory of Open Access Journals (Sweden)
David S. Hardin
2013-04-01
As Graphics Processing Units (GPUs) have gained in capability and GPU development environments have matured, developers are increasingly turning to the GPU to off-load numerically-intensive, parallelizable computations from the main host CPU. Modern GPUs feature hundreds of cores, and offer programming niceties such as double-precision floating point, and even limited recursion. This shift from CPU to GPU, however, raises the question: how do we know that these new GPU-based algorithms are correct? In order to explore this new verification frontier, we formalized a parallelizable all-pairs shortest path (APSP) algorithm for weighted graphs, originally coded in NVIDIA's CUDA language, in ACL2. The ACL2 specification is written using a single-threaded object (stobj) and tail recursion, as the stobj/tail recursion combination yields the most straightforward translation from imperative programming languages, as well as efficient, scalable executable specifications within ACL2 itself. The ACL2 version of the APSP algorithm can process millions of vertices and edges with little to no garbage generation, and executes at one-sixth the speed of a host-based version of APSP coded in C – a very respectable result for a theorem prover. In addition to formalizing the APSP algorithm (which uses Dijkstra's shortest path algorithm at its core), we have also provided capability that the original APSP code lacked, namely shortest path recovery. Path recovery is accomplished using a secondary ACL2 stobj implementing a LIFO stack, which is proven correct. To conclude the experiment, we ported the ACL2 version of the APSP kernels back to C, resulting in a less than 5% slowdown, and also performed a partial back-port to CUDA, which, surprisingly, yielded a slight performance increase.
GPU Enhancement of the Trigger to Extend Physics Reach at the LHC
Lujan, P.; Hunt, A.; Jindal, P.; LeGresley, P.
2014-01-01
At the Large Hadron Collider (LHC), the trigger systems for the detectors must be able to process a very large amount of data in a very limited amount of time, so that the nominal collision rate of 40 MHz can be reduced to a data rate that can be stored and processed in a reasonable amount of time. This need for high performance places very stringent requirements on the complexity of the algorithms that can be used for identifying events of interest in the trigger system, which potentially limits the ability to trigger on signatures of various new physics models. In this paper, we present an alternative tracking algorithm, based on the Hough transform, which avoids many of the problems associated with the standard combinatorial track finding currently used. The Hough transform is also well-adapted for Graphics Processing Unit (GPU)-based computing, and such GPU-based systems could be easily integrated into the existing High-Level Trigger (HLT). This algorithm offers the ability to trigger on topological signa...
Edge-preserving image denoising via group coordinate descent on the GPU.
McGaffin, Madison Gray; Fessler, Jeffrey A
2015-04-01
Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel 1D pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates in-place and only store the noisy data, denoised image, and problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation. Both algorithms use the majorize-minimize framework to solve the 1D pixel update subproblem. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration and run-time.
Kwon, Min-Woo; Kim, Seung-Cheol; Kim, Eun-Soo
2016-01-20
A three-directional motion-compensation mask-based novel look-up table method is proposed and implemented on graphics processing units (GPUs) for video-rate generation of digital holographic videos of three-dimensional (3D) scenes. Since the proposed method is designed to be well matched with the software and memory structures of GPUs, the number of compute-unified-device-architecture kernel function calls can be significantly reduced. This results in a great increase of the computational speed of the proposed method, allowing video-rate generation of the computer-generated hologram (CGH) patterns of 3D scenes. Experimental results reveal that the proposed method can generate 39.8 frames of Fresnel CGH patterns with 1920×1080 pixels per second for the test 3D video scenario with 12,088 object points on dual GPU boards of NVIDIA GTX TITANs, and they confirm the feasibility of the proposed method in the practical application fields of electroholographic 3D displays.
Sanchez, Julio
2003-01-01
Part I - Graphics Fundamentals. PC Graphics Overview: History and Evolution; Short History of PC Video; PS/2 Video Systems; SuperVGA; Graphics Coprocessors and Accelerators; Graphics Applications; State-of-the-Art in PC Graphics; 3D Application Programming Interfaces. Polygonal Modeling: Vector and Raster Data; Coordinate Systems; Modeling with Polygons. Image Transformations: Matrix-based Representations; Matrix Arithmetic; 3D Transformations. Programming Matrix Transformations: Numeric Data in Matrix Form; Array Processing. Projections and Rendering: Perspective; The Rendering Pipeline. Lighting and Shading: Lightin...
Arvo, James
1991-01-01
Graphics Gems II is a collection of articles by a diverse group of contributors, reflecting ideas and approaches in graphics programming that can benefit other computer graphics programmers. This volume presents techniques for performing well-known graphics operations faster or more easily. The book contains chapters devoted to topics in two-dimensional and three-dimensional geometry and algorithms, image processing, frame buffer techniques, and ray tracing techniques. The radiosity approach, matrix techniques, and numerical and programming techniques are likewise discussed. Graphics artists and comput...
Real-Time Incompressible Fluid Simulation on the GPU
Directory of Open Access Journals (Sweden)
Xiao Nie
2015-01-01
Full Text Available We present a parallel framework for simulating incompressible fluids with predictive-corrective incompressible smoothed particle hydrodynamics (PCISPH) on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of state-of-the-art GPUs. In PCISPH-based simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this time-consuming step is introduced. Moreover, we discuss several optimization techniques, including the use of fast on-chip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of real-time fluid simulation is significantly improved, since our method enforces the incompressibility constraint that is typically ignored for efficiency reasons in previous GPU-based SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluids in real time, with a speedup factor of up to 23 on a high-end NVIDIA GPU in comparison to a single-threaded CPU-based implementation.
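The parallel-sorting idea can be sketched as follows (an assumed minimal version using Thrust, not the authors' pipeline): particles are keyed by the index of the uniform grid cell they occupy, sorted by key so that particles in the same cell become contiguous, and the start of each cell is then found in one pass.

```cuda
// Grid-cell neighbor search via key sort (sketch, not the paper's code).
#include <cstdio>
#include <cmath>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

__global__ void findCellStart(const unsigned int* cellOf, int n, int* cellStart) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int c = cellOf[i];
    if (i == 0 || cellOf[i - 1] != c) cellStart[c] = i;  // first particle of cell c
}

int main() {
    const int n = 1 << 16, nx = 64, nCells = nx * nx;
    thrust::host_vector<unsigned int> hKeys(n);
    for (int i = 0; i < n; ++i) {                // pseudo-random particle cells
        float x = (i * 0.61803f) - floorf(i * 0.61803f);
        float y = (i * 0.75487f) - floorf(i * 0.75487f);
        hKeys[i] = (unsigned)(y * nx) * nx + (unsigned)(x * nx);
    }
    thrust::device_vector<unsigned int> keys = hKeys;
    thrust::device_vector<int> ids(n);
    thrust::sequence(ids.begin(), ids.end());    // particle indices to permute
    thrust::sort_by_key(keys.begin(), keys.end(), ids.begin());

    thrust::device_vector<int> cellStart(nCells, -1);
    findCellStart<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(keys.data()), n,
        thrust::raw_pointer_cast(cellStart.data()));
    // Neighbors of a particle are the contiguous sorted ranges of its own
    // cell and the surrounding cells, located via cellStart.
    int s0 = cellStart[0];
    printf("cell 0 starts at sorted index %d\n", s0);
    return 0;
}
```

Sorting turns the scattered neighbor queries of SPH into coalesced reads of contiguous memory, which is why this step pays off even though it is repeated several times per time step.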
GRAPHICS PROCESSING UNITS: MORE THAN THE PATHWAY TO REALISTIC VIDEO-GAMES
Directory of Open Access Journals (Sweden)
CARLOS TRUJILLO
2011-01-01
Full Text Available The large video-game market has driven rapid progress in hardware and software aimed at achieving ever more realistic gaming environments. Among these developments are graphics processing units (GPUs), whose purpose is to relieve the central processing unit (CPU) of the elaborate computations that give "life" to video games. To accomplish this, GPUs are equipped with multiple processing cores operating in parallel, which allows them to be used for tasks far more diverse than video-game development. This article presents a brief description of the characteristics of the compute unified device architecture (CUDA™), a parallel computing architecture for GPUs. An application of this architecture to the numerical reconstruction of holograms is presented, for which a speedup of 11X is reported with respect to the performance achieved on a CPU.
Mobile ultrasound plane wave beamforming on iPhone or iPad using Metal-based GPU processing
Hewener, H.; Tretbar, S.
2015-01-01
Mobile and cost-effective ultrasound devices are being used in point-of-care scenarios and in the trauma room. To reduce the costs of such devices we have already presented the possibilities of consumer devices like the Apple iPad for full signal processing of raw data for ultrasound image generation. Using technologies like plane wave imaging to generate a full image with only one excitation/reception event, the acquisition times and power consumption of ultrasound imaging can be reduced for low power mobil...
Multi-GPU hybrid programming accelerated three-dimensional phase-field model in binary alloy
Directory of Open Access Journals (Sweden)
Changsheng Zhu
2018-03-01
Full Text Available In dendritic growth simulation, computational efficiency and attainable problem scale have an extremely important influence on the usefulness of a three-dimensional phase-field model. Seeking a high-performance calculation method that improves computational efficiency and expands the problem scale is therefore of great significance for research on the microstructure of materials. A high-performance calculation method based on an MPI+CUDA hybrid programming model is introduced. Multiple GPUs are used to perform quantitative numerical simulations of a three-dimensional phase-field model of a binary alloy under coupled multi-physical processes. The acceleration achieved by different numbers of GPU nodes at different calculation scales is explored. On the foundation of this multi-GPU calculation model, two optimization schemes are proposed: non-blocking communication, and overlap of MPI communication with GPU computing. The results of the two optimization schemes are compared with the basic multi-GPU model. The calculations show that the multi-GPU model clearly improves the computational efficiency of the three-dimensional phase-field model, running 13 times faster than a single GPU, and the problem scale has been expanded to 8193. Both optimization schemes are shown to be feasible; the overlap of MPI and GPU computing performs better, running 1.7 times faster than the basic multi-GPU model when 21 GPUs are used.
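The overlap scheme can be sketched roughly as follows (a schematic with invented buffer names, not the authors' code): boundary values are exchanged with non-blocking MPI while a kernel updates the interior of the subdomain in a separate CUDA stream, hiding communication behind computation.

```cuda
// Schematic MPI+CUDA overlap for a 1D domain decomposition (sketch only).
// Compile with, e.g.: nvcc -ccbin mpicxx overlap.cu
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void updateInterior(float* f, int n) {   // stand-in for the
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // phase-field stencil
    if (i > 1 && i < n - 2) f[i] = 0.5f * (f[i - 1] + f[i + 1]);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left = (rank + size - 1) % size, right = (rank + 1) % size;

    const int n = 1 << 20;
    float *d_f, *h_send, *h_recv;
    cudaMalloc(&d_f, n * sizeof(float));
    cudaMemset(d_f, 0, n * sizeof(float));
    cudaMallocHost(&h_send, 2 * sizeof(float));     // pinned staging buffers
    cudaMallocHost(&h_recv, 2 * sizeof(float));

    cudaStream_t sHalo, sBulk;
    cudaStreamCreate(&sHalo); cudaStreamCreate(&sBulk);

    // 1) stage boundary values to the host asynchronously
    cudaMemcpyAsync(&h_send[0], &d_f[1],     sizeof(float), cudaMemcpyDeviceToHost, sHalo);
    cudaMemcpyAsync(&h_send[1], &d_f[n - 2], sizeof(float), cudaMemcpyDeviceToHost, sHalo);
    cudaStreamSynchronize(sHalo);

    // 2) non-blocking halo exchange while the interior kernel runs
    MPI_Request req[4];
    MPI_Isend(&h_send[0], 1, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&h_send[1], 1, MPI_FLOAT, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Irecv(&h_recv[0], 1, MPI_FLOAT, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Irecv(&h_recv[1], 1, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &req[3]);
    updateInterior<<<(n + 255) / 256, 256, 0, sBulk>>>(d_f, n);

    // 3) finish communication, write halos back, then synchronize
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(&d_f[0],     &h_recv[0], sizeof(float), cudaMemcpyHostToDevice, sHalo);
    cudaMemcpyAsync(&d_f[n - 1], &h_recv[1], sizeof(float), cudaMemcpyHostToDevice, sHalo);
    cudaDeviceSynchronize();

    cudaFree(d_f); cudaFreeHost(h_send); cudaFreeHost(h_recv);
    MPI_Finalize();
    return 0;
}
```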
GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications
International Nuclear Information System (INIS)
Lemaréchal, Yannick; Bert, Julien; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris; Falconnet, Claire; Després, Philippe; Valeri, Antoine
2015-01-01
In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled 125I seeds used in LDR brachytherapy. In addition, dose scoring based on a track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400 × 250 × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms per 10⁶ simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications.
Sewell, Stephen
This thesis introduces a software framework that effectively utilizes low-cost commercially available Graphic Processing Units (GPUs) to simulate complex scientific plasma phenomena that are modeled using the Particle-In-Cell (PIC) paradigm. The software framework that was developed conforms to the Compute Unified Device Architecture (CUDA), a standard for general purpose graphic processing that was introduced by NVIDIA Corporation. This framework has been verified for correctness and applied to advance the state of understanding of the electromagnetic aspects of the development of the Aurora Borealis and Aurora Australis. For each phase of the PIC methodology, this research has identified one or more methods to exploit the problem's natural parallelism and effectively map it for execution on the graphic processing unit and its host processor. The sources of overhead that can reduce the effectiveness of parallelization for each of these methods have also been identified. One of the novel aspects of this research was the utilization of particle sorting during the grid interpolation phase. The final representation resulted in simulations that executed about 38 times faster than simulations that were run on a single-core general-purpose processing system. The scalability of this framework to larger problem sizes and future generation systems has also been investigated.
Gpufit: An open-source toolkit for GPU-accelerated curve fitting.
Przybylski, Adrian; Thiel, Björn; Keller-Findeisen, Jan; Stock, Bernd; Bates, Mark
2017-11-16
We present a general purpose, open-source software library for estimation of non-linear parameters by the Levenberg-Marquardt algorithm. The software, Gpufit, runs on a Graphics Processing Unit (GPU) and executes computations in parallel, resulting in a significant gain in performance. We measured a speed increase of up to 42 times when comparing Gpufit with an identical CPU-based algorithm, with no loss of precision or accuracy. Gpufit is designed such that it is easily incorporated into existing applications or adapted for new ones. Multiple software interfaces, including C, Python, and Matlab, ensure that Gpufit is accessible from most programming environments. The full source code is published as an open source software repository, making its function transparent to the user and facilitating future improvements and extensions. As a demonstration, we used Gpufit to accelerate an existing scientific image analysis package, yielding significantly improved processing times for super-resolution fluorescence microscopy datasets.
Aerodynamic optimization of supersonic compressor cascade using differential evolution on GPU
Energy Technology Data Exchange (ETDEWEB)
Aissa, Mohamed Hasanine; Verstraete, Tom [Von Karman Institute for Fluid Dynamics (VKI) 1640 Sint-Genesius-Rode (Belgium); Vuik, Cornelis [Delft University of Technology 2628 CD Delft (Netherlands)
2016-06-08
Differential Evolution (DE) is a powerful stochastic optimization method. Compared to gradient-based algorithms, DE is able to avoid local minima but requires more function evaluations. In turbomachinery applications, function evaluations are performed with time-consuming CFD simulations, which results in a long, unaffordable design cycle. Modern high-performance computing systems, especially Graphics Processing Units (GPUs), are able to alleviate this inconvenience by accelerating the design evaluation itself. In this work we present a validated CFD solver running on GPUs, able to accelerate the design evaluation and thus the entire design process. The achieved speedup of 20x to 30x enabled the DE algorithm to run on a high-end computer instead of a costly large cluster. The GPU-enhanced DE was used to optimize the aerodynamics of a supersonic compressor cascade, achieving a 20% reduction in aerodynamic loss.
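In outline, a GPU-side DE generation step looks like the following sketch (a generic DE/rand/1/bin kernel of my own construction, with a cheap sphere function standing in for the expensive CFD evaluation):

```cuda
// Generic differential evolution (DE/rand/1/bin) generation step (sketch).
#include <cstdio>
#include <curand_kernel.h>

#define NP 64   // population size
#define D  8    // design variables
#define F  0.7f // mutation weight
#define CR 0.9f // crossover rate

__device__ float objective(const float* x) {       // stand-in for CFD loss
    float s = 0.0f;
    for (int d = 0; d < D; ++d) s += x[d] * x[d];
    return s;
}

__global__ void deInit(float* pop, float* cost, curandState* st, unsigned seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= NP) return;
    curand_init(seed, i, 0, &st[i]);
    for (int d = 0; d < D; ++d)
        pop[i * D + d] = curand_uniform(&st[i]) * 10.0f - 5.0f;
    cost[i] = objective(&pop[i * D]);
}

__global__ void deStep(const float* pop, float* newPop, float* cost,
                       curandState* st) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= NP) return;
    curandState s = st[i];
    int r1, r2, r3;                                  // three distinct partners
    do { r1 = curand(&s) % NP; } while (r1 == i);
    do { r2 = curand(&s) % NP; } while (r2 == i || r2 == r1);
    do { r3 = curand(&s) % NP; } while (r3 == i || r3 == r1 || r3 == r2);
    float trial[D];
    int jr = curand(&s) % D;                         // forced crossover index
    for (int d = 0; d < D; ++d) {
        float mut = pop[r1 * D + d] + F * (pop[r2 * D + d] - pop[r3 * D + d]);
        trial[d] = (d == jr || curand_uniform(&s) < CR) ? mut : pop[i * D + d];
    }
    float cTrial = objective(trial);                 // greedy selection
    if (cTrial < cost[i]) {
        cost[i] = cTrial;
        for (int d = 0; d < D; ++d) newPop[i * D + d] = trial[d];
    } else {
        for (int d = 0; d < D; ++d) newPop[i * D + d] = pop[i * D + d];
    }
    st[i] = s;
}

int main() {
    float *a, *b, *cost; curandState* st;
    cudaMalloc(&a, NP * D * sizeof(float));
    cudaMalloc(&b, NP * D * sizeof(float));
    cudaMalloc(&cost, NP * sizeof(float));
    cudaMalloc(&st, NP * sizeof(curandState));
    deInit<<<1, NP>>>(a, cost, st, 42u);
    for (int g = 0; g < 200; ++g) {                  // generations
        deStep<<<1, NP>>>(a, b, cost, st);
        float* t = a; a = b; b = t;                  // swap populations
    }
    float hc[NP];
    cudaMemcpy(hc, cost, sizeof(hc), cudaMemcpyDeviceToHost);
    float best = hc[0];
    for (int i = 1; i < NP; ++i) if (hc[i] < best) best = hc[i];
    printf("best cost after 200 generations: %g\n", best);
    cudaFree(a); cudaFree(b); cudaFree(cost); cudaFree(st);
    return 0;
}
```

In the actual application the per-member objective call is the GPU CFD evaluation, which dominates the run time; the DE bookkeeping itself is cheap.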
GPU seeks new funding for TMI cleanup
International Nuclear Information System (INIS)
Utroska, D.
1982-01-01
General Public Utilities (GPU) wants approval for annual transfer of money from base rate increases in other accounts to pay for the cleanup at Three Mile Island (TMI) until TMI-1 returns to service or the public utility commission takes further action. This proposal confirms fears of a delay in TMI-1 startup and demonstrates that the January negotiated settlement will produce little funding for TMI-2 cleanup. A review of the settlement terms outlines the three-step process for base rate increases and revenue adjustments after the startup of TMI-1, and points out where controversy and delays due to psychological stress make a new source of money essential. GPU thinks customer funding will motivate other parties to a broad-based cost-sharing agreement
Ha, Sanghyun; Park, Junshin; You, Donghyun
2018-01-01
The computational power of Graphics Processing Units (GPUs) is exploited for solving the incompressible Navier-Stokes equations, integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method rely on multiple tridiagonal matrices whose inversion is known to be the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the Compute Unified Device Architecture (CUDA) programming model that are critical to the bandwidth-bound nature of the present method are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for accelerating tridiagonal matrix inversion and the fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate on a grid of over 134 million points. Performance improves further to a 48-times speedup for the same problem using a Tesla P100 GPU.
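The tridiagonal bottleneck can be illustrated with a simple batched Thomas solver (a sketch of the general idea, not the authors' implementation): one thread per system, with the systems interleaved in memory so that consecutive threads access consecutive addresses, which is the kind of layout consideration the abstract refers to.

```cuda
// Batched Thomas algorithm: one thread per tridiagonal system.
// Interleaved layout [i * nSys + s] keeps global-memory accesses coalesced.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void thomasBatched(const float* a, const float* b, const float* c,
                              float* d, float* cp, int n, int nSys) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSys) return;
    cp[s] = c[s] / b[s];                      // forward elimination
    d[s]  = d[s] / b[s];
    for (int i = 1; i < n; ++i) {
        int k = i * nSys + s, km = k - nSys;
        float m = 1.0f / (b[k] - a[k] * cp[km]);
        cp[k] = c[k] * m;
        d[k]  = (d[k] - a[k] * d[km]) * m;
    }
    for (int i = n - 2; i >= 0; --i) {        // back substitution
        int k = i * nSys + s;
        d[k] -= cp[k] * d[k + nSys];          // solution overwrites d
    }
}

int main() {
    const int n = 128, nSys = 1 << 14, N = n * nSys;
    float *a, *b, *c, *d, *cp;
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    cudaMallocManaged(&d, N * sizeof(float));
    cudaMalloc(&cp, N * sizeof(float));
    for (int k = 0; k < N; ++k) {             // diagonally dominant systems
        a[k] = -1.0f; b[k] = 4.0f; c[k] = -1.0f; d[k] = 2.0f;
    }
    thomasBatched<<<(nSys + 255) / 256, 256>>>(a, b, c, d, cp, n, nSys);
    cudaDeviceSynchronize();
    printf("x[mid] of system 0 = %f\n", d[(n / 2) * nSys]);  // close to 1.0
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(d); cudaFree(cp);
    return 0;
}
```

Production codes typically use vendor-library batched solvers instead of a hand-rolled kernel, but the layout lesson is the same: group element i of every system contiguously so the memory traffic stays coalesced.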
Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU
Directory of Open Access Journals (Sweden)
Zhou Zhe
2017-04-01
Full Text Available According to previous reports, information could be leaked from GPU memory; however, the security implications of such a threat were mostly overlooked, because only limited information could be indirectly extracted through side-channel attacks. In this paper, we propose a novel algorithm for recovering raw data directly from the GPU memory residues of many popular applications such as Google Chrome and Adobe PDF Reader. Our algorithm enables harvesting highly sensitive information, including credit card numbers and email contents, from GPU memory residues. Evaluation results also indicate that nearly all GPU-accelerated applications are vulnerable to such attacks, and adversaries can launch attacks without requiring any special privileges, both on traditional multi-user operating systems and in emerging cloud computing scenarios.
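The root cause can be demonstrated within a single process in a few lines (an illustrative sketch; the paper's attacks recover residues across process and user boundaries): cudaMalloc does not scrub memory, so a new allocation can return the previous owner's bytes.

```cuda
// Demonstrates that freed GPU memory is not scrubbed (single-process sketch).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 16;
    char secret[N] = "4111-1111-1111";  // stand-in for sensitive data
    char *d1, *d2;
    char out[N] = {0};
    cudaMalloc(&d1, N);
    cudaMemcpy(d1, secret, N, cudaMemcpyHostToDevice);  // "victim" writes...
    cudaFree(d1);                                       // ...and frees, no scrub
    cudaMalloc(&d2, N);                // "attacker" allocation; the runtime
                                       // often hands back the just-freed block
    cudaMemcpy(out, d2, N, cudaMemcpyDeviceToHost);     // read the residue
    printf("residue: %.14s\n", out);   // frequently prints the secret
    cudaFree(d2);
    return 0;
}
```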
Papaya Tree Detection with UAV Images Using a GPU-Accelerated Scale-Space Filtering Method
Directory of Open Access Journals (Sweden)
Hao Jiang
2017-07-01
Full Text Available The use of unmanned aerial vehicles (UAVs) can allow individual tree detection for forest inventories in a cost-effective way. The scale-space filtering (SSF) algorithm is commonly used and has the capability of detecting trees of different crown sizes. In this study, we made two improvements with regard to the existing method and implementations. First, we incorporated SSF with a Lab color transformation to reduce over-detection problems associated with the original luminance image. Second, we ported four of the most time-consuming processes to the graphics processing unit (GPU) to improve computational efficiency. The proposed method was implemented using PyCUDA, which enabled access to NVIDIA's compute unified device architecture (CUDA) through high-level scripting of the Python language. Our experiments were conducted using two images captured by the DJI Phantom 3 Professional and a recent NVIDIA GTX 1080 GPU. The resulting accuracy was high, with an F-measure larger than 0.94. The speedup achieved by our parallel implementation was 44.77 and 28.54 for the first and second test images, respectively. For each 4000 × 3000 image, the total runtime was less than 1 s, which was sufficient for real-time performance and interactive applications.
Multi–GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL
Directory of Open Access Journals (Sweden)
Jan Masek
2016-06-01
Full Text Available Using modern Graphics Processing Units (GPUs) has become very useful for computing complex and time-consuming processes. GPUs provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm. The work compares the performance of the OpenCL and CUDA implementations, each of which is suitable for a different number of attributes. The proposed CUDA algorithm achieves an acceleration of up to 880x in comparison with a single-threaded CPU version. The common k-NN was modified to be faster when a lower number of k neighbors is set. The performance of the algorithm was verified with two dual-GPU NVIDIA GeForce GTX 690 cards and an Intel Core i7 3770 CPU running at 4.1 GHz. Speedups were measured for one, two, three and four GPUs. We performed several tests with data sets containing up to 4 million elements with various numbers of attributes.
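The core of a brute-force GPU k-NN is a distance kernel followed by a selection step; a minimal CUDA sketch (not the paper's implementation, which also covers OpenCL and multi-GPU partitioning) is:

```cuda
// Brute-force k-NN sketch: one thread per reference point computes the
// squared distance to the query; Thrust then sorts distances to select k.
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

#define DIM 8

__global__ void distances(const float* ref, const float* q, float* dist, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s = 0.0f;
    for (int d = 0; d < DIM; ++d) {
        float t = ref[i * DIM + d] - q[d];
        s += t * t;
    }
    dist[i] = s;
}

int main() {
    const int n = 1 << 18, k = 5;
    thrust::host_vector<float> hRef(n * DIM), hQ(DIM, 0.5f);
    for (int i = 0; i < n * DIM; ++i)
        hRef[i] = (i * 0.61803f) - (int)(i * 0.61803f);   // pseudo-random data
    thrust::device_vector<float> ref = hRef, q = hQ, dist(n);
    thrust::device_vector<int> idx(n);
    thrust::sequence(idx.begin(), idx.end());

    distances<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(ref.data()),
        thrust::raw_pointer_cast(q.data()),
        thrust::raw_pointer_cast(dist.data()), n);
    // Full sort for brevity; a k-selection (partial sort) is faster in
    // practice, which is the kind of low-k shortcut the paper exploits.
    thrust::sort_by_key(dist.begin(), dist.end(), idx.begin());
    for (int i = 0; i < k; ++i)
        printf("neighbor %d: index %d, dist2 %f\n", i, (int)idx[i], (float)dist[i]);
    return 0;
}
```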
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters
Energy Technology Data Exchange (ETDEWEB)
Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL
2010-01-01
Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously on a sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling
Directory of Open Access Journals (Sweden)
Roger V Hoang
2013-10-01
Full Text Available Computational neuroscience is an emerging field that provides unique opportunities to study complex brain structures through realistic neural simulations. However, as biological details are added to models, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs) are now being utilized to accelerate simulations due to their ability to perform computations in parallel. As such, they have shown significant improvement in execution time compared to Central Processing Units (CPUs). Most neural simulators utilize either multiple CPUs or a single GPU for better performance, but still show limitations in execution time when biological details are not sacrificed. Therefore, we present a novel CPU/GPU simulation environment for large-scale biological networks, the NeoCortical Simulator version 6 (NCS6). NCS6 is a free, open-source, parallelizable, and scalable simulator, designed to run on clusters of multiple machines, potentially with high performance computing devices in each of them. It has built-in leaky-integrate-and-fire (LIF) and Izhikevich (IZH) neuron models, but users also have the capability to design their own plug-in interface for different neuron types as desired. NCS6 is currently able to simulate one million cells and 100 million synapses in quasi real time by distributing data across these heterogeneous clusters of CPUs and GPUs.
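The built-in LIF model maps naturally onto one-thread-per-neuron updates; a minimal sketch of such a kernel (illustrative only, not NCS6 code, with assumed membrane constants) is:

```cuda
// One forward-Euler LIF step per neuron per thread (sketch, not NCS6 code).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lifStep(float* v, const float* iSyn, unsigned char* fired,
                        int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float tau = 20.0f, vRest = -65.0f, vTh = -50.0f, vReset = -70.0f;
    float vi = v[i];
    vi += (dt / tau) * (-(vi - vRest) + iSyn[i]);  // leaky integration
    fired[i] = 0;
    if (vi >= vTh) { vi = vReset; fired[i] = 1; }  // spike, then reset
    v[i] = vi;
}

int main() {
    const int n = 1 << 10;
    float *v, *iSyn; unsigned char* fired;
    cudaMallocManaged(&v, n * sizeof(float));
    cudaMallocManaged(&iSyn, n * sizeof(float));
    cudaMallocManaged(&fired, n);
    for (int i = 0; i < n; ++i) { v[i] = -65.0f; iSyn[i] = 20.0f; } // constant drive
    int spikes = 0;
    for (int t = 0; t < 1000; ++t) {               // 1000 steps of 0.1 ms
        lifStep<<<(n + 255) / 256, 256>>>(v, iSyn, fired, n, 0.1f);
        cudaDeviceSynchronize();
        for (int i = 0; i < n; ++i) spikes += fired[i];
    }
    printf("total spikes: %d\n", spikes);
    cudaFree(v); cudaFree(iSyn); cudaFree(fired);
    return 0;
}
```

The expensive part of a full simulator is not this update but the synaptic event delivery between neurons, which is why distributing data across heterogeneous CPU/GPU clusters matters at the scales quoted above.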
GPU-Accelerated Foreground Segmentation and Labeling for Real-Time Video Surveillance
Directory of Open Access Journals (Sweden)
Wei Song
2016-09-01
Full Text Available Real-time and accurate background modeling is an important research topic in the fields of remote monitoring and video surveillance. Meanwhile, effective foreground detection is a preliminary requirement and decision-making basis for sustainable energy management, especially in smart meters. The environment monitoring results provide a decision-making basis for energy-saving strategies. For real-time moving object detection in video, this paper applies parallel computing technology to develop a feedback foreground-background segmentation method and a parallel connected component labeling (PCCL) algorithm. In the background modeling method, pixel-wise color histograms in graphics processing unit (GPU) memory are generated from sequential images. If a pixel color in the current image does not lie near the peaks of its histogram, it is segmented as a foreground pixel. From the foreground segmentation results, the PCCL algorithm clusters the foreground pixels into several groups in order to distinguish separate blobs. Because the noisy spots and sparkle in the foreground segmentation results always contain a small quantity of pixels, small blobs are removed as noise in order to refine the segmentation results. The proposed GPU-based image processing algorithms are implemented using the compute unified device architecture (CUDA) toolkit. The testing results show a significant enhancement in both speed and accuracy.
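The per-pixel histogram test can be sketched as follows (a grayscale simplification of the color-histogram model described, not the authors' code): each thread owns one pixel's histogram, updates it with the incoming frame, and flags the pixel as foreground when its current bin is far below the histogram peak.

```cuda
// Per-pixel histogram background segmentation (grayscale sketch).
#include <cstdio>
#include <cuda_runtime.h>

#define BINS 16

__global__ void histBgSeg(const unsigned char* frame, unsigned short* hist,
                          unsigned char* fg, int nPix, float ratio) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPix) return;
    unsigned short* h = &hist[i * BINS];
    int b = frame[i] * BINS / 256;
    h[b]++;                                        // update this pixel's model
    unsigned short peak = 0;
    for (int k = 0; k < BINS; ++k)
        if (h[k] > peak) peak = h[k];
    // foreground if the current value is rare relative to the histogram peak
    fg[i] = ((float)h[b] < ratio * (float)peak) ? 255 : 0;
}

int main() {
    const int nPix = 320 * 240;
    unsigned char *frame, *fg; unsigned short* hist;
    cudaMallocManaged(&frame, nPix);
    cudaMallocManaged(&fg, nPix);
    cudaMallocManaged(&hist, nPix * BINS * sizeof(unsigned short));
    cudaMemset(hist, 0, nPix * BINS * sizeof(unsigned short));
    for (int t = 0; t < 100; ++t) {                // 100 "background" frames
        for (int i = 0; i < nPix; ++i) frame[i] = 100;
        histBgSeg<<<(nPix + 255) / 256, 256>>>(frame, hist, fg, nPix, 0.3f);
        cudaDeviceSynchronize();
    }
    frame[0] = 220;                                // a moving object appears
    histBgSeg<<<(nPix + 255) / 256, 256>>>(frame, hist, fg, nPix, 0.3f);
    cudaDeviceSynchronize();
    printf("pixel 0: %s\n", fg[0] ? "foreground" : "background");
    cudaFree(frame); cudaFree(fg); cudaFree(hist);
    return 0;
}
```

After segmentation, a connected component labeling pass clusters the flagged pixels into blobs so that small blobs can be discarded as noise, which is the role of the PCCL stage described above.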
Anderson, B. H.; Putt, C. W.; Giamati, C. C.
1981-01-01
Color coding techniques used in the processing of remote sensing imagery were adapted and applied to the fluid dynamics problems associated with turbofan mixer nozzles. The computer generated color graphics were found to be useful in reconstructing the measured flow field from low resolution experimental data to give more physical meaning to this information and in scanning and interpreting the large volume of computer generated data from the three dimensional viscous computer code used in the analysis.
GPU implementations of online track finding algorithms at PANDA
Energy Technology Data Exchange (ETDEWEB)
Herten, Andreas; Stockmanns, Tobias; Ritman, James [Institut fuer Kernphysik, Forschungszentrum Juelich GmbH (Germany); Adinetz, Andrew; Pleiter, Dirk [Juelich Supercomputing Centre, Forschungszentrum Juelich GmbH (Germany); Kraus, Jiri [NVIDIA GmbH (Germany); Collaboration: PANDA-Collaboration
2014-07-01
The PANDA experiment is a hadron physics experiment that will investigate antiproton annihilation in the charm quark mass region. The experiment is now being constructed as one of the main parts of the FAIR facility. At an event rate of 2 × 10⁷/s, a data rate of 200 GB/s is expected. A reduction of three orders of magnitude is required in order to save the data for further offline analysis. Since signal and background processes at PANDA have similar signatures, no hardware-level trigger is foreseen for the experiment. Instead, a fast online event filter substitutes for this element. We investigate the possibility of using...